[00:00:24] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [00:06:56] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1068886 (owner: TrainBranchBot) [00:09:29] FIRING: KubernetesDeploymentUnavailableReplicas: ... [00:09:29] Deployment k8s-controller-sidecars in sidecar-controller at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=sidecar-controller&var-deployment=k8s-controller-sidecars - ... 
[00:09:29] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [00:09:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T371742)', diff saved to https://phabricator.wikimedia.org/P68253 and previous config saved to /var/cache/conftool/dbconfig/20240830-000950-ladsgroup.json [00:09:55] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [00:10:52] ^ KubernetesDeploymentUnavailableReplicas alert is known issue, patch is available to fix [00:13:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T370903)', diff saved to https://phabricator.wikimedia.org/P68254 and previous config saved to /var/cache/conftool/dbconfig/20240830-001331-ladsgroup.json [00:13:34] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2138.codfw.wmnet with reason: Maintenance [00:13:36] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [00:13:47] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2138.codfw.wmnet with reason: Maintenance [00:13:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2138 (T370903)', diff saved to https://phabricator.wikimedia.org/P68255 and previous config saved to /var/cache/conftool/dbconfig/20240830-001353-ladsgroup.json [00:15:13] (CR) Cwhite: [C:+2] site: add insetup configs for new logging-hd hosts [puppet] - https://gerrit.wikimedia.org/r/1062758 (https://phabricator.wikimedia.org/T372511) (owner: Cwhite) [00:20:30] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 918.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:20:40] (PS4) Krinkle: Do not log failed autocreations on closed wikis as diagnostic errors [mediawiki-config] - https://gerrit.wikimedia.org/r/1068879 (https://phabricator.wikimedia.org/T373650) (owner: Zabe) [00:20:47] (CR) Krinkle: [C:+1] Do not log failed autocreations on closed wikis as diagnostic errors [mediawiki-config] - https://gerrit.wikimedia.org/r/1068879 (https://phabricator.wikimedia.org/T373650) (owner: Zabe) [00:22:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138 (T370903)', diff saved to https://phabricator.wikimedia.org/P68258 and previous config saved to /var/cache/conftool/dbconfig/20240830-002239-ladsgroup.json [00:22:44] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [00:23:36] (PS1) Scott French: sre.switchdc.mediawiki: migrate to the class API [cookbooks] - https://gerrit.wikimedia.org/r/1068896 [00:23:36] (PS1) Scott French: sre.switchdc.mediawiki: add --task-id argument [cookbooks] - https://gerrit.wikimedia.org/r/1068897 [00:23:36] (PS1) Scott French: sre.switchdc.mediawiki: use admin reason in puppet disable [cookbooks] - https://gerrit.wikimedia.org/r/1068898 [00:23:36] (PS1) Scott French: sre.switchdc.mediawiki: record RO start/end in task [cookbooks] - https://gerrit.wikimedia.org/r/1068899 [00:24:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P68259 and previous config saved to /var/cache/conftool/dbconfig/20240830-002457-ladsgroup.json [00:25:30] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 815.6ms - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:37:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138', diff saved to https://phabricator.wikimedia.org/P68260 and previous config saved to /var/cache/conftool/dbconfig/20240830-003746-ladsgroup.json [00:40:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P68261 and previous config saved to /var/cache/conftool/dbconfig/20240830-004004-ladsgroup.json [00:49:41] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [00:52:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138', diff saved to https://phabricator.wikimedia.org/P68262 and previous config saved to /var/cache/conftool/dbconfig/20240830-005254-ladsgroup.json [00:55:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T371742)', diff saved to https://phabricator.wikimedia.org/P68263 and previous config saved to /var/cache/conftool/dbconfig/20240830-005512-ladsgroup.json [00:55:14] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2213.codfw.wmnet with reason: Maintenance [00:55:16] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [00:55:27] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2213.codfw.wmnet with reason: Maintenance 
[00:55:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2213 (T371742)', diff saved to https://phabricator.wikimedia.org/P68264 and previous config saved to /var/cache/conftool/dbconfig/20240830-005534-ladsgroup.json [01:08:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138 (T370903)', diff saved to https://phabricator.wikimedia.org/P68265 and previous config saved to /var/cache/conftool/dbconfig/20240830-010801-ladsgroup.json [01:08:03] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2148.codfw.wmnet with reason: Maintenance [01:08:06] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [01:08:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2148.codfw.wmnet with reason: Maintenance [01:08:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2148 (T370903)', diff saved to https://phabricator.wikimedia.org/P68266 and previous config saved to /var/cache/conftool/dbconfig/20240830-010823-ladsgroup.json [01:16:27] SRE, Data-Persistence, serviceops, Datacenter-Switchover: Migrate sre.switchdc.mediawiki to spicerack class API - https://phabricator.wikimedia.org/T328908#10105107 (Scott_French) a:Scott_French [01:17:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T370903)', diff saved to https://phabricator.wikimedia.org/P68267 and previous config saved to /var/cache/conftool/dbconfig/20240830-011721-ladsgroup.json [01:17:27] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [01:20:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T371742)', diff saved to https://phabricator.wikimedia.org/P68268 and previous config saved to 
/var/cache/conftool/dbconfig/20240830-012044-ladsgroup.json [01:20:49] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [01:32:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P68269 and previous config saved to /var/cache/conftool/dbconfig/20240830-013229-ladsgroup.json [01:35:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P68270 and previous config saved to /var/cache/conftool/dbconfig/20240830-013551-ladsgroup.json [01:47:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P68271 and previous config saved to /var/cache/conftool/dbconfig/20240830-014736-ladsgroup.json [01:50:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P68272 and previous config saved to /var/cache/conftool/dbconfig/20240830-015059-ladsgroup.json [02:02:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T370903)', diff saved to https://phabricator.wikimedia.org/P68273 and previous config saved to /var/cache/conftool/dbconfig/20240830-020243-ladsgroup.json [02:02:45] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2175.codfw.wmnet with reason: Maintenance [02:02:48] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [02:02:59] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2175.codfw.wmnet with reason: Maintenance [02:03:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2175 (T370903)', diff saved to https://phabricator.wikimedia.org/P68274 and 
previous config saved to /var/cache/conftool/dbconfig/20240830-020305-ladsgroup.json [02:06:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T371742)', diff saved to https://phabricator.wikimedia.org/P68275 and previous config saved to /var/cache/conftool/dbconfig/20240830-020606-ladsgroup.json [02:06:11] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [02:12:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T370903)', diff saved to https://phabricator.wikimedia.org/P68276 and previous config saved to /var/cache/conftool/dbconfig/20240830-021225-ladsgroup.json [02:12:30] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [02:18:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [02:27:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P68277 and previous config saved to /var/cache/conftool/dbconfig/20240830-022732-ladsgroup.json [02:36:28] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:42:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P68278 and previous config saved to /var/cache/conftool/dbconfig/20240830-024239-ladsgroup.json [02:57:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 
(T370903)', diff saved to https://phabricator.wikimedia.org/P68279 and previous config saved to /var/cache/conftool/dbconfig/20240830-025747-ladsgroup.json [02:57:49] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2189.codfw.wmnet with reason: Maintenance [02:57:55] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [02:58:02] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2189.codfw.wmnet with reason: Maintenance [02:58:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2189 (T370903)', diff saved to https://phabricator.wikimedia.org/P68280 and previous config saved to /var/cache/conftool/dbconfig/20240830-025809-ladsgroup.json [03:01:28] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:06:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T370903)', diff saved to https://phabricator.wikimedia.org/P68281 and previous config saved to /var/cache/conftool/dbconfig/20240830-030602-ladsgroup.json [03:06:07] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [03:21:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P68282 and previous config saved to /var/cache/conftool/dbconfig/20240830-032109-ladsgroup.json [03:26:30] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 6.25% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:31:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:36:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P68283 and previous config saved to /var/cache/conftool/dbconfig/20240830-033616-ladsgroup.json [03:51:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T370903)', diff saved to https://phabricator.wikimedia.org/P68284 and previous config saved to /var/cache/conftool/dbconfig/20240830-035123-ladsgroup.json [03:51:26] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2197.codfw.wmnet with reason: Maintenance [03:51:28] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [03:51:38] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2197.codfw.wmnet with reason: Maintenance [04:00:24] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [04:00:36] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on 
db2207.codfw.wmnet with reason: Maintenance [04:00:49] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2207.codfw.wmnet with reason: Maintenance [04:00:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2207 (T370903)', diff saved to https://phabricator.wikimedia.org/P68285 and previous config saved to /var/cache/conftool/dbconfig/20240830-040055-ladsgroup.json [04:01:00] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [04:09:29] FIRING: KubernetesDeploymentUnavailableReplicas: ... [04:09:29] Deployment k8s-controller-sidecars in sidecar-controller at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=sidecar-controller&var-deployment=k8s-controller-sidecars - ... 
[04:09:29] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [04:09:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T370903)', diff saved to https://phabricator.wikimedia.org/P68286 and previous config saved to /var/cache/conftool/dbconfig/20240830-040957-ladsgroup.json [04:10:03] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [04:25:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P68287 and previous config saved to /var/cache/conftool/dbconfig/20240830-042505-ladsgroup.json [04:40:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P68288 and previous config saved to /var/cache/conftool/dbconfig/20240830-044012-ladsgroup.json [04:49:41] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [04:55:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T370903)', diff saved to https://phabricator.wikimedia.org/P68289 and previous config saved to /var/cache/conftool/dbconfig/20240830-045519-ladsgroup.json [04:55:25] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [05:34:40] !log kcvelaga@deploy1003 Started deploy [airflow-dags/analytics_product@0321fda]: (no justification provided) [05:35:13] !log kcvelaga@deploy1003 Finished deploy [airflow-dags/analytics_product@0321fda]: (no justification provided) (duration: 00m 32s) [05:45:59] PROBLEM - Postgres Replication Lag on 
puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 28533472 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [05:46:59] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 48056 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240830T0600) [06:04:36] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:36] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:18:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [06:34:51] (Abandoned) Arnaudb: debug: printing results when return object count > 1 [software/conftool] - https://gerrit.wikimedia.org/r/971437 (https://phabricator.wikimedia.org/T350656) (owner: Arnaudb) [06:40:59] (CR) Slavina Stefanova: [C:+1] P:toolforge::bastion: Re-install joe [puppet] - https://gerrit.wikimedia.org/r/1059451 (https://phabricator.wikimedia.org/T371556) (owner: Majavah) [06:55:15] (CR) JMeybohm: [C:+1] "Nice find, thanks!" 
[deployment-charts] - https://gerrit.wikimedia.org/r/1068869 (https://phabricator.wikimedia.org/T362978) (owner: Scott French) [06:56:26] (CR) JMeybohm: [C:+2] global_config: Add pki::multirootca IPs to external-services [puppet] - https://gerrit.wikimedia.org/r/1068754 (https://phabricator.wikimedia.org/T337928) (owner: JMeybohm) [06:59:42] (CR) JMeybohm: [C:+1] sre.k8s.pool-depool-node: Check calico and fix phab [cookbooks] - https://gerrit.wikimedia.org/r/1068007 (owner: Clément Goubert) [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240830T0700) [07:08:53] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [07:09:00] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [07:09:01] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [07:09:24] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [07:09:25] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [07:09:36] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [07:09:38] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [07:10:08] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [07:10:10] !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [07:10:22] !log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [07:10:24] !log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [07:10:56] !log jayme@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [07:10:57] !log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [07:11:30] !log jayme@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. 
[07:11:31] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [07:11:46] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [07:11:47] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [07:11:57] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [07:22:27] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 52965 [07:22:53] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 52965 [07:24:06] (PS1) Marostegui: installserver: Do not format db2234, db2236, db2237 [puppet] - https://gerrit.wikimedia.org/r/1069106 [07:27:04] (CR) Marostegui: [C:+2] installserver: Do not format db2234, db2236, db2237 [puppet] - https://gerrit.wikimedia.org/r/1069106 (owner: Marostegui) [07:32:12] (PS2) Alexandros Kosiaris: Rename mw229[567] to wikikube-worker205[234] [puppet] - https://gerrit.wikimedia.org/r/1068833 (https://phabricator.wikimedia.org/T372878) [07:35:10] (Abandoned) Alexandros Kosiaris: Rename mw229[567] to wikikube-worker205[123] [puppet] - https://gerrit.wikimedia.org/r/1068824 (https://phabricator.wikimedia.org/T372878) (owner: Alexandros Kosiaris) [07:35:24] (CR) Alexandros Kosiaris: [C:+2] Rename mw229[567] to wikikube-worker205[234] [puppet] - https://gerrit.wikimedia.org/r/1068833 (https://phabricator.wikimedia.org/T372878) (owner: Alexandros Kosiaris) [07:36:39] !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2295 to wikikube-worker2052 [07:36:56] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [07:37:57] (CR) Alexandros Kosiaris: [C:+1] k8s-controller-sidecars: adopt securityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1068869 (https://phabricator.wikimedia.org/T362978) (owner: Scott French) [07:39:51] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:41:27] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:42:11] (PS1) Slyngshede: data.yaml: Extend expiry date for account. [puppet] - https://gerrit.wikimedia.org/r/1069113 [07:45:30] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:45:33] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:46:20] (CR) Slyngshede: [C:+2] data.yaml Update email address. 
[puppet] - https://gerrit.wikimedia.org/r/1068673 (owner: Slyngshede) [07:48:31] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 4.582s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:50:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:53:31] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 4.589s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:54:40] (PS1) Slyngshede: cloudweb2002-dev: Add dummy secrets from IDP on cloudweb2002-dev. [labs/private] - https://gerrit.wikimedia.org/r/1069114 [07:56:59] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:00:18] (PS2) Slyngshede: cloudweb2002-dev: Add dummy secrets for IDP on cloudweb2002-dev. 
[labs/private] - https://gerrit.wikimedia.org/r/1069114 [08:00:24] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:04:01] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 525, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:05:31] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:07:07] PROBLEM - SSH on wdqs2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:09:29] FIRING: KubernetesDeploymentUnavailableReplicas: ... [08:09:29] Deployment k8s-controller-sidecars in sidecar-controller at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=sidecar-controller&var-deployment=k8s-controller-sidecars - ... 
[08:09:29] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [08:18:49] PROBLEM - SSH on wdqs1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:21:22] (PS3) Slyngshede: R:codfw1dev:cloudweb [puppet] - https://gerrit.wikimedia.org/r/1068786 [08:21:40] (CR) CI reject: [V:-1] R:codfw1dev:cloudweb [puppet] - https://gerrit.wikimedia.org/r/1068786 (owner: Slyngshede) [08:23:12] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2295 to wikikube-worker2052 - akosiaris@cumin1002" [08:23:42] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2295 to wikikube-worker2052 - akosiaris@cumin1002" [08:23:42] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:23:43] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2052 [08:24:09] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2052 [08:24:38] (PS2) Klausman: manifests: move new GPU hosts in eqiad from insetup to worker role [puppet] - https://gerrit.wikimedia.org/r/1068657 (https://phabricator.wikimedia.org/T372432) [08:24:47] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2295 to wikikube-worker2052 [08:25:03] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10105411 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from mw2295 to... 
[08:26:16] !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2296 to wikikube-worker2053 [08:26:37] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [08:27:40] (CR) Klausman: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3788/co" [puppet] - https://gerrit.wikimedia.org/r/1068657 (https://phabricator.wikimedia.org/T372432) (owner: Klausman) [08:28:23] (PS2) Dreamy Jazz: Remove wgCheckUserPurgeOldClientHintsData [mediawiki-config] - https://gerrit.wikimedia.org/r/1064458 (https://phabricator.wikimedia.org/T359560) [08:28:40] FIRING: SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:29:03] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1064458 (https://phabricator.wikimedia.org/T359560) (owner: Dreamy Jazz) [08:29:31] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:29:54] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:33:40] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:35:38] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2296 to wikikube-worker2053 - akosiaris@cumin1002" [08:36:00] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2296 to wikikube-worker2053 - akosiaris@cumin1002" [08:36:00] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:36:00] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2053 [08:36:16] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2053 [08:36:55] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2296 to wikikube-worker2053 [08:37:04] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10105431 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from mw2296 to wikikube-worker2053 c... [08:37:12] !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2297 to wikikube-worker2054 [08:37:29] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [08:42:04] (03CR) 10Elukey: [C:03+1] data.yaml: Extend expiry date for account. [puppet] - 10https://gerrit.wikimedia.org/r/1069113 (owner: 10Slyngshede) [08:42:28] (03CR) 10Elukey: [C:03+1] cloudweb2002-dev: Add dummy secrets for IDP on cloudweb2002-dev. 
[labs/private] - 10https://gerrit.wikimedia.org/r/1069114 (owner: 10Slyngshede) [08:43:40] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:43:49] (03PS1) 10Tiziano Fogli: ripeatlas: move measurements checks to prom/alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/1069117 (https://phabricator.wikimedia.org/T370506) [08:46:10] (03CR) 10Elukey: [C:03+2] services: update Thumbor's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068819 (https://phabricator.wikimedia.org/T373618) (owner: 10Elukey) [08:47:42] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@3d18901] (releasing): (no justification provided) [08:48:02] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@3d18901] (releasing): (no justification provided) (duration: 00m 20s) [08:48:40] FIRING: [4x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:48:41] (03PS3) 10Arnaudb: mysql: replication lag monitoring threshold and severity change [alerts] - 10https://gerrit.wikimedia.org/r/1053689 (https://phabricator.wikimedia.org/T367278) [08:49:06] (03CR) 10Arnaudb: mysql: replication lag monitoring threshold and severity change (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1053689 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb) [08:49:28] (03PS2) 10Tiziano Fogli: ripeatlas: move measurements checks to prom/alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/1069117 (https://phabricator.wikimedia.org/T370506) [08:49:41] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 
- https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:49:47] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1069117 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [08:50:05] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@3d18901] (releasing): (no justification provided) [08:50:21] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: sync [08:50:24] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: sync [08:50:46] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@3d18901] (releasing): (no justification provided) (duration: 00m 41s) [08:51:10] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:52:43] (03CR) 10CI reject: [V:04-1] ripeatlas: move measurements checks to prom/alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/1069117 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [08:52:55] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2297 to wikikube-worker2054 - akosiaris@cumin1002" [08:53:40] RESOLVED: [4x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:55:11] FIRING: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [08:55:25] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:55:28] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2297 to wikikube-worker2054 - akosiaris@cumin1002" [08:55:28] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:55:29] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2054 [08:55:39] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2054 [08:56:18] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2297 to wikikube-worker2054 [08:56:26] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10105470 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from mw2297 to wikikube-worker2054 c... 
[08:58:03] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3789/co" [puppet] - 10https://gerrit.wikimedia.org/r/1068657 (https://phabricator.wikimedia.org/T372432) (owner: 10Klausman) [08:58:40] FIRING: [4x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:59:20] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2052.codfw.wmnet with OS bullseye [08:59:30] !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2052 [08:59:31] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10105475 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host wikikube-worker2052.co... 
[08:59:34] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [09:00:11] RESOLVED: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [09:00:28] RECOVERY - SSH on wdqs2024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:02:23] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2053.codfw.wmnet with OS bullseye [09:02:34] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10105492 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host wikikube-worker2053.co... [09:02:42] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2052 - akosiaris@cumin1002" [09:02:46] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2052 - akosiaris@cumin1002" [09:02:47] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:02:47] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2052.codfw.wmnet 165.0.192.10.in-addr.arpa 5.6.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:02:50] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2052.codfw.wmnet 165.0.192.10.in-addr.arpa 5.6.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors 
[09:02:50] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2052 [09:03:02] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2052 [09:03:03] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2052 [09:03:03] !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2053 [09:03:13] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [09:03:38] PROBLEM - SSH on wdqs2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:03:40] FIRING: [4x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:04:00] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2054.codfw.wmnet with OS bullseye [09:04:09] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10105509 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host wikikube-worker2054.co... 
[09:05:20] PROBLEM - Host mw2295 is DOWN: PING CRITICAL - Packet loss = 100% [09:05:48] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:06:08] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:06:08] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:06:23] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2053 - akosiaris@cumin1002" [09:06:27] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2053 - akosiaris@cumin1002" [09:06:27] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:06:27] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2053.codfw.wmnet 166.0.192.10.in-addr.arpa 6.6.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:06:30] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2053.codfw.wmnet 166.0.192.10.in-addr.arpa 6.6.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:06:31] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2053 [09:06:42] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2053 [09:06:42] !log akosiaris@cumin1002 END (PASS) - Cookbook 
sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2053 [09:07:11] !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2054 [09:07:17] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [09:07:30] RECOVERY - SSH on wdqs2024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:08:40] FIRING: [5x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:10:23] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2054 - akosiaris@cumin1002" [09:10:28] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2054 - akosiaris@cumin1002" [09:10:28] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:10:28] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2054.codfw.wmnet 167.0.192.10.in-addr.arpa 7.6.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:10:31] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2054.codfw.wmnet 167.0.192.10.in-addr.arpa 7.6.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:10:32] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2054 [09:10:42] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2054 [09:10:42] !log akosiaris@cumin1002 END (PASS) - Cookbook 
sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2054 [09:12:04] (03CR) 10JMeybohm: sre.k8s.renumber-node: vlan, IP change k8s workers (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1067989 (owner: 10Clément Goubert) [09:12:44] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3790/console" [puppet] - 10https://gerrit.wikimedia.org/r/1068657 (https://phabricator.wikimedia.org/T372432) (owner: 10Klausman) [09:13:40] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:13:42] PROBLEM - SSH on wdqs2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:18:40] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:19:30] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2052.codfw.wmnet with reason: host reimage [09:21:21] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: sync [09:22:39] RECOVERY - SSH on wdqs2024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:22:44] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2052.codfw.wmnet with reason: host reimage [09:22:55] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2053.codfw.wmnet with reason: host reimage [09:23:09] RECOVERY - BFD status on cr1-esams is OK: UP: 5 AdminDown: 0 
Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:23:40] RESOLVED: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:25:31] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2053.codfw.wmnet with reason: host reimage [09:27:02] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2054.codfw.wmnet with reason: host reimage [09:27:02] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [09:31:46] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2054.codfw.wmnet with reason: host reimage [09:33:45] (03CR) 10JMeybohm: [C:03+2] Update cfss-issuer charts to v0.4.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068028 (https://phabricator.wikimedia.org/T337928) (owner: 10JMeybohm) [09:34:18] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3791/co" [puppet] - 10https://gerrit.wikimedia.org/r/1068657 (https://phabricator.wikimedia.org/T372432) (owner: 10Klausman) [09:34:48] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:35:02] (03CR) 10JMeybohm: [C:03+1] jaeger: add securityContext configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068034 (https://phabricator.wikimedia.org/T369491) (owner: 10Elukey) [09:37:24] (03Merged) 10jenkins-bot: Update cfss-issuer charts to v0.4.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068028 (https://phabricator.wikimedia.org/T337928) (owner: 10JMeybohm) [09:38:03] (03CR) 
10JMeybohm: [C:03+2] "It does. We're using that for generic http probes (in service::catalog) already." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066718 (https://phabricator.wikimedia.org/T373192) (owner: 10JMeybohm) [09:39:06] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: sync [09:39:13] (03Merged) 10jenkins-bot: eventgate: Offer readinessProbe that does not test kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066718 (https://phabricator.wikimedia.org/T373192) (owner: 10JMeybohm) [09:42:26] (03PS1) 10Elukey: services: update Proton's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069123 (https://phabricator.wikimedia.org/T373665) [09:42:36] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:42:45] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2052.codfw.wmnet with OS bullseye [09:42:53] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10105549 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wikikube-worker2052.codfw.... [09:43:13] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[09:43:34] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [09:44:34] (03CR) 10Elukey: [C:03+2] services: update Proton's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069123 (https://phabricator.wikimedia.org/T373665) (owner: 10Elukey) [09:44:36] RECOVERY - SSH on wdqs1021 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:45:36] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2053.codfw.wmnet with OS bullseye [09:45:48] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10105552 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wikikube-worker2053.codfw.... [09:47:18] (03CR) 10Clément Goubert: sre.k8s.renumber-node: vlan, IP change k8s workers (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1067989 (owner: 10Clément Goubert) [09:48:01] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 2 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10105554 (10Jelto) [09:48:14] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 2 others: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10105555 (10Jelto) [09:48:14] (03CR) 10Clément Goubert: [C:03+2] sre.k8s.pool-depool-node: Check calico and fix phab [cookbooks] - 10https://gerrit.wikimedia.org/r/1068007 (owner: 10Clément Goubert) [09:48:24] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 2 others: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10105557 (10Jelto) [09:48:32] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 2 others: Migrate servers in codfw racks C4 & C5 from 
asw to lsw - https://phabricator.wikimedia.org/T373097#10105559 (10Jelto) [09:48:40] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:48:45] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 2 others: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096#10105560 (10Jelto) [09:49:44] (03PS3) 10Tiziano Fogli: ripeatlas: move measurements checks to prom/alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/1069117 (https://phabricator.wikimedia.org/T370506) [09:51:03] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2054.codfw.wmnet with OS bullseye [09:51:13] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10105563 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wikikube-worker2054.codfw.... 
[09:52:34] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T373666 (10zoe) 03NEW [09:52:54] (03CR) 10CI reject: [V:04-1] ripeatlas: move measurements checks to prom/alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/1069117 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [09:52:55] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for zoe - https://phabricator.wikimedia.org/T373666#10105577 (10zoe) [09:55:40] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/proton: sync [09:56:34] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: sync [09:56:47] (03PS4) 10Tiziano Fogli: ripeatlas: move measurements checks to prom/alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/1069117 (https://phabricator.wikimedia.org/T370506) [09:58:15] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/proton: sync [09:59:23] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: sync [10:01:09] (03Merged) 10jenkins-bot: sre.k8s.pool-depool-node: Check calico and fix phab [cookbooks] - 10https://gerrit.wikimedia.org/r/1068007 (owner: 10Clément Goubert) [10:01:31] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1069117 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [10:04:22] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: sync [10:06:08] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: sync [10:10:30] (03PS5) 10Tiziano Fogli: ripeatlas: move measurements checks to prom/alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/1069117 (https://phabricator.wikimedia.org/T370506) [10:11:53] (03CR) 10Brouberol: [C:03+1] "Looks good, and the service exists" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068768 (https://phabricator.wikimedia.org/T359423) (owner: 10JMeybohm) [10:15:07] (03CR) 10Slyngshede: [C:03+2] 
data.yaml: Extend expiry date for account. [puppet] - 10https://gerrit.wikimedia.org/r/1069113 (owner: 10Slyngshede) [10:16:06] (03CR) 10Slyngshede: [V:03+2 C:03+2] cloudweb2002-dev: Add dummy secrets for IDP on cloudweb2002-dev. [labs/private] - 10https://gerrit.wikimedia.org/r/1069114 (owner: 10Slyngshede) [10:18:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [10:19:57] (03PS4) 10Slyngshede: R:codfw1dev:cloudweb [puppet] - 10https://gerrit.wikimedia.org/r/1068786 [10:25:00] (03CR) 10Jaime Nuche: "Following up on this. In the end we changed the `jenkins-deploy` deployment repo so that in the future only the change in puppet is necess" [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [10:27:02] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2052.codfw.wmnet [10:27:02] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2052.codfw.wmnet [10:27:05] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2053.codfw.wmnet [10:27:05] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2053.codfw.wmnet [10:27:08] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2054.codfw.wmnet [10:27:09] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2054.codfw.wmnet [10:28:16] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes mw2295,mw2296,mw2297 - https://phabricator.wikimedia.org/T373669 (10akosiaris) 03NEW [10:31:29] (03PS5) 10Slyngshede: 
R:codfw1dev:cloudweb [puppet] - 10https://gerrit.wikimedia.org/r/1068786 [10:32:16] (03PS6) 10Tiziano Fogli: ripeatlas: move measurements checks to prom/alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/1069117 (https://phabricator.wikimedia.org/T370506) [10:32:59] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1069117 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [10:34:30] (03PS6) 10Slyngshede: R:codfw1dev:cloudweb [puppet] - 10https://gerrit.wikimedia.org/r/1068786 [10:35:24] (03CR) 10CI reject: [V:04-1] ripeatlas: move measurements checks to prom/alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/1069117 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [10:38:26] (03PS7) 10Tiziano Fogli: ripeatlas: move measurements checks to prom/alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/1069117 (https://phabricator.wikimedia.org/T370506) [10:39:24] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: EX4600 does not support class-of-service 'port scheduling' - https://phabricator.wikimedia.org/T373594#10105707 (10cmooney) 05Open→03Resolved Updated config is applied on asw2-ulsfo since yesterday and not showing signs of problems.... [10:39:53] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Move sretest2002 primary uplink to asw-d4-codfw - https://phabricator.wikimedia.org/T370475#10105731 (10cmooney) This all is working fine thank you @Jhancock.wm Unless there is an issue I'll leave this task open for tidy-up when we a... 
[10:44:00] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2381.codfw.wmnet [10:44:32] !log restart swift-proxy on ms-fe2009 and ms-fe2014 T360913 [10:44:34] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2381.codfw.wmnet [10:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:36] T360913: Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913 [10:45:43] (03PS1) 10Hnowlan: k8s: rename mw2381 to wikikube-worker2055 [puppet] - 10https://gerrit.wikimedia.org/r/1069144 (https://phabricator.wikimedia.org/T372878) [10:52:31] (03PS9) 10Ladsgroup: mediawiki: Add schema file and test for tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1068817 (https://phabricator.wikimedia.org/T363581) [10:52:35] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mediawiki: Add schema file and test for tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1068817 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [10:53:42] (03PS1) 10Jgiannelos: changeprop: Update references to latest beta restbase node [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069145 (https://phabricator.wikimedia.org/T370460) [10:56:10] (03PS1) 10Jgiannelos: Update references to latest beta restbase node [puppet] - 10https://gerrit.wikimedia.org/r/1069148 (https://phabricator.wikimedia.org/T370460) [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240830T0700) [11:00:05] eoghan, jelto, arnoldokoth, and mutante: GitLab version upgrades (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240830T1100). Please do the needful. 
[11:00:46] (03Abandoned) 10Ladsgroup: [DNM] Test the table schema [puppet] - 10https://gerrit.wikimedia.org/r/1068818 (owner: 10Ladsgroup) [11:03:14] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1169.eqiad.wmnet with reason: Maintenance [11:03:27] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1169.eqiad.wmnet with reason: Maintenance [11:03:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T370903)', diff saved to https://phabricator.wikimedia.org/P68290 and previous config saved to /var/cache/conftool/dbconfig/20240830-110334-ladsgroup.json [11:03:39] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [11:03:49] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance [11:04:02] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance [11:04:04] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:04:19] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:04:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T371742)', diff saved to https://phabricator.wikimedia.org/P68291 and previous config saved to /var/cache/conftool/dbconfig/20240830-110426-ladsgroup.json [11:04:31] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [11:11:44] (03CR) 10Clément Goubert: [C:03+1] k8s: rename mw2381 to wikikube-worker2055 [puppet] - 
10https://gerrit.wikimedia.org/r/1069144 (https://phabricator.wikimedia.org/T372878) (owner: 10Hnowlan) [11:22:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T370903)', diff saved to https://phabricator.wikimedia.org/P68292 and previous config saved to /var/cache/conftool/dbconfig/20240830-112159-ladsgroup.json [11:22:04] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [11:22:39] (03CR) 10Hnowlan: [C:03+2] k8s: rename mw2381 to wikikube-worker2055 [puppet] - 10https://gerrit.wikimedia.org/r/1069144 (https://phabricator.wikimedia.org/T372878) (owner: 10Hnowlan) [11:24:55] (03PS5) 10Clément Goubert: interactive: Ring the bell by default in ask_input [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1069136 [11:24:55] (03CR) 10Clément Goubert: "CI failure seems unrelated" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1069136 (owner: 10Clément Goubert) [11:28:47] !log hnowlan@cumin2002 START - Cookbook sre.hosts.rename from mw2381 to wikikube-worker2055 [11:29:06] !log hnowlan@cumin2002 START - Cookbook sre.dns.netbox [11:34:14] !log hnowlan@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2381 to wikikube-worker2055 - hnowlan@cumin2002" [11:35:08] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2381 to wikikube-worker2055 - hnowlan@cumin2002" [11:35:09] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:35:10] !log hnowlan@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2055 [11:35:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T371742)', diff saved to https://phabricator.wikimedia.org/P68293 and previous 
config saved to /var/cache/conftool/dbconfig/20240830-113544-ladsgroup.json [11:35:52] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [11:37:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P68294 and previous config saved to /var/cache/conftool/dbconfig/20240830-113706-ladsgroup.json [11:39:42] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2055 [11:40:23] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2381 to wikikube-worker2055 [11:40:29] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106002 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin2002 from mw2381 to wikikube-worker2055 com... [11:41:59] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2055.codfw.wmnet with OS bullseye [11:42:09] !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2055 [11:42:36] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106009 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-worker2055.codf... 
[11:42:52] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [11:46:21] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2055 - hnowlan@cumin1002" [11:46:25] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2055 - hnowlan@cumin1002" [11:46:25] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:46:25] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2055.codfw.wmnet 44.0.192.10.in-addr.arpa 4.4.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:46:28] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2055.codfw.wmnet 44.0.192.10.in-addr.arpa 4.4.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:46:29] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2055 [11:46:44] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2055 [11:46:44] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2055 [11:50:00] (03PS1) 10Hnowlan: k8s: rename mw2382 to wikikube-worker2056 [puppet] - 10https://gerrit.wikimedia.org/r/1069151 (https://phabricator.wikimedia.org/T372878) [11:50:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P68295 and previous config saved to /var/cache/conftool/dbconfig/20240830-115052-ladsgroup.json [11:52:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P68296 and 
previous config saved to /var/cache/conftool/dbconfig/20240830-115213-ladsgroup.json [11:52:48] (03CR) 10Clément Goubert: [C:03+1] k8s: rename mw2382 to wikikube-worker2056 [puppet] - 10https://gerrit.wikimedia.org/r/1069151 (https://phabricator.wikimedia.org/T372878) (owner: 10Hnowlan) [11:55:27] (03CR) 10Hnowlan: [C:03+2] k8s: rename mw2382 to wikikube-worker2056 [puppet] - 10https://gerrit.wikimedia.org/r/1069151 (https://phabricator.wikimedia.org/T372878) (owner: 10Hnowlan) [11:56:50] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2382.codfw.wmnet [11:57:28] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2382.codfw.wmnet [11:59:16] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 435, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:00:48] (03CR) 10Slyngshede: [V:03+2 C:03+2] Fix syntax error [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1067988 (owner: 10Slyngshede) [12:00:51] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2382 to wikikube-worker2056 [12:01:08] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [12:02:55] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2055.codfw.wmnet with reason: host reimage [12:04:39] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2382 to wikikube-worker2056 - hnowlan@cumin1002" [12:06:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P68297 and previous config saved to /var/cache/conftool/dbconfig/20240830-120559-ladsgroup.json [12:06:11] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2055.codfw.wmnet with reason: host reimage [12:07:21] !log 
ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T370903)', diff saved to https://phabricator.wikimedia.org/P68298 and previous config saved to /var/cache/conftool/dbconfig/20240830-120720-ladsgroup.json [12:07:23] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1184.eqiad.wmnet with reason: Maintenance [12:07:25] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [12:07:36] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1184.eqiad.wmnet with reason: Maintenance [12:07:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1184 (T370903)', diff saved to https://phabricator.wikimedia.org/P68299 and previous config saved to /var/cache/conftool/dbconfig/20240830-120742-ladsgroup.json [12:08:43] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2382 to wikikube-worker2056 - hnowlan@cumin1002" [12:08:44] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:08:44] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2056 [12:09:19] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2056 [12:09:29] FIRING: KubernetesDeploymentUnavailableReplicas: ... [12:09:29] Deployment k8s-controller-sidecars in sidecar-controller at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=sidecar-controller&var-deployment=k8s-controller-sidecars - ... 
[12:09:29] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [12:09:57] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2382 to wikikube-worker2056 [12:10:07] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106087 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2382 to w... [12:11:50] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2055.codfw.wmnet on all recursors [12:11:53] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2055.codfw.wmnet on all recursors [12:12:20] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2056.codfw.wmnet on all recursors [12:12:23] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2056.codfw.wmnet on all recursors [12:13:00] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2056.codfw.wmnet with OS bullseye [12:13:09] !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2056 [12:13:17] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106095 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi... 
[12:13:17] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [12:13:32] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 517, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:15:39] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2377.codfw.wmnet [12:16:17] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2377.codfw.wmnet [12:16:21] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2378.codfw.wmnet [12:17:02] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2056 - hnowlan@cumin1002" [12:17:02] (03PS4) 10Arnaudb: mysql: replication lag monitoring threshold and severity change [alerts] - 10https://gerrit.wikimedia.org/r/1053689 (https://phabricator.wikimedia.org/T367278) [12:17:03] (03CR) 10Arnaudb: "thanks for the feedback, hopefully this PS covers everything" [alerts] - 10https://gerrit.wikimedia.org/r/1053689 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb) [12:17:06] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2056 - hnowlan@cumin1002" [12:17:07] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:17:07] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2056.codfw.wmnet 45.0.192.10.in-addr.arpa 5.4.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:17:10] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2056.codfw.wmnet 45.0.192.10.in-addr.arpa 5.4.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:17:10] !log hnowlan@cumin1002 
START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2056 [12:17:26] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2056 [12:17:26] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2056 [12:19:20] (03PS1) 10Alexandros Kosiaris: Rename mw237[789] to wikikube-worker205[789] [puppet] - 10https://gerrit.wikimedia.org/r/1069164 (https://phabricator.wikimedia.org/T372878) [12:19:31] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2378.codfw.wmnet [12:19:35] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2379.codfw.wmnet [12:20:09] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2379.codfw.wmnet [12:20:18] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:20:30] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:21:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T371742)', diff saved to https://phabricator.wikimedia.org/P68300 and previous config saved to /var/cache/conftool/dbconfig/20240830-122106-ladsgroup.json [12:21:09] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance [12:21:11] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [12:21:32] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1170.eqiad.wmnet with 
reason: Maintenance [12:21:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T371742)', diff saved to https://phabricator.wikimedia.org/P68301 and previous config saved to /var/cache/conftool/dbconfig/20240830-122139-ladsgroup.json [12:24:44] !log homer 'lsw1-a3-codfw*' commit [12:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:31] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2055.codfw.wmnet with OS bullseye [12:25:45] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106126 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku... [12:26:34] PROBLEM - Host mw2296 is DOWN: PING CRITICAL - Packet loss = 100% [12:27:46] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2055.codfw.wmnet [12:27:48] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2055.codfw.wmnet [12:27:52] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:28:59] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373591#10106131 (10hnowlan) [12:29:54] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:31:30] (03PS1) 10Slyngshede: 
P:idp Reallow CAS 6.6 to be installed. [puppet] - 10https://gerrit.wikimedia.org/r/1069165 [12:32:38] (03PS7) 10Slyngshede: R:codfw1dev:cloudweb [puppet] - 10https://gerrit.wikimedia.org/r/1068786 [12:33:41] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3795/co" [puppet] - 10https://gerrit.wikimedia.org/r/1068786 (owner: 10Slyngshede) [12:33:56] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2056.codfw.wmnet with reason: host reimage [12:35:51] (03CR) 10JMeybohm: sre.k8s.renumber-node: vlan, IP change k8s workers (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1067989 (owner: 10Clément Goubert) [12:36:44] (03PS8) 10Slyngshede: R:codfw1dev:cloudweb Add CAS IDP installation. [puppet] - 10https://gerrit.wikimedia.org/r/1068786 [12:37:39] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2056.codfw.wmnet with reason: host reimage [12:39:02] (03CR) 10Andrew Bogott: [C:03+1] P:idp Reallow CAS 6.6 to be installed. [puppet] - 10https://gerrit.wikimedia.org/r/1069165 (owner: 10Slyngshede) [12:39:34] (03PS9) 10Slyngshede: R:codfw1dev:cloudweb Add CAS IDP installation. 
[puppet] - 10https://gerrit.wikimedia.org/r/1068786 [12:40:31] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3796/co" [puppet] - 10https://gerrit.wikimedia.org/r/1069165 (owner: 10Slyngshede) [12:41:29] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for zoe - https://phabricator.wikimedia.org/T373666#10106174 (10ssingh) [12:41:46] (03PS1) 10Stevemunene: Update airflow-test-k8s image to include authlib [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069166 (https://phabricator.wikimedia.org/T368760) [12:42:07] (03CR) 10Cathal Mooney: [C:03+1] "LGTM as long as traffic are happy with it!" [puppet] - 10https://gerrit.wikimedia.org/r/1006063 (https://phabricator.wikimedia.org/T358260) (owner: 10Cathal Mooney) [12:42:35] (03CR) 10Alexandros Kosiaris: [C:03+2] Rename mw237[789] to wikikube-worker205[789] [puppet] - 10https://gerrit.wikimedia.org/r/1069164 (https://phabricator.wikimedia.org/T372878) (owner: 10Alexandros Kosiaris) [12:42:45] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for zoe - https://phabricator.wikimedia.org/T373666#10106173 (10ssingh) @thcipriani: This requires your approval as well, in addition to @VPuffetMichel. Thanks! [12:46:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T370903)', diff saved to https://phabricator.wikimedia.org/P68302 and previous config saved to /var/cache/conftool/dbconfig/20240830-124617-ladsgroup.json [12:46:22] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [12:47:55] (03PS10) 10Andrew Bogott: R:codfw1dev:cloudweb Add CAS IDP installation. 
[puppet] - 10https://gerrit.wikimedia.org/r/1068786 (owner: 10Slyngshede) [12:47:57] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1068786 (owner: 10Slyngshede) [12:49:41] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [12:52:00] (03PS2) 10Slyngshede: P:idp Reallow CAS 6.6 to be installed. [puppet] - 10https://gerrit.wikimedia.org/r/1069165 [12:52:50] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3797/console" [puppet] - 10https://gerrit.wikimedia.org/r/1069165 (owner: 10Slyngshede) [12:53:58] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 36, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:54:27] (03PS11) 10Andrew Bogott: R:codfw1dev:cloudweb Add CAS IDP installation. [puppet] - 10https://gerrit.wikimedia.org/r/1068786 (owner: 10Slyngshede) [12:56:56] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2056.codfw.wmnet with OS bullseye [12:57:07] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106208 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku... 
[12:58:01] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3798/console" [puppet] - 10https://gerrit.wikimedia.org/r/1068786 (owner: 10Slyngshede) [12:59:15] !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2377 to wikikube-worker2057 [12:59:32] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [13:00:35] (03PS1) 10Ssingh: admin: add zoe to deployment (move from ldap_only_users) [puppet] - 10https://gerrit.wikimedia.org/r/1069175 (https://phabricator.wikimedia.org/T373666) [13:01:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P68303 and previous config saved to /var/cache/conftool/dbconfig/20240830-130124-ladsgroup.json [13:02:48] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2377 to wikikube-worker2057 - akosiaris@cumin1002" [13:04:05] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2377 to wikikube-worker2057 - akosiaris@cumin1002" [13:04:05] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:04:06] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2057 [13:04:17] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2057 [13:04:56] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2377 to wikikube-worker2057 [13:05:08] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106215 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from mw2377 to... [13:08:24] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: puppetserver1002 thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527#10106218 (10elukey) After checking the JVM's [[ https://grafana-rw.wikimedia.org/d/e0f6afe3-2aea-483d-9f5e-55f0cba9207f/puppetserver?orgId=1&... [13:13:31] (03PS1) 10Elukey: profile::puppetserver: set java_start_mem to 40g [puppet] - 10https://gerrit.wikimedia.org/r/1069185 (https://phabricator.wikimedia.org/T373527) [13:14:31] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3799/co" [puppet] - 10https://gerrit.wikimedia.org/r/1069185 (https://phabricator.wikimedia.org/T373527) (owner: 10Elukey) [13:15:44] (03PS1) 10JMeybohm: Make k8s/pool-depool-node work on control-planes as well [cookbooks] - 10https://gerrit.wikimedia.org/r/1069186 (https://phabricator.wikimedia.org/T372878) [13:16:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P68304 and previous config saved to /var/cache/conftool/dbconfig/20240830-131631-ladsgroup.json [13:18:36] (03PS2) 10JMeybohm: Make k8s/pool-depool-node work on control-planes as well [cookbooks] - 10https://gerrit.wikimedia.org/r/1069186 (https://phabricator.wikimedia.org/T372878) [13:21:28] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-ctrl2003.codfw.wmnet [13:21:28] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-ctrl2003.codfw.wmnet [13:21:34] !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2378 to wikikube-worker2058 [13:21:51] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [13:26:15] (03PS3) 10JMeybohm: Make k8s/pool-depool-node 
work on control-planes as well [cookbooks] - 10https://gerrit.wikimedia.org/r/1069186 (https://phabricator.wikimedia.org/T372878) [13:26:41] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-ctrl2003.codfw.wmnet [13:26:41] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-ctrl2003.codfw.wmnet [13:27:04] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker2001.codfw.wmnet [13:27:04] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker2001.codfw.wmnet [13:27:16] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2378 to wikikube-worker2058 - akosiaris@cumin1002" [13:27:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T371742)', diff saved to https://phabricator.wikimedia.org/P68305 and previous config saved to /var/cache/conftool/dbconfig/20240830-132750-ladsgroup.json [13:27:55] (03CR) 10Bking: [C:03+1] manifests: move new GPU hosts in eqiad from insetup to worker role [puppet] - 10https://gerrit.wikimedia.org/r/1068657 (https://phabricator.wikimedia.org/T372432) (owner: 10Klausman) [13:27:55] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [13:31:33] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2378 to wikikube-worker2058 - akosiaris@cumin1002" [13:31:33] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:31:34] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2058 [13:31:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 
(T370903)', diff saved to https://phabricator.wikimedia.org/P68306 and previous config saved to /var/cache/conftool/dbconfig/20240830-133139-ladsgroup.json [13:31:41] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1186.eqiad.wmnet with reason: Maintenance [13:31:45] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [13:31:54] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1186.eqiad.wmnet with reason: Maintenance [13:32:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1186 (T370903)', diff saved to https://phabricator.wikimedia.org/P68307 and previous config saved to /var/cache/conftool/dbconfig/20240830-133201-ladsgroup.json [13:32:38] (03CR) 10Ssingh: "Sounds like it's worth a shot, let me know if you want to merge today 😄" [puppet] - 10https://gerrit.wikimedia.org/r/1069185 (https://phabricator.wikimedia.org/T373527) (owner: 10Elukey) [13:32:51] (03CR) 10Ssingh: [C:03+1] profile::puppetserver: set java_start_mem to 40g [puppet] - 10https://gerrit.wikimedia.org/r/1069185 (https://phabricator.wikimedia.org/T373527) (owner: 10Elukey) [13:33:01] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2058 [13:33:40] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2378 to wikikube-worker2058 [13:34:08] !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2379 to wikikube-worker2059 [13:34:25] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [13:35:42] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106249 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from mw2378 to... 
[13:35:55] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-ctrl2003.codfw.wmnet [13:35:57] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-ctrl2003.codfw.wmnet [13:37:26] 06SRE, 06Anti-Harassment, 06DBA: Error Unknown column ipb_sitewide in field list on query - https://phabricator.wikimedia.org/T208462#10106248 (10Lafeber) I upgraded to 1.42 from a very early version and got the same error. I ran the manual SQL that @DonPaolo mentioned (thank you!) and I presume it was... [13:38:20] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2003.codfw.wmnet with OS bullseye [13:38:30] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106259 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki... [13:38:40] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl2003.codfw.wmnet with OS bullseye [13:38:52] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106264 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube... 
[13:40:23] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl2003.codfw.wmnet [13:40:25] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-ctrl2003.codfw.wmnet [13:41:54] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl2001.codfw.wmnet [13:41:56] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-ctrl2001.codfw.wmnet [13:42:12] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2379 to wikikube-worker2059 - akosiaris@cumin1002" [13:42:55] (03CR) 10Klausman: [V:03+1 C:03+2] manifests: move new GPU hosts in eqiad from insetup to worker role [puppet] - 10https://gerrit.wikimedia.org/r/1068657 (https://phabricator.wikimedia.org/T372432) (owner: 10Klausman) [13:42:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P68308 and previous config saved to /var/cache/conftool/dbconfig/20240830-134257-ladsgroup.json [13:43:01] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-ctrl2001.codfw.wmnet [13:43:03] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-ctrl2001.codfw.wmnet [13:43:37] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2001.codfw.wmnet with OS bullseye [13:43:46] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl2001.codfw.wmnet with OS bullseye [13:43:48] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106271 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was 
started by jayme@cumin1002 for host wiki... [13:43:58] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106272 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube... [13:45:21] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2379 to wikikube-worker2059 - akosiaris@cumin1002" [13:45:21] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:45:22] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2059 [13:45:33] !log jayme@cumin1002 START - Cookbook sre.hosts.remove-downtime for wikikube-ctrl2001.codfw.wmnet [13:45:33] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-ctrl2001.codfw.wmnet [13:45:38] !log jayme@cumin1002 START - Cookbook sre.hosts.remove-downtime for wikikube-ctrl2003.codfw.wmnet [13:45:38] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-ctrl2003.codfw.wmnet [13:45:40] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2059 [13:46:19] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2379 to wikikube-worker2059 [13:46:31] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106283 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from mw2379 to... 
[13:46:44] (03CR) 10Brouberol: [C:03+1] Update airflow-test-k8s image to include authlib [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069166 (https://phabricator.wikimedia.org/T368760) (owner: 10Stevemunene) [13:48:30] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:49:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T370903)', diff saved to https://phabricator.wikimedia.org/P68309 and previous config saved to /var/cache/conftool/dbconfig/20240830-134954-ladsgroup.json [13:49:59] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [13:51:40] FIRING: SystemdUnitFailed: dragonfly-dfdaemon.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:52:08] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:idp Reallow CAS 6.6 to be installed. [puppet] - 10https://gerrit.wikimedia.org/r/1069165 (owner: 10Slyngshede) [13:52:33] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2057.codfw.wmnet with OS bullseye [13:52:47] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106318 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host... 
[13:52:47] !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2057 [13:53:04] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2058.codfw.wmnet with OS bullseye [13:53:18] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106321 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host... [13:53:30] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:53:33] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2059.codfw.wmnet with OS bullseye [13:53:36] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [13:53:37] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2059.codfw.wmnet with OS bullseye [13:53:40] PROBLEM - BGP status on lsw1-f5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:53:44] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:53:47] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106324 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host... 
[13:53:48] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106325 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wiki... [13:54:34] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52482 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:55:35] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2059.codfw.wmnet with OS bullseye [13:55:54] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106330 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host... [13:56:40] RESOLVED: SystemdUnitFailed: dragonfly-dfdaemon.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:56:47] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2057 - akosiaris@cumin1002" [13:56:52] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2057 - akosiaris@cumin1002" [13:56:52] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:56:52] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2057.codfw.wmnet 40.0.192.10.in-addr.arpa 0.4.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors 
[13:56:55] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2057.codfw.wmnet 40.0.192.10.in-addr.arpa 0.4.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:56:56] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2057 [13:58:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P68310 and previous config saved to /var/cache/conftool/dbconfig/20240830-135804-ladsgroup.json [13:58:10] (03CR) 10Tiziano Fogli: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3800/co" [puppet] - 10https://gerrit.wikimedia.org/r/1069117 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [13:58:24] (03CR) 10Elukey: WIP: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [13:58:37] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2057 [13:58:37] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2057 [13:58:45] !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2058 [13:59:04] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:59:24] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3801/console" [puppet] - 10https://gerrit.wikimedia.org/r/1068786 (owner: 10Slyngshede) [13:59:55] 
FIRING: KubernetesRsyslogDown: rsyslog on dse-k8s-worker1009:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-worker1009 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:02:22] (03PS12) 10Elukey: WIP: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) [14:03:02] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [14:04:17] (03PS15) 10Clément Goubert: sre.k8s.renumber-node: vlan, IP change k8s workers [cookbooks] - 10https://gerrit.wikimedia.org/r/1067989 [14:05:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P68311 and previous config saved to /var/cache/conftool/dbconfig/20240830-140501-ladsgroup.json [14:05:51] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2056.codfw.wmnet [14:05:52] (03CR) 10Clément Goubert: sre.k8s.renumber-node: vlan, IP change k8s workers (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1067989 (owner: 10Clément Goubert) [14:05:53] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2056.codfw.wmnet [14:06:16] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2058 - akosiaris@cumin1002" [14:06:21] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2058 - akosiaris@cumin1002" [14:06:21] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:06:21] !log akosiaris@cumin1002 START - Cookbook 
sre.dns.wipe-cache wikikube-worker2058.codfw.wmnet 41.0.192.10.in-addr.arpa 1.4.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:06:24] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2058.codfw.wmnet 41.0.192.10.in-addr.arpa 1.4.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:06:25] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2058 [14:06:34] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2058 [14:06:34] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2058 [14:07:28] (03CR) 10Slyngshede: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1069175 (https://phabricator.wikimedia.org/T373666) (owner: 10Ssingh) [14:11:48] (03PS1) 10Hnowlan: k8s: rename mw238[345] to wikikube-worker206[012] [puppet] - 10https://gerrit.wikimedia.org/r/1069214 (https://phabricator.wikimedia.org/T372878) [14:11:50] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2059.codfw.wmnet with reason: host reimage [14:13:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T371742)', diff saved to https://phabricator.wikimedia.org/P68312 and previous config saved to /var/cache/conftool/dbconfig/20240830-141311-ladsgroup.json [14:13:13] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [14:13:16] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [14:13:26] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [14:13:40] FIRING: SystemdUnitFailed: 
dragonfly-dfdaemon.service on ml-serve1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:14:41] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2059.codfw.wmnet with reason: host reimage [14:15:06] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2057.codfw.wmnet with reason: host reimage [14:16:33] (03CR) 10CI reject: [V:04-1] WIP: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [14:17:16] (03CR) 10Clément Goubert: [C:03+1] k8s: rename mw238[345] to wikikube-worker206[012] [puppet] - 10https://gerrit.wikimedia.org/r/1069214 (https://phabricator.wikimedia.org/T372878) (owner: 10Hnowlan) [14:17:17] (03PS1) 10Andrew Bogott: Fake secrets for idp redirect on cloudcontrols [labs/private] - 10https://gerrit.wikimedia.org/r/1069217 [14:17:36] (03PS2) 10Andrew Bogott: Fake secrets for idp redirect on cloudcontrols [labs/private] - 10https://gerrit.wikimedia.org/r/1069217 [14:18:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:18:28] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2057.codfw.wmnet with reason: host reimage [14:18:42] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] Fake secrets for idp redirect on cloudcontrols [labs/private] - 10https://gerrit.wikimedia.org/r/1069217 (owner: 10Andrew Bogott) [14:19:55] FIRING: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1009:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:20:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P68313 and previous config saved to /var/cache/conftool/dbconfig/20240830-142008-ladsgroup.json [14:22:17] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2058.codfw.wmnet with reason: host reimage [14:23:30] (03PS2) 10Clément Goubert: k8s: rename mw238[345] to wikikube-worker206[012] [puppet] - 10https://gerrit.wikimedia.org/r/1069214 (https://phabricator.wikimedia.org/T372878) (owner: 10Hnowlan) [14:23:30] (03PS1) 10Clément Goubert: kubernetes: Rename last appserver in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1069223 (https://phabricator.wikimedia.org/T351074) [14:24:24] (03PS2) 10Elukey: dhcp: allow empty distro for DHCPConfMac and DHCPConfOpt82 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060854 (https://phabricator.wikimedia.org/T365372) [14:24:24] (03PS2) 10Elukey: doc: add intersphinx_timeout [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060855 (https://phabricator.wikimedia.org/T367410) [14:24:24] (03PS1) 10Elukey: tox: add config for jenkins [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069224 (https://phabricator.wikimedia.org/T372485) [14:24:37] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2058.codfw.wmnet with reason: host reimage [14:24:55] RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1009:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:25:55] FIRING: KubernetesRsyslogDown: rsyslog on ml-serve1009:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=ml-serve1009 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:26:24] (03CR) 10Hnowlan: [C:03+1] kubernetes: Rename last appserver in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1069223 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [14:26:30] (03PS2) 10Nik Gkountas: admin: add new ssh key for ngkountas [puppet] - 10https://gerrit.wikimedia.org/r/1065216 (https://phabricator.wikimedia.org/T371372) [14:27:17] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 429, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:28:00] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2383.codfw.wmnet [14:28:39] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2383.codfw.wmnet [14:28:45] (03PS1) 10Klausman: BGP peers: add lsw1-e5-eqiad and lsw1-f5-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069225 (https://phabricator.wikimedia.org/T372432) [14:29:38] (03CR) 10Clément Goubert: [C:03+1] Make k8s/pool-depool-node work on control-planes as well [cookbooks] - 10https://gerrit.wikimedia.org/r/1069186 (https://phabricator.wikimedia.org/T372878) (owner: 10JMeybohm) [14:30:09] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2384.codfw.wmnet [14:30:42] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2384.codfw.wmnet [14:30:56] (03CR) 10Nik Gkountas: admin: add new ssh key for ngkountas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1065216 (https://phabricator.wikimedia.org/T371372) (owner: 10Nik Gkountas) [14:31:08] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2385.codfw.wmnet [14:31:41] !log hnowlan@cumin1002 END (PASS) - Cookbook 
sre.k8s.pool-depool-node (exit_code=0) depool for host mw2385.codfw.wmnet [14:32:09] (03PS2) 10Andrew Bogott: keystone + oidc [puppet] - 10https://gerrit.wikimedia.org/r/1068877 [14:33:10] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: ngkountas user has same SSH key for cloud/prod - https://phabricator.wikimedia.org/T371372#10106462 (10ngkountas) @ssingh sorry for the repeated mistake. I uploaded a different one that is not listed in `idm.wikimedia.org` keys. Thank you! [14:33:58] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2059.codfw.wmnet with OS bullseye [14:34:14] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106463 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wiki... [14:35:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T370903)', diff saved to https://phabricator.wikimedia.org/P68314 and previous config saved to /var/cache/conftool/dbconfig/20240830-143516-ladsgroup.json [14:35:18] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1195.eqiad.wmnet with reason: Maintenance [14:35:21] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [14:35:24] (03CR) 10Ssingh: [C:03+2] admin: add new ssh key for ngkountas [puppet] - 10https://gerrit.wikimedia.org/r/1065216 (https://phabricator.wikimedia.org/T371372) (owner: 10Nik Gkountas) [14:35:31] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1195.eqiad.wmnet with reason: Maintenance [14:35:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1195 (T370903)', diff saved to https://phabricator.wikimedia.org/P68315 and previous config saved 
to /var/cache/conftool/dbconfig/20240830-143537-ladsgroup.json [14:36:12] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 40, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:36:22] (03PS3) 10Andrew Bogott: keystone + oidc [puppet] - 10https://gerrit.wikimedia.org/r/1068877 (https://phabricator.wikimedia.org/T359590) [14:36:28] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:46] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1068877 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [14:37:15] (03CR) 10Elukey: dhcp: allow empty distro for DHCPConfMac and DHCPConfOpt82 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060854 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [14:37:56] (03CR) 10Clément Goubert: [C:03+2] kubernetes: Rename last appserver in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1069223 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [14:37:59] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2057.codfw.wmnet with OS bullseye [14:38:06] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 511, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:38:11] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106472 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wiki... 
[14:38:24] (03CR) 10Clément Goubert: [C:03+2] k8s: rename mw238[345] to wikikube-worker206[012] [puppet] - 10https://gerrit.wikimedia.org/r/1069214 (https://phabricator.wikimedia.org/T372878) (owner: 10Hnowlan) [14:40:12] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:40:24] (03CR) 10Elukey: "One small nit!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069225 (https://phabricator.wikimedia.org/T372432) (owner: 10Klausman) [14:40:45] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2299 to wikikube-worker2063 [14:40:53] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2383 to wikikube-worker2060 [14:40:55] RESOLVED: KubernetesRsyslogDown: rsyslog on ml-serve1009:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=ml-serve1009 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:41:01] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [14:41:30] (03CR) 10Elukey: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060854 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [14:41:45] (03PS2) 10Klausman: BGP peers: add lsw1-e5-eqiad and lsw1-f5-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069225 (https://phabricator.wikimedia.org/T372432) [14:41:50] (03CR) 10Klausman: BGP peers: add lsw1-e5-eqiad and lsw1-f5-eqiad (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069225 (https://phabricator.wikimedia.org/T372432) (owner: 10Klausman) [14:42:12] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 40, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:42:52] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: 
ngkountas user has same SSH key for cloud/prod - https://phabricator.wikimedia.org/T371372#10106475 (10ssingh) 05Open→03Resolved a:03ssingh Thanks @ngkountas; key updated. [14:43:40] RESOLVED: SystemdUnitFailed: dragonfly-dfdaemon.service on ml-serve1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:43:45] (03CR) 10Bking: [C:03+2] Update airflow-test-k8s image to include authlib [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069166 (https://phabricator.wikimedia.org/T368760) (owner: 10Stevemunene) [14:44:35] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2384 to wikikube-worker2061 [14:44:37] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2385 to wikikube-worker2062 [14:44:42] (03Merged) 10jenkins-bot: Update airflow-test-k8s image to include authlib [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069166 (https://phabricator.wikimedia.org/T368760) (owner: 10Stevemunene) [14:44:44] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2058.codfw.wmnet with OS bullseye [14:45:02] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106482 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wiki... 
[14:45:03] (03PS1) 10Hashar: tox: only install flake8 when running flake8 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069226 (https://phabricator.wikimedia.org/T372485) [14:46:09] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [14:46:27] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2299 to wikikube-worker2063 - cgoubert@cumin1002" [14:47:34] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2299 to wikikube-worker2063 - cgoubert@cumin1002" [14:47:35] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:47:43] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2063 [14:47:56] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2063 [14:48:04] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2299 to wikikube-worker2063 [14:48:16] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106484 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2299 to... [14:48:58] (03CR) 10Hashar: "That one largely speeds up tox flake8 environments :) The same should be done for `style` and `format`." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1069226 (https://phabricator.wikimedia.org/T372485) (owner: 10Hashar) [14:49:11] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2063.codfw.wmnet with OS bullseye [14:49:21] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2063 [14:49:27] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106485 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... [14:49:33] (03PS1) 10Hashar: tox: run less environments on CI [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069220 (https://phabricator.wikimedia.org/T372485) [14:49:35] (03CR) 10Scott French: "Thanks for the review, all!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068869 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [14:49:37] (03CR) 10Scott French: [C:03+2] k8s-controller-sidecars: adopt securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068869 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [14:50:19] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2385 to wikikube-worker2062 - hnowlan@cumin1002" [14:50:24] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2385 to wikikube-worker2062 - hnowlan@cumin1002" [14:50:24] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:50:25] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2062 [14:50:32] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [14:50:47] !log 
hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2062 [14:51:26] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2385 to wikikube-worker2062 [14:51:42] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106489 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2385 to w... [14:53:08] (03Merged) 10jenkins-bot: k8s-controller-sidecars: adopt securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068869 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [14:53:43] (03CR) 10Elukey: [C:03+1] BGP peers: add lsw1-e5-eqiad and lsw1-f5-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069225 (https://phabricator.wikimedia.org/T372432) (owner: 10Klausman) [14:53:58] (03CR) 10Klausman: [C:03+2] BGP peers: add lsw1-e5-eqiad and lsw1-f5-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069225 (https://phabricator.wikimedia.org/T372432) (owner: 10Klausman) [14:54:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T370903)', diff saved to https://phabricator.wikimedia.org/P68316 and previous config saved to /var/cache/conftool/dbconfig/20240830-145442-ladsgroup.json [14:54:48] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [14:55:23] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for zoe - https://phabricator.wikimedia.org/T373666#10106494 (10thcipriani) >>! In T373666#10106172, @ssingh wrote: > @thcipriani: This requires your approval as well, in addition to @VPuffetMichel. Thanks! Approved. More Cito... 
[14:55:43] (03CR) 10Slyngshede: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069224 (https://phabricator.wikimedia.org/T372485) (owner: 10Elukey) [14:56:06] !log swfrench@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:56:17] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for zoe - https://phabricator.wikimedia.org/T373666#10106495 (10ssingh) [14:56:24] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [14:57:31] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2384 to wikikube-worker2061 - hnowlan@cumin1002" [14:57:34] (03Merged) 10jenkins-bot: BGP peers: add lsw1-e5-eqiad and lsw1-f5-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069225 (https://phabricator.wikimedia.org/T372432) (owner: 10Klausman) [14:57:35] !log swfrench@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:57:36] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2384 to wikikube-worker2061 - hnowlan@cumin1002" [14:57:36] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:57:37] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2061 [14:57:54] !log swfrench@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [14:58:12] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2061 [14:58:15] (03PS2) 10Elukey: tox: run less environments on CI [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069220 (https://phabricator.wikimedia.org/T372485) (owner: 10Hashar) [14:58:25] !log swfrench@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[14:58:38] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [14:58:42] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:58:42] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2063.codfw.wmnet 169.0.192.10.in-addr.arpa 9.6.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:58:45] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2063.codfw.wmnet 169.0.192.10.in-addr.arpa 9.6.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:58:46] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2063 [14:58:50] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2384 to wikikube-worker2061 [14:58:51] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:59:04] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106500 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2384 to w... 
[14:59:14] (03Abandoned) 10Elukey: tox: add config for jenkins [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069224 (https://phabricator.wikimedia.org/T372485) (owner: 10Elukey) [15:00:26] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [15:00:26] (03PS1) 10Clément Goubert: kubernetes: Rename last appserver in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1069227 (https://phabricator.wikimedia.org/T351074) [15:00:30] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2063 [15:00:30] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2063 [15:01:21] (03PS2) 10Clément Goubert: kubernetes: Rename last appserver in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1069227 (https://phabricator.wikimedia.org/T351074) [15:01:28] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:44] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:02:45] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2060 [15:04:38] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2060 [15:05:18] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2383 to wikikube-worker2060 [15:05:28] (03CR) 10Hnowlan: [C:03+1] kubernetes: Rename last appserver in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1069227 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [15:05:29] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - 
https://phabricator.wikimedia.org/T372878#10106513 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2383 to w... [15:05:38] (03CR) 10Clément Goubert: [C:03+2] kubernetes: Rename last appserver in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1069227 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [15:06:05] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [15:06:52] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:07:13] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2062.codfw.wmnet with OS bullseye [15:07:19] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2061.codfw.wmnet with OS bullseye [15:07:19] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2060.codfw.wmnet with OS bullseye [15:07:21] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:07:23] !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2062 [15:07:24] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2060.codfw.wmnet with OS bullseye [15:07:30] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106518 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi... [15:07:33] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106520 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi... [15:07:33] !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. 
[15:07:34] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106519 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi... [15:07:44] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106521 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku... [15:07:51] !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [15:07:51] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2060.codfw.wmnet with OS bullseye [15:07:59] (03PS1) 10Thcipriani: Admin data matrix: show ldap_only_users, too [puppet] - 10https://gerrit.wikimedia.org/r/1069229 [15:08:01] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106522 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi... [15:08:02] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [15:08:31] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [15:08:47] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1398 to wikikube-worker1033 [15:08:51] RECOVERY - BGP status on lsw1-f5-eqiad.mgmt is OK: BGP OK - up: 4, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:08:59] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... 
[15:08:59] Deployment k8s-controller-sidecars in sidecar-controller at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=sidecar-controller&var-deployment=k8s-controller-sidecars - ... [15:08:59] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [15:09:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P68317 and previous config saved to /var/cache/conftool/dbconfig/20240830-150950-ladsgroup.json [15:10:57] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance [15:11:11] (03CR) 10Elukey: [C:03+2] tox: run less environments on CI [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069220 (https://phabricator.wikimedia.org/T372485) (owner: 10Hashar) [15:11:17] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:11:21] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance [15:11:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T371742)', diff saved to https://phabricator.wikimedia.org/P68318 and previous config saved to /var/cache/conftool/dbconfig/20240830-151128-ladsgroup.json [15:11:33] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [15:11:38] !log klausman@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[15:11:43] !log klausman@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [15:12:17] !log klausman@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:12:37] !log klausman@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:13:07] (03CR) 10Thcipriani: "Adding sukhe since the context of this one is this deployment access request: https://phabricator.wikimedia.org/T373666" [puppet] - 10https://gerrit.wikimedia.org/r/1069229 (owner: 10Thcipriani) [15:13:38] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2062 - hnowlan@cumin1002" [15:13:49] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [15:15:06] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2062 - hnowlan@cumin1002" [15:15:06] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:15:06] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2062.codfw.wmnet 48.0.192.10.in-addr.arpa 8.4.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:15:09] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2062.codfw.wmnet 48.0.192.10.in-addr.arpa 8.4.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:15:13] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2062 [15:15:41] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2062 [15:15:41] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2062 [15:16:09] 10ops-drmrs: determine cable ID for CRT-008647 - 
https://phabricator.wikimedia.org/T369951#10106552 (10RobH) 05Open→03Resolved There was no label so they slapped 'CRT-008647' on there since I had advised that was the ID of the circuit. (I had that note from it being on there potentially during install).... [15:16:15] !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2061 [15:17:03] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1398 to wikikube-worker1033 - cgoubert@cumin1002" [15:17:05] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2063.codfw.wmnet with reason: host reimage [15:17:07] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1398 to wikikube-worker1033 - cgoubert@cumin1002" [15:17:08] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:17:08] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [15:17:08] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1033 [15:17:28] (03PS1) 10Herron: grafana: set thanos as default datasource [puppet] - 10https://gerrit.wikimedia.org/r/1069230 (https://phabricator.wikimedia.org/T371520) [15:18:09] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:18:24] (03PS2) 10Herron: grafana: set thanos as default datasource [puppet] - 10https://gerrit.wikimedia.org/r/1069230 (https://phabricator.wikimedia.org/T269333) [15:18:25] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status 
[15:18:42] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1033 [15:18:51] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1398 to wikikube-worker1033 [15:19:02] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106563 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw1398 to... [15:19:21] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1033.eqiad.wmnet on all recursors [15:19:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1033.eqiad.wmnet on all recursors [15:19:40] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1033.eqiad.wmnet with OS bullseye [15:19:54] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106565 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... 
[15:20:04] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2063.codfw.wmnet with reason: host reimage [15:20:21] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2061 - hnowlan@cumin1002" [15:20:26] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2061 - hnowlan@cumin1002" [15:20:26] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:20:26] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2061.codfw.wmnet 47.0.192.10.in-addr.arpa 7.4.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:20:29] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2061.codfw.wmnet 47.0.192.10.in-addr.arpa 7.4.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:20:29] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2061 [15:21:05] (03CR) 10CI reject: [V:04-1] grafana: set thanos as default datasource [puppet] - 10https://gerrit.wikimedia.org/r/1069230 (https://phabricator.wikimedia.org/T269333) (owner: 10Herron) [15:21:10] PROBLEM - Host mw2385 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:06] 10ops-eqiad, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T373696 (10Clement_Goubert) 03NEW [15:22:15] (03Merged) 10jenkins-bot: tox: run less environments on CI [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069220 (https://phabricator.wikimedia.org/T372485) (owner: 10Hashar) [15:22:23] (03PS3) 10Herron: grafana: set thanos as default datasource [puppet] - 
10https://gerrit.wikimedia.org/r/1069230 (https://phabricator.wikimedia.org/T269333) [15:22:23] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2061 [15:22:23] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2061 [15:23:28] (03PS1) 10Scott French: kubernetes: re-name/IP kubernetes20(30|57) as wikikube-worker206[45] [puppet] - 10https://gerrit.wikimedia.org/r/1069231 (https://phabricator.wikimedia.org/T372878) [15:23:42] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2060.codfw.wmnet with reason: host reimage [15:23:50] (03CR) 10Elukey: [C:03+2] dhcp: allow empty distro for DHCPConfMac and DHCPConfOpt82 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060854 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [15:23:51] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373591#10106587 (10Clement_Goubert) [15:24:20] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373591#10106591 (10Clement_Goubert) [15:24:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P68319 and previous config saved to /var/cache/conftool/dbconfig/20240830-152457-ladsgroup.json [15:25:36] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes mw2295,mw2296,mw2297 - https://phabricator.wikimedia.org/T373669#10106589 (10Clement_Goubert) →14Duplicate dup:03T373591 [15:26:25] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10106599 (10elukey) Unblocked! 
Next steps: 1) Release a new version of Spicerack to include https://gerrit.wikimed... [15:26:58] (03CR) 10Hnowlan: [C:03+1] kubernetes: re-name/IP kubernetes20(30|57) as wikikube-worker206[45] [puppet] - 10https://gerrit.wikimedia.org/r/1069231 (https://phabricator.wikimedia.org/T372878) (owner: 10Scott French) [15:27:14] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2060.codfw.wmnet with reason: host reimage [15:28:18] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2030.codfw.wmnet [15:28:21] PROBLEM - Host mw2384 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:51] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2030.codfw.wmnet [15:29:07] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2057.codfw.wmnet [15:29:43] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2057.codfw.wmnet [15:30:08] (03CR) 10Scott French: [C:03+2] kubernetes: re-name/IP kubernetes20(30|57) as wikikube-worker206[45] [puppet] - 10https://gerrit.wikimedia.org/r/1069231 (https://phabricator.wikimedia.org/T372878) (owner: 10Scott French) [15:31:51] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2062.codfw.wmnet with reason: host reimage [15:32:54] (03PS3) 10Elukey: dhcp: allow empty distro for DHCPConfMac and DHCPConfOpt82 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060854 (https://phabricator.wikimedia.org/T365372) [15:33:00] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [15:33:01] (03CR) 10Elukey: [V:03+2 C:03+2] dhcp: allow empty distro for DHCPConfMac and DHCPConfOpt82 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060854 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [15:33:04] !log swfrench@cumin2002 START - Cookbook sre.hosts.rename from kubernetes2030 to wikikube-worker2064 [15:33:12] (03CR) 10Ssingh: [C:03+1] "Seems OK in theory but perhaps wait for someone from Infrastructure Foundations to review as well!" [puppet] - 10https://gerrit.wikimedia.org/r/1069229 (owner: 10Thcipriani) [15:33:21] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1033.eqiad.wmnet with reason: host reimage [15:33:35] !log swfrench@cumin2002 START - Cookbook sre.dns.netbox [15:34:15] (03PS3) 10Clément Goubert: sre.k8s.renumber-node: Handle renamed host [cookbooks] - 10https://gerrit.wikimedia.org/r/1068779 [15:35:30] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [15:35:36] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2062.codfw.wmnet with reason: host reimage [15:37:03] !log swfrench@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2030 to wikikube-worker2064 - swfrench@cumin2002" [15:37:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T371742)', diff saved to https://phabricator.wikimedia.org/P68320 and previous config saved to /var/cache/conftool/dbconfig/20240830-153715-ladsgroup.json [15:37:20] 
T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [15:37:36] !log swfrench@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2030 to wikikube-worker2064 - swfrench@cumin2002" [15:37:36] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:37:37] !log swfrench@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2064 [15:37:53] !log swfrench@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2064 [15:38:00] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [15:38:13] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1033.eqiad.wmnet with reason: host reimage [15:38:33] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2030 to wikikube-worker2064 [15:38:40] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2061.codfw.wmnet with reason: host reimage [15:38:43] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106620 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by swfrench@cumin2002 from kubernetes... 
[15:39:18] !log swfrench@cumin2002 START - Cookbook sre.hosts.rename from kubernetes2057 to wikikube-worker2065 [15:39:26] !log swfrench@cumin2002 START - Cookbook sre.dns.netbox [15:40:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T370903)', diff saved to https://phabricator.wikimedia.org/P68322 and previous config saved to /var/cache/conftool/dbconfig/20240830-154004-ladsgroup.json [15:40:07] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1196.eqiad.wmnet with reason: Maintenance [15:40:09] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [15:40:30] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [15:40:30] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1196.eqiad.wmnet with reason: Maintenance [15:40:32] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:40:48] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:40:50] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2063.codfw.wmnet with OS bullseye [15:40:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1196 (T370903)', diff saved to https://phabricator.wikimedia.org/P68323 and previous config saved to /var/cache/conftool/dbconfig/20240830-154054-ladsgroup.json [15:41:00] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: 
Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106625 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... [15:41:35] !log homer 'lsw1-a3-codfw*' commit 'T351074' [15:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:39] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [15:41:43] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2061.codfw.wmnet with reason: host reimage [15:42:54] !log swfrench@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2057 to wikikube-worker2065 - swfrench@cumin2002" [15:43:22] !log swfrench@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2057 to wikikube-worker2065 - swfrench@cumin2002" [15:43:23] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:43:24] !log swfrench@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2065 [15:43:39] !log swfrench@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2065 [15:44:15] (03Merged) 10jenkins-bot: dhcp: allow empty distro for DHCPConfMac and DHCPConfOpt82 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060854 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [15:44:19] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2057 to wikikube-worker2065 [15:44:33] !log swfrench@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2064.codfw.wmnet wikikube-worker2065.codfw.wmnet on all recursors [15:44:35] 06SRE, 
06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106639 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by swfrench@cumin2002 from kubernetes... [15:44:36] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2064.codfw.wmnet wikikube-worker2065.codfw.wmnet on all recursors [15:45:30] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [15:45:36] !log swfrench@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2064.codfw.wmnet with OS bullseye [15:45:47] !log swfrench@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2064 [15:45:49] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106649 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by swfrench@cumin2002 for host w... [15:46:11] !log swfrench@cumin2002 START - Cookbook sre.dns.netbox [15:47:11] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2060.codfw.wmnet with OS bullseye [15:47:28] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106650 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku... 
[15:49:00] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2063.codfw.wmnet [15:49:08] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2063.codfw.wmnet [15:49:33] !log homer 'cr*eqiad*' commit 'T351074' [15:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:38] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [15:49:52] !log swfrench@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2064 - swfrench@cumin2002" [15:49:58] !log swfrench@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2064 - swfrench@cumin2002" [15:49:58] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:49:58] !log swfrench@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2064.codfw.wmnet 211.16.192.10.in-addr.arpa 1.1.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:50:01] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2064.codfw.wmnet 211.16.192.10.in-addr.arpa 1.1.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:50:02] !log swfrench@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2064 [15:50:30] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [15:50:43] !log swfrench@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2064 [15:50:44] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2064 [15:52:07] !log swfrench@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2065.codfw.wmnet with OS bullseye [15:52:18] !log swfrench@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2065 [15:52:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P68325 and previous config saved to /var/cache/conftool/dbconfig/20240830-155222-ladsgroup.json [15:52:23] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106657 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by swfrench@cumin2002 for host w... 
[15:52:57] !log swfrench@cumin2002 START - Cookbook sre.dns.netbox [15:53:47] !log homer 'lsw1-a3-codfw*' commit [15:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:03] hnowlan: probably won't show anything, I just ran it and it had the changes for 61, 62 and 63 [15:55:08] ah cool [15:55:12] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2062.codfw.wmnet with OS bullseye [15:55:14] yeah you're right [15:55:21] (CR) Eevans: [C:+2] Update references to latest beta restbase node [puppet] - https://gerrit.wikimedia.org/r/1069148 (https://phabricator.wikimedia.org/T370460) (owner: Jgiannelos) [15:55:26] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106659 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku... [15:55:28] (CR) Andrea Denisse: "LGTM, I just think there's a small typo in the commit message."
[puppet] - 10https://gerrit.wikimedia.org/r/1069230 (https://phabricator.wikimedia.org/T269333) (owner: 10Herron) [15:56:32] !log swfrench@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2065 - swfrench@cumin2002" [15:56:38] !log swfrench@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2065 - swfrench@cumin2002" [15:56:38] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:56:38] !log swfrench@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2065.codfw.wmnet 235.16.192.10.in-addr.arpa 5.3.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:56:41] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2065.codfw.wmnet 235.16.192.10.in-addr.arpa 5.3.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:56:42] !log swfrench@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2065 [15:56:58] !log swfrench@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2065 [15:56:58] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2065 [15:57:28] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1033.eqiad.wmnet with OS bullseye [15:57:40] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... 
[15:58:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T370903)', diff saved to https://phabricator.wikimedia.org/P68326 and previous config saved to /var/cache/conftool/dbconfig/20240830-155842-ladsgroup.json [15:58:47] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [16:01:17] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2061.codfw.wmnet with OS bullseye [16:01:33] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku... [16:02:39] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 46, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:02:47] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2060.codfw.wmnet [16:02:50] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2060.codfw.wmnet [16:07:18] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2061.codfw.wmnet [16:07:20] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2061.codfw.wmnet [16:07:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P68328 and previous config saved to /var/cache/conftool/dbconfig/20240830-160729-ladsgroup.json [16:07:30] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2062.codfw.wmnet [16:07:31] !log hnowlan@cumin1002 END (PASS) - Cookbook 
sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2062.codfw.wmnet [16:08:28] ops-codfw, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373591#10106681 (hnowlan) [16:09:48] !log swfrench@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2064.codfw.wmnet with reason: host reimage [16:11:30] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 24.96% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:12:30] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2064.codfw.wmnet with reason: host reimage [16:13:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P68329 and previous config saved to /var/cache/conftool/dbconfig/20240830-161349-ladsgroup.json [16:13:59] (CR) Scott French: [C:+1] cfssl-issuer: Add external-services support (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1068768 (https://phabricator.wikimedia.org/T359423) (owner: JMeybohm) [16:15:48] !log swfrench@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2065.codfw.wmnet with reason: host reimage [16:16:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 23.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:19:32] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.downtime
(exit_code=0) for 2:00:00 on wikikube-worker2065.codfw.wmnet with reason: host reimage [16:21:56] !log flipping BGP flag to true in netbox for ml-serve-ctrl100[1-2],ml-serve100[1-4],dse-k8s-ctrl100[1-2],dse-k8s-worker100[1-4] [16:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T371742)', diff saved to https://phabricator.wikimedia.org/P68330 and previous config saved to /var/cache/conftool/dbconfig/20240830-162236-ladsgroup.json [16:22:39] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance [16:22:41] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [16:22:52] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance [16:22:55] claime: wait, the flag was _false_ for ml-serve machines < 1009? [16:22:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1181 (T371742)', diff saved to https://phabricator.wikimedia.org/P68331 and previous config saved to /var/cache/conftool/dbconfig/20240830-162258-ladsgroup.json [16:23:07] klausman: for all these above yes [16:23:13] weeeird. [16:23:20] definitely should not be the case. 
[16:23:21] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2057.codfw.wmnet [16:23:23] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2057.codfw.wmnet [16:23:27] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2058.codfw.wmnet [16:23:28] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2058.codfw.wmnet [16:23:30] I'll have a look at NB logs [16:23:32] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2059.codfw.wmnet [16:23:34] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2059.codfw.wmnet [16:23:36] klausman: what's weirder is that I don't see changelogs for it [16:23:59] ops-codfw, DC-Ops, Prod-Kubernetes, serviceops, Kubernetes: Relabel codfw kubernetes nodes mw237[789] - https://phabricator.wikimedia.org/T373699 (akosiaris) NEW [16:24:57] ops-codfw, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373591#10106748 (Clement_Goubert) [16:25:26] ops-codfw, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373591#10106755 (Clement_Goubert) [16:26:26] ops-codfw, DC-Ops, Prod-Kubernetes, serviceops, Kubernetes: Relabel codfw kubernetes nodes mw237[789] - https://phabricator.wikimedia.org/T373699#10106753 (Clement_Goubert) →Duplicate dup:T373591 [16:26:41] !log homer 'cr*eqiad*' commit 'T351074, T372878, and fix ml-serve and dse-k8s bgp' [16:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:46] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [16:26:47]
T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [16:28:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P68332 and previous config saved to /var/cache/conftool/dbconfig/20240830-162856-ladsgroup.json [16:30:24] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:32:48] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2064.codfw.wmnet with OS bullseye [16:33:05] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106766 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by swfrench@cumin2002 for host wikik... [16:36:17] (PS4) Herron: grafana: set thanos as default datasource [puppet] - https://gerrit.wikimedia.org/r/1069230 (https://phabricator.wikimedia.org/T269333) [16:36:33] (CR) Herron: grafana: set thanos as default datasource (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1069230 (https://phabricator.wikimedia.org/T269333) (owner: Herron) [16:38:07] (CR) Andrea Denisse: [C:+1] "LGTM, thank you!"
[puppet] - https://gerrit.wikimedia.org/r/1069230 (https://phabricator.wikimedia.org/T269333) (owner: Herron) [16:39:33] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2065.codfw.wmnet with OS bullseye [16:39:49] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10106783 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by swfrench@cumin2002 for host wikik... [16:39:57] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1033.eqiad.wmnet [16:39:59] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1033.eqiad.wmnet [16:40:31] !log running homer 'lsw1-b3-codfw*' commit 'T372878' [16:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:36] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [16:42:24] (CR) Andrew Bogott: Make cloudcephosd1039-1041 into ceph osd nodes (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1063892 (https://phabricator.wikimedia.org/T372814) (owner: Andrew Bogott) [16:42:35] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2064.codfw.wmnet [16:42:38] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2064.codfw.wmnet [16:42:50] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2065.codfw.wmnet [16:42:52] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2065.codfw.wmnet [16:44:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T370903)', diff saved to
https://phabricator.wikimedia.org/P68333 and previous config saved to /var/cache/conftool/dbconfig/20240830-164403-ladsgroup.json [16:44:05] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1206.eqiad.wmnet with reason: Maintenance [16:44:08] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [16:44:19] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1206.eqiad.wmnet with reason: Maintenance [16:44:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T370903)', diff saved to https://phabricator.wikimedia.org/P68334 and previous config saved to /var/cache/conftool/dbconfig/20240830-164425-ladsgroup.json [16:47:17] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373591#10106812 (10Scott_French) [16:49:41] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:53:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T371742)', diff saved to https://phabricator.wikimedia.org/P68335 and previous config saved to /var/cache/conftool/dbconfig/20240830-165322-ladsgroup.json [16:53:27] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [16:59:47] !log running homer 'cr*codfw*' commit 'T372878' [16:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:52] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [17:00:05] PROBLEM - Host mw2377 is DOWN: PING CRITICAL - Packet loss = 
100% [17:00:13] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 421, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:02:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T370903)', diff saved to https://phabricator.wikimedia.org/P68336 and previous config saved to /var/cache/conftool/dbconfig/20240830-170238-ladsgroup.json [17:02:43] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [17:06:25] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 503, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:07:50] (03CR) 10Dzahn: "ACK, and thank you :)" [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [17:08:12] (03PS1) 10Andrew Bogott: Move idc/oidc keystone secrets to a place where we can find them [labs/private] - 10https://gerrit.wikimedia.org/r/1069250 [17:08:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P68337 and previous config saved to /var/cache/conftool/dbconfig/20240830-170829-ladsgroup.json [17:10:28] (03PS4) 10Andrew Bogott: keystone + oidc [puppet] - 10https://gerrit.wikimedia.org/r/1068877 (https://phabricator.wikimedia.org/T359590) [17:10:47] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] Move idc/oidc keystone secrets to a place where we can find them [labs/private] - 10https://gerrit.wikimedia.org/r/1069250 (owner: 10Andrew Bogott) [17:11:18] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1068877 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [17:17:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P68338 and previous config 
saved to /var/cache/conftool/dbconfig/20240830-171745-ladsgroup.json [17:19:13] (CR) Scott French: [C:+1] "Nice! I see one or two things on the parent, but will comment there." [cookbooks] - https://gerrit.wikimedia.org/r/1068779 (owner: Clément Goubert) [17:22:09] PROBLEM - Host mw2378 is DOWN: PING CRITICAL - Packet loss = 100% [17:23:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P68339 and previous config saved to /var/cache/conftool/dbconfig/20240830-172336-ladsgroup.json [17:28:02] ^ this seems to be a rename from mw2378 to wikikube-worker2058 [17:28:30] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 6.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:28:49] and then maybe the check is against the older hostname, which is why it is failing, because the new host seems to be up [17:30:38] ops-codfw, SRE, collaboration-services, DC-Ops, and 2 others: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10106962 (Dzahn) The server `lists2001` mentioned here for Collaboration Services is standby and therefore ok to do anytime.
[17:32:37] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:32:43] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:32:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P68340 and previous config saved to /var/cache/conftool/dbconfig/20240830-173253-ladsgroup.json [17:33:07] FIRING: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:33:10] oohh [17:33:22] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 2 others: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096#10106969 (10Dzahn) The server `phab2002` mentioned here for Collaboration Services is standby and therefore ok to do anytime. 
[17:33:30] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 4.941s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:34:25] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 2 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10106972 (10Dzahn) The server `gerrit2002` mentioned here for Collaboration Services is a replica, not the main host. It's somewhat in pro... [17:35:40] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:35:53] RESOLVED: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:38:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:38:30] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 4.941s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:38:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T371742)', diff saved to https://phabricator.wikimedia.org/P68341 and previous config saved to /var/cache/conftool/dbconfig/20240830-173843-ladsgroup.json [17:38:46] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1191.eqiad.wmnet with reason: Maintenance [17:38:49] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [17:38:59] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1191.eqiad.wmnet with reason: Maintenance [17:39:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T371742)', diff saved to https://phabricator.wikimedia.org/P68342 and previous config saved to /var/cache/conftool/dbconfig/20240830-173905-ladsgroup.json [17:40:40] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:42:29] (03CR) 10Eevans: [C:03+1] changeprop: Update references to latest beta restbase node [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069145 (https://phabricator.wikimedia.org/T370460) (owner: 10Jgiannelos) [17:44:25] !log releases1003/2003 - sudo apt-get remove openjdk-11-* - Java 11 has been replaced by Java 17 - T359795 [17:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:30] T359795: Switch Jenkins instances from Java 11 to Java 17 - https://phabricator.wikimedia.org/T359795 [17:48:00] !log 
ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T370903)', diff saved to https://phabricator.wikimedia.org/P68343 and previous config saved to /var/cache/conftool/dbconfig/20240830-174800-ladsgroup.json [17:48:02] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1207.eqiad.wmnet with reason: Maintenance [17:48:05] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [17:48:15] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1207.eqiad.wmnet with reason: Maintenance [17:48:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1207 (T370903)', diff saved to https://phabricator.wikimedia.org/P68344 and previous config saved to /var/cache/conftool/dbconfig/20240830-174822-ladsgroup.json [17:54:53] (03PS3) 10Msz2001: Enable EditCheck references on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069257 (https://phabricator.wikimedia.org/T373079) [17:55:03] (03PS1) 10Physikerwelt: Remove redundandant setting of $wgDefaultUserOptions['math'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069258 (https://phabricator.wikimedia.org/T373703) [17:55:36] (03CR) 10Scott French: "This is great! A couple of comments, but otherwise LGTM." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1067989 (owner: 10Clément Goubert) [17:59:08] (03CR) 10JHathaway: [C:03+1] profile::puppetserver: set java_start_mem to 40g [puppet] - 10https://gerrit.wikimedia.org/r/1069185 (https://phabricator.wikimedia.org/T373527) (owner: 10Elukey) [18:01:22] (03CR) 10DLynch: [C:03+1] Enable EditCheck references on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069257 (https://phabricator.wikimedia.org/T373079) (owner: 10Msz2001) [18:07:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T371742)', diff saved to https://phabricator.wikimedia.org/P68345 and previous config saved to /var/cache/conftool/dbconfig/20240830-180757-ladsgroup.json [18:08:02] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [18:08:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T370903)', diff saved to https://phabricator.wikimedia.org/P68346 and previous config saved to /var/cache/conftool/dbconfig/20240830-180843-ladsgroup.json [18:08:48] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [18:18:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:18:35] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:19:17] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:20:15] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:20:31] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 4.196 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:21:05] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 12 Oct 2024 12:50:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:21:07] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52482 bytes in 0.087 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:23:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P68347 and previous config saved to /var/cache/conftool/dbconfig/20240830-182304-ladsgroup.json [18:23:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P68348 and previous config saved to /var/cache/conftool/dbconfig/20240830-182350-ladsgroup.json [18:32:37] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:32:43] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:33:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069257 (https://phabricator.wikimedia.org/T373079) (owner: 10Msz2001) [18:35:40] RESOLVED: [2x] SystemdUnitFailed: 
httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:38:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P68349 and previous config saved to /var/cache/conftool/dbconfig/20240830-183812-ladsgroup.json [18:38:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P68350 and previous config saved to /var/cache/conftool/dbconfig/20240830-183858-ladsgroup.json [18:44:35] FIRING: CirrusSearchMoreLikeLatencyTooHigh: ... [18:44:36] CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [18:49:35] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: ... 
[18:49:35] CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [18:51:03] !log eevans@cumin1002 START - Cookbook sre.hosts.reboot-single for host aqs1014.eqiad.wmnet [18:51:27] ops-eqiad, SRE, Cassandra, DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#10107138 (ops-monitoring-bot) Host rebooted by eevans@cumin1002 with reason: None [18:53:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T371742)', diff saved to https://phabricator.wikimedia.org/P68351 and previous config saved to /var/cache/conftool/dbconfig/20240830-185319-ladsgroup.json [18:53:21] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1194.eqiad.wmnet with reason: Maintenance [18:53:24] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [18:53:34] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1194.eqiad.wmnet with reason: Maintenance [18:53:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T371742)', diff saved to https://phabricator.wikimedia.org/P68352 and previous config saved to /var/cache/conftool/dbconfig/20240830-185341-ladsgroup.json [18:54:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T370903)', diff saved to https://phabricator.wikimedia.org/P68353 and previous config saved to /var/cache/conftool/dbconfig/20240830-185405-ladsgroup.json [18:54:07] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1218.eqiad.wmnet with 
reason: Maintenance [18:54:10] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [18:54:20] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1218.eqiad.wmnet with reason: Maintenance [18:54:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T370903)', diff saved to https://phabricator.wikimedia.org/P68354 and previous config saved to /var/cache/conftool/dbconfig/20240830-185427-ladsgroup.json [18:55:53] FIRING: [4x] ProbeDown: Service aqs1014-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:59:07] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs1014.eqiad.wmnet [19:00:53] RESOLVED: [4x] ProbeDown: Service aqs1014-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:20:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T371742)', diff saved to https://phabricator.wikimedia.org/P68355 and previous config saved to /var/cache/conftool/dbconfig/20240830-192021-ladsgroup.json [19:20:26] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [19:34:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T370903)', diff saved to https://phabricator.wikimedia.org/P68356 and previous config saved to /var/cache/conftool/dbconfig/20240830-193413-ladsgroup.json [19:34:19] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 
[19:35:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P68357 and previous config saved to /var/cache/conftool/dbconfig/20240830-193528-ladsgroup.json [19:43:58] (PS5) Andrew Bogott: keystone + oidc [puppet] - https://gerrit.wikimedia.org/r/1068877 (https://phabricator.wikimedia.org/T359590) [19:43:58] (PS1) Andrew Bogott: Keystone: make codfw1dev keystone APIs public [puppet] - https://gerrit.wikimedia.org/r/1069279 (https://phabricator.wikimedia.org/T359590) [19:46:22] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1069279 (https://phabricator.wikimedia.org/T359590) (owner: Andrew Bogott) [19:49:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P68358 and previous config saved to /var/cache/conftool/dbconfig/20240830-194919-ladsgroup.json [19:49:43] (PS2) Andrew Bogott: Keystone: make codfw1dev keystone APIs public [puppet] - https://gerrit.wikimedia.org/r/1069279 (https://phabricator.wikimedia.org/T359590) [19:49:44] (PS6) Andrew Bogott: keystone + oidc [puppet] - https://gerrit.wikimedia.org/r/1068877 (https://phabricator.wikimedia.org/T359590) [19:50:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P68359 and previous config saved to /var/cache/conftool/dbconfig/20240830-195037-ladsgroup.json [19:52:07] (PS3) Andrew Bogott: Keystone: make codfw1dev keystone APIs public [puppet] - https://gerrit.wikimedia.org/r/1069279 (https://phabricator.wikimedia.org/T359590) [19:52:08] (PS7) Andrew Bogott: keystone + oidc [puppet] - https://gerrit.wikimedia.org/r/1068877 (https://phabricator.wikimedia.org/T359590) [19:52:09] PROBLEM - Disk space on grafana1002 is CRITICAL: DISK CRITICAL - free space: / 585MiB (3% inode=53%): /tmp 
585MiB (3% inode=53%): /var/tmp 585MiB (3% inode=53%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana1002&var-datasource=eqiad+prometheus/ops [19:52:21] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1069279 (https://phabricator.wikimedia.org/T359590) (owner: Andrew Bogott) [19:55:53] (CR) Andrew Bogott: [C:+2] Keystone: make codfw1dev keystone APIs public [puppet] - https://gerrit.wikimedia.org/r/1069279 (https://phabricator.wikimedia.org/T359590) (owner: Andrew Bogott) [20:04:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P68361 and previous config saved to /var/cache/conftool/dbconfig/20240830-200427-ladsgroup.json [20:05:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T371742)', diff saved to https://phabricator.wikimedia.org/P68362 and previous config saved to /var/cache/conftool/dbconfig/20240830-200544-ladsgroup.json [20:05:46] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1202.eqiad.wmnet with reason: Maintenance [20:05:54] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [20:06:00] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1202.eqiad.wmnet with reason: Maintenance [20:06:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T371742)', diff saved to https://phabricator.wikimedia.org/P68363 and previous config saved to /var/cache/conftool/dbconfig/20240830-200606-ladsgroup.json [20:10:29] RECOVERY - Host mw2295 is UP: PING WARNING - Packet loss = 33%, RTA = 0.27 ms [20:11:19] PROBLEM - SSH on mw2295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:12:09] 
RECOVERY - Disk space on grafana1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana1002&var-datasource=eqiad+prometheus/ops [20:16:53] PROBLEM - Host mw2295 is DOWN: PING CRITICAL - Packet loss = 100% [20:19:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T370903)', diff saved to https://phabricator.wikimedia.org/P68364 and previous config saved to /var/cache/conftool/dbconfig/20240830-201934-ladsgroup.json [20:19:36] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1219.eqiad.wmnet with reason: Maintenance [20:19:40] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [20:19:50] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1219.eqiad.wmnet with reason: Maintenance [20:19:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T370903)', diff saved to https://phabricator.wikimedia.org/P68365 and previous config saved to /var/cache/conftool/dbconfig/20240830-201956-ladsgroup.json [20:30:24] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:33:40] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:33:40] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:36:19] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:49:41] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [20:58:40] FIRING: [3x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:00:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T371742)', diff saved to https://phabricator.wikimedia.org/P68366 and previous config saved to /var/cache/conftool/dbconfig/20240830-210014-ladsgroup.json [21:00:32] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [21:01:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:09:13] (PS2) Bartosz Dziewoński: logging: Remove WhatFailureGroupHandler wrapper from handlers [mediawiki-config] - https://gerrit.wikimedia.org/r/1067364 (https://phabricator.wikimedia.org/T373444) [21:10:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T370903)', diff 
saved to https://phabricator.wikimedia.org/P68367 and previous config saved to /var/cache/conftool/dbconfig/20240830-211028-ladsgroup.json [21:10:33] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [21:15:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P68368 and previous config saved to /var/cache/conftool/dbconfig/20240830-211521-ladsgroup.json [21:19:15] (Abandoned) Bartosz Dziewoński: logging: Remove WhatFailureGroupHandler wrapper from handlers [mediawiki-config] - https://gerrit.wikimedia.org/r/1067364 (https://phabricator.wikimedia.org/T373444) (owner: Bartosz Dziewoński) [21:25:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P68369 and previous config saved to /var/cache/conftool/dbconfig/20240830-212535-ladsgroup.json [21:30:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P68370 and previous config saved to /var/cache/conftool/dbconfig/20240830-213028-ladsgroup.json [21:33:40] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:36:19] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:39:54] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - 
https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:40:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P68371 and previous config saved to /var/cache/conftool/dbconfig/20240830-214042-ladsgroup.json [21:44:35] PROBLEM - NTP peers on dns1006 is CRITICAL: NTP CRITICAL: Offset 0.189255719 secs (CRITICAL) https://wikitech.wikimedia.org/wiki/NTP [21:45:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T371742)', diff saved to https://phabricator.wikimedia.org/P68372 and previous config saved to /var/cache/conftool/dbconfig/20240830-214536-ladsgroup.json [21:45:38] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1227.eqiad.wmnet with reason: Maintenance [21:45:43] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [21:45:51] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1227.eqiad.wmnet with reason: Maintenance [21:45:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T371742)', diff saved to https://phabricator.wikimedia.org/P68373 and previous config saved to /var/cache/conftool/dbconfig/20240830-214558-ladsgroup.json [21:53:35] RECOVERY - NTP peers on dns1006 is OK: NTP OK: Offset 0.001082523 secs https://wikitech.wikimedia.org/wiki/NTP [21:55:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T370903)', diff saved to https://phabricator.wikimedia.org/P68374 and previous config saved to /var/cache/conftool/dbconfig/20240830-215549-ladsgroup.json [21:55:51] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1232.eqiad.wmnet with reason: Maintenance 
[21:55:54] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [21:56:04] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1232.eqiad.wmnet with reason: Maintenance [21:56:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:56:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T370903)', diff saved to https://phabricator.wikimedia.org/P68375 and previous config saved to /var/cache/conftool/dbconfig/20240830-215611-ladsgroup.json [21:58:40] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:07:23] (PS1) MusikAnimal: Remove $wgCodeMirrorRTL temporary feature flag [mediawiki-config] - https://gerrit.wikimedia.org/r/1069293 (https://phabricator.wikimedia.org/T170001) [22:13:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T370903)', diff saved to https://phabricator.wikimedia.org/P68376 and previous config saved to /var/cache/conftool/dbconfig/20240830-221319-ladsgroup.json [22:13:24] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [22:18:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [22:28:27] !log ladsgroup@cumin1002 dbctl 
commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P68377 and previous config saved to /var/cache/conftool/dbconfig/20240830-222826-ladsgroup.json [22:31:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:31:43] (PS1) Cwhite: loki: increase chunk flush interval [puppet] - https://gerrit.wikimedia.org/r/1069301 (https://phabricator.wikimedia.org/T335610) [22:36:10] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:43:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P68378 and previous config saved to /var/cache/conftool/dbconfig/20240830-224333-ladsgroup.json [22:56:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:58:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T370903)', diff saved to https://phabricator.wikimedia.org/P68379 and previous config saved to /var/cache/conftool/dbconfig/20240830-225840-ladsgroup.json [22:58:42] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1234.eqiad.wmnet with reason: Maintenance [22:58:46] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [22:58:55] 
!log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1234.eqiad.wmnet with reason: Maintenance [22:59:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T370903)', diff saved to https://phabricator.wikimedia.org/P68380 and previous config saved to /var/cache/conftool/dbconfig/20240830-225902-ladsgroup.json [23:01:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T371742)', diff saved to https://phabricator.wikimedia.org/P68381 and previous config saved to /var/cache/conftool/dbconfig/20240830-230059-ladsgroup.json [23:01:09] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [23:01:10] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:03:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:05:39] (PS1) Catrope: CodexModule: Fix double-flipping in RTL [core] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1069310 (https://phabricator.wikimedia.org/T373676) [23:12:34] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1069310 (https://phabricator.wikimedia.org/T373676) (owner: Catrope) [23:16:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P68382 and previous config 
saved to /var/cache/conftool/dbconfig/20240830-231606-ladsgroup.json [23:31:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:31:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P68383 and previous config saved to /var/cache/conftool/dbconfig/20240830-233113-ladsgroup.json [23:36:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:37:42] (PS1) Bartosz Dziewoński: Replace confusing uses of $wgDebugLogFile with $wmgExtraLogFile [mediawiki-config] - https://gerrit.wikimedia.org/r/1069320 [23:37:42] (PS1) Bartosz Dziewoński: Remove labs settings for $wmgExtraLogFile that have no effect [mediawiki-config] - https://gerrit.wikimedia.org/r/1069321 [23:38:52] (PS1) Dzahn: contint: add java jdk-17 packages in addition to jdk-11 [puppet] - https://gerrit.wikimedia.org/r/1069325 (https://phabricator.wikimedia.org/T359795) [23:39:10] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1069326 [23:39:10] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1069326 (owner: TrainBranchBot) [23:39:47] (CR) Bartosz Dziewoński: "I'm just trying to understand what logging.php really does, and finding bizarre things. 
I wrote the commit message in a confident tone, bu" [mediawiki-config] - https://gerrit.wikimedia.org/r/1069320 (owner: Bartosz Dziewoński) [23:40:03] (CR) Bartosz Dziewoński: "I'm just trying to understand what logging.php really does, and finding bizarre things. I wrote the commit message in a confident tone, bu" [mediawiki-config] - https://gerrit.wikimedia.org/r/1069321 (owner: Bartosz Dziewoński) [23:40:27] (PS1) Dzahn: contint: switch java_home from jdk-11 to jdk-17 [puppet] - https://gerrit.wikimedia.org/r/1069327 (https://phabricator.wikimedia.org/T359795) [23:42:04] (PS1) Dzahn: contint: remove jdk-11 packages [puppet] - https://gerrit.wikimedia.org/r/1069328 (https://phabricator.wikimedia.org/T359795) [23:42:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T370903)', diff saved to https://phabricator.wikimedia.org/P68384 and previous config saved to /var/cache/conftool/dbconfig/20240830-234257-ladsgroup.json [23:43:02] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [23:44:41] (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1069325/3802/contint1002.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1069325 (https://phabricator.wikimedia.org/T359795) (owner: Dzahn) [23:46:09] (PS1) Stoyofuku-wmf: Turn on donate link in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1069334 (https://phabricator.wikimedia.org/T372757) [23:46:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T371742)', diff saved to https://phabricator.wikimedia.org/P68385 and previous config saved to /var/cache/conftool/dbconfig/20240830-234621-ladsgroup.json [23:46:23] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [23:46:28] T371742: Change 
page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [23:46:36] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [23:56:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:58:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P68386 and previous config saved to /var/cache/conftool/dbconfig/20240830-235804-ladsgroup.json