[00:09:26] FIRING: [2x] SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:10:41] FIRING: [8x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:11:08] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1062156 (owner: TrainBranchBot)
[00:27:32] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:44:26] FIRING: [2x] SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:56:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[00:59:51] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372336#10060266 (phaultfinder)
[01:00:26] FIRING: [8x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:08:18] (PS1) TrainBranchBot: Branch commit for wmf/1.43.0-wmf.18 [core] (wmf/1.43.0-wmf.18) - https://gerrit.wikimedia.org/r/1062172 (https://phabricator.wikimedia.org/T366963)
[01:08:19] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.43.0-wmf.18 [core] (wmf/1.43.0-wmf.18) - https://gerrit.wikimedia.org/r/1062172 (https://phabricator.wikimedia.org/T366963) (owner: TrainBranchBot)
[01:37:02] (Merged) jenkins-bot: Branch commit for wmf/1.43.0-wmf.18 [core] (wmf/1.43.0-wmf.18) - https://gerrit.wikimedia.org/r/1062172 (https://phabricator.wikimedia.org/T366963) (owner: TrainBranchBot)
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T0200)
[02:09:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:39:25] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:50:26] FIRING: [7x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:59:25] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:04] Deploy window Automatic deployment of of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T0300)
[03:01:22] (PS1) TrainBranchBot: testwikis to 1.43.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/1062213 (https://phabricator.wikimedia.org/T366963)
[03:01:23] (CR) TrainBranchBot: [C:+2] testwikis to 1.43.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/1062213 (https://phabricator.wikimedia.org/T366963) (owner: TrainBranchBot)
[03:02:04] (Merged) jenkins-bot: testwikis to 1.43.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/1062213 (https://phabricator.wikimedia.org/T366963) (owner: TrainBranchBot)
[03:02:25] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.18 refs T366963
[03:02:28] T366963: 1.43.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T366963
[03:25:15] FIRING: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[03:30:15] RESOLVED: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[03:50:52] !log mwpresync@deploy1003 Finished scap: testwikis to 1.43.0-wmf.18 refs T366963 (duration: 48m 26s)
[03:50:55] T366963: 1.43.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T366963
[03:53:48] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T0400)
[04:00:56] !log mwpresync@deploy1003 Pruned MediaWiki: 1.43.0-wmf.15 (duration: 00m 56s)
[04:27:47] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:33:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:44:41] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:55:26] FIRING: [6x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:56:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[05:24:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T0600)
[06:00:04] marostegui, Amir1, and arnaudb: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T0600).
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:05:26] FIRING: [5x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:31:40] FIRING: KubernetesRsyslogDown: rsyslog on mw1463:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1463 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[06:48:04] ops-eqiad, SRE, Data-Persistence, DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10060428 (Marostegui) @VRiley-WMF the issue was the disk? There was nothing related to disks on the error log - I am curious to know how was the problem identified.
[06:48:44] ops-eqiad, SRE, Data-Persistence, DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10060432 (Marostegui) The host also needs to be repooled.
[06:49:39] ops-eqiad, SRE, Data-Persistence, DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10060429 (Marostegui) Resolved→Open Reopening only to keep track that we are waiting for an answer on this.
[06:50:26] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:56:40] RESOLVED: KubernetesRsyslogDown: rsyslog on mw1463:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1463 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:00:05] Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:08:26] (CR) Filippo Giunchedi: "Will need more testing" [puppet] - https://gerrit.wikimedia.org/r/1060388 (https://phabricator.wikimedia.org/T368513) (owner: Ayounsi)
[07:10:19] (CR) Filippo Giunchedi: "I don't think I have enough context for review" [puppet] - https://gerrit.wikimedia.org/r/1059042 (owner: Ayounsi)
[07:12:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1 master: es1027', diff saved to https://phabricator.wikimedia.org/P67282 and previous config saved to /var/cache/conftool/dbconfig/20240813-071240-arnaudb.json
[07:13:40] ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on es1029 - https://phabricator.wikimedia.org/T372208#10060442 (ABran-WMF) @VRiley-WMF please let me know when you're ready, I'll depool the node then
[07:13:40] SRE, SRE-Access-Requests: Requesting access to ldap/wmf for divec - https://phabricator.wikimedia.org/T372369 (dchan) NEW
[07:14:03] (CR) Ayounsi: [C:+2] service::uwsgi: add $ensure variable for clean removal [puppet] - https://gerrit.wikimedia.org/r/1060773 (owner: Ayounsi)
[07:20:33] (CR) Ayounsi: [C:+2] Netbox script proxy: set to absent [puppet] - https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052) (owner: Ayounsi)
[07:24:45] (PS1) Jdlrobson: Revert "Prevent dark-mode styles from affecting print media" [skins/Vector] (wmf/1.43.0-wmf.18) - https://gerrit.wikimedia.org/r/1062284
[07:29:03] (PS2) Jdlrobson: Revert "Prevent dark-mode styles from affecting print media" [skins/Vector] (wmf/1.43.0-wmf.18) - https://gerrit.wikimedia.org/r/1062284 (https://phabricator.wikimedia.org/T372370)
[07:32:24] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1060075 (https://phabricator.wikimedia.org/T311052) (owner: Ayounsi)
[07:32:40] (PS8) Ayounsi: Remove profile::netbox::scripts from Puppet [puppet] - https://gerrit.wikimedia.org/r/1060075 (https://phabricator.wikimedia.org/T311052)
[07:39:59] (CR) Ayounsi: [C:+2] Add an-redacteddb to list of hosts that do not get IPv6 records [software/netbox-extras] - https://gerrit.wikimedia.org/r/1056892 (https://phabricator.wikimedia.org/T365453) (owner: Cathal Mooney)
[07:42:40] (Merged) jenkins-bot: Add an-redacteddb to list of hosts that do not get IPv6 records [software/netbox-extras] - https://gerrit.wikimedia.org/r/1056892 (https://phabricator.wikimedia.org/T365453) (owner: Cathal Mooney)
[07:43:32] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: index corruption
[07:43:45] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: index corruption
[07:47:23] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[07:47:36] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[07:53:48] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:53:53] (CR) Ayounsi: "Adding myself as CC on this so I don't forget to remove scandium from https://github.com/wikimedia/operations-software-netbox-extras/blob/" [puppet] - https://gerrit.wikimedia.org/r/1024402 (https://phabricator.wikimedia.org/T363402) (owner: Alexandros Kosiaris)
[07:56:08] (CR) Fabfur: [C:+1] "lgtm!" [dns] - https://gerrit.wikimedia.org/r/1062155 (https://phabricator.wikimedia.org/T371630) (owner: Dwisehaupt)
[08:00:18] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[08:06:18] (PS1) Ilias Sarantopoulos: ml-services: enwiki-articlequality increase asyncio workers [deployment-charts] - https://gerrit.wikimedia.org/r/1062342
[08:09:55] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#10060574 (ayounsi) Resolved→Open https://netbox.wikimedia.org/extras/scripts/results/78992/ `cloudcephosd1039 (WMF11571) /dcim/devices/5296/ Pr...
[08:11:44] (CR) Marostegui: [C:+1] backups: adds backup2012 [puppet] - https://gerrit.wikimedia.org/r/1061961 (https://phabricator.wikimedia.org/T371984) (owner: Arnaudb)
[08:15:06] (CR) Jelto: [C:+1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/1061973 (https://phabricator.wikimedia.org/T366078) (owner: AOkoth)
[08:18:05] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[08:19:49] !log upgrade postgresql on netboxdb hosts
[08:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:26] FIRING: [4x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:21:43] (PS1) Kevin Bazira: ml-services: prod config for modernized rec-api [deployment-charts] - https://gerrit.wikimedia.org/r/1062348 (https://phabricator.wikimedia.org/T371465)
[08:22:21] (CR) AOkoth: [C:+2] vtrs: add confirmation prompt [cookbooks] - https://gerrit.wikimedia.org/r/1061973 (https://phabricator.wikimedia.org/T366078) (owner: AOkoth)
[08:25:38] (CR) Kevin Bazira: [C:+1] ml-services: enwiki-articlequality increase asyncio workers [deployment-charts] - https://gerrit.wikimedia.org/r/1062342 (owner: Ilias Sarantopoulos)
[08:25:47] (CR) AOkoth: [C:+2] Revert "vrts: change root mail alias" [puppet] - https://gerrit.wikimedia.org/r/1061956 (owner: AOkoth)
[08:26:00] (Abandoned) Ayounsi: Prometheus SSH probe: ignore network devices - try 2 [puppet] - https://gerrit.wikimedia.org/r/1060388 (https://phabricator.wikimedia.org/T368513) (owner: Ayounsi)
[08:26:57] (PS1) Arnaudb: mariadb: swap es1 master from es1029 to es1027 [dns] - https://gerrit.wikimedia.org/r/1062349
[08:27:47] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:27:51] (CR) Ilias Sarantopoulos: [C:+2] ml-services: enwiki-articlequality increase asyncio workers [deployment-charts] - https://gerrit.wikimedia.org/r/1062342 (owner: Ilias Sarantopoulos)
[08:30:26] FIRING: [4x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:33:25] (Merged) jenkins-bot: ml-services: enwiki-articlequality increase asyncio workers [deployment-charts] - https://gerrit.wikimedia.org/r/1062342 (owner: Ilias Sarantopoulos)
[08:34:13] (CR) Marostegui: [C:+1] mariadb: swap es1 master from es1029 to es1027 [dns] - https://gerrit.wikimedia.org/r/1062349 (owner: Arnaudb)
[08:35:06] (CR) Arnaudb: [C:+2] mariadb: swap es1 master from es1029 to es1027 [dns] - https://gerrit.wikimedia.org/r/1062349 (owner: Arnaudb)
[08:35:28] ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Install (2) 960GB SSDs each in kafka-main10[06-10] - https://phabricator.wikimedia.org/T371422#10060634 (JMeybohm) Correct. Anything that is at least as big as the ~900G of the four currently installed SSDs will be fine. Thanks!
[08:36:15] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2002.codfw.wmnet
[08:38:36] (CR) JMeybohm: [C:+1] "Thanks!" [alerts] - https://gerrit.wikimedia.org/r/1060761 (owner: Filippo Giunchedi)
[08:42:50] (PS2) Stevemunene: dns: provision airflow-test-k8s temp domain [dns] - https://gerrit.wikimedia.org/r/1062048 (https://phabricator.wikimedia.org/T368760)
[08:43:01] (CR) Klausman: [C:+1] ml-services: prod config for modernized rec-api [deployment-charts] - https://gerrit.wikimedia.org/r/1062348 (https://phabricator.wikimedia.org/T371465) (owner: Kevin Bazira)
[08:43:34] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: ms-be1078 has no connectivity - https://phabricator.wikimedia.org/T372289#10060653 (MatthewVernon)
[08:43:51] (CR) Stevemunene: dns: provision airflow-test-k8s temp domain (1 comment) [dns] - https://gerrit.wikimedia.org/r/1062048 (https://phabricator.wikimedia.org/T368760) (owner: Stevemunene)
[08:46:08] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Degraded RAID on ms-be1058 - https://phabricator.wikimedia.org/T372207#10060683 (MatthewVernon)
[08:46:16] (CR) JMeybohm: [C:+2] Add reuse-raid10-6dev profile to be used by new kafka-main nodes [puppet] - https://gerrit.wikimedia.org/r/1062033 (https://phabricator.wikimedia.org/T371423) (owner: JMeybohm)
[08:48:04] !log isaranto@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[08:48:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:49:37] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host thanos-be2002.codfw.wmnet
[08:51:43] !log isaranto@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[08:52:34] !log upgrade conftool python packages on puppetserver1001 to 3.2.2
[08:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:41] (CR) Kevin Bazira: [C:+2] ml-services: prod config for modernized rec-api [deployment-charts] - https://gerrit.wikimedia.org/r/1062348 (https://phabricator.wikimedia.org/T371465) (owner: Kevin Bazira)
[08:54:45] (Merged) jenkins-bot: ml-services: prod config for modernized rec-api [deployment-charts] - https://gerrit.wikimedia.org/r/1062348 (https://phabricator.wikimedia.org/T371465) (owner: Kevin Bazira)
[08:56:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[08:59:02] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[09:00:26] FIRING: [4x] SystemdUnitFailed: systemd-timedated.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:01:18] !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[09:02:51] (CR) Btullis: [C:+1] Temporarily disable gobblin timers to upgrade Airflow [puppet] - https://gerrit.wikimedia.org/r/1062031 (https://phabricator.wikimedia.org/T365449) (owner: Stevemunene)
[09:03:07] (CR) Btullis: [C:+1] Upgrade airflow analytics instance version to v2.9.3 [puppet] - https://gerrit.wikimedia.org/r/1062023 (https://phabricator.wikimedia.org/T365449) (owner: Stevemunene)
[09:03:14] (CR) Btullis: [C:+1] Upgrade airflow search instance version to v2.9.3 [puppet] - https://gerrit.wikimedia.org/r/1062022 (https://phabricator.wikimedia.org/T365449) (owner: Stevemunene)
[09:03:41] (CR) Btullis: [C:+1] Upgrade airflow analytics_product instance version to v2.9.3 [puppet] - https://gerrit.wikimedia.org/r/1062021 (https://phabricator.wikimedia.org/T365449) (owner: Stevemunene)
[09:03:44] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2010.codfw.wmnet with OS bullseye
[09:03:56] ops-codfw, SRE, DC-Ops, serviceops, Patch-For-Review: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10060810 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2010.codfw.wmnet with OS bu...
[09:04:34] (PS1) Ayounsi: Update wheels for pynetbox and paramiko updates [software/homer/deploy] - https://gerrit.wikimedia.org/r/1062354 (https://phabricator.wikimedia.org/T371890)
[09:08:42] (CR) Btullis: [C:+1] Upgrade airflow platform_eng instance version to v2.9.3 [puppet] - https://gerrit.wikimedia.org/r/1062020 (https://phabricator.wikimedia.org/T365449) (owner: Stevemunene)
[09:09:02] (CR) Btullis: [C:+1] Upgrade airflow research instance version to v2.9.3 [puppet] - https://gerrit.wikimedia.org/r/1062019 (https://phabricator.wikimedia.org/T365449) (owner: Stevemunene)
[09:09:09] (CR) Btullis: [C:+1] Upgrade airflow wmde instance version to v2.9.3 [puppet] - https://gerrit.wikimedia.org/r/1062018 (https://phabricator.wikimedia.org/T365449) (owner: Stevemunene)
[09:10:26] FIRING: [5x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:12:32] (PS1) MVernon: swift: mark ms-be1058 / sdc1 failed [puppet] - https://gerrit.wikimedia.org/r/1062355 (https://phabricator.wikimedia.org/T372207)
[09:12:41] (PS1) Ayounsi: Update wheels to pickup new pynetbox version [software/netbox-deploy] - https://gerrit.wikimedia.org/r/1062356 (https://phabricator.wikimedia.org/T371890)
[09:16:47] ops-eqiad, SRE, SRE-swift-storage, DC-Ops, Patch-For-Review: Disk (sdc) failed on ms-be1058 - https://phabricator.wikimedia.org/T372207#10060823 (MatthewVernon)
[09:18:11] ops-eqiad, SRE, SRE-swift-storage, DC-Ops, Patch-For-Review: Disk (sdc) failed on ms-be1058 - https://phabricator.wikimedia.org/T372207#10060832 (MatthewVernon) @VRiley-WMF can you confirm my understanding of the state of (lack of) spare drives is correct, please?
[09:19:24] (CR) Marostegui: [C:+1] "Checked that sdc is the broken one" [puppet] - https://gerrit.wikimedia.org/r/1062355 (https://phabricator.wikimedia.org/T372207) (owner: MVernon)
[09:20:01] (CR) Marostegui: [C:+1] dbproxy: mirrors hieradata [puppet] - https://gerrit.wikimedia.org/r/1055428 (https://phabricator.wikimedia.org/T368874) (owner: Arnaudb)
[09:22:09] (PS1) Ayounsi: check_netbox_report.py: use venv's python [puppet] - https://gerrit.wikimedia.org/r/1062358 (https://phabricator.wikimedia.org/T371890)
[09:23:51] !log manual run of dump_cloud_ip_ranges.service on puppetserver1001 (failed earlier on)
[09:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:41] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:24:52] (CR) Arnaudb: [C:+2] dbproxy: mirrors hieradata [puppet] - https://gerrit.wikimedia.org/r/1055428 (https://phabricator.wikimedia.org/T368874) (owner: Arnaudb)
[09:25:01] (CR) MVernon: [C:+2] swift: mark ms-be1058 / sdc1 failed [puppet] - https://gerrit.wikimedia.org/r/1062355 (https://phabricator.wikimedia.org/T372207) (owner: MVernon)
[09:25:23] (CR) Filippo Giunchedi: [C:+2] "Sure np, thank you for the review!" [alerts] - https://gerrit.wikimedia.org/r/1060761 (owner: Filippo Giunchedi)
[09:28:07] (PS1) Marostegui: installserver: Do not reimge db2235 [puppet] - https://gerrit.wikimedia.org/r/1062360
[09:29:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:29:44] (CR) JMeybohm: Merge upstream v0.4.0 commit 'a15c162' into v0.4.0 (4 comments) [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/1060843 (https://phabricator.wikimedia.org/T337928) (owner: JMeybohm)
[09:30:13] (PS2) JMeybohm: Merge upstream v0.4.0 commit 'a15c162' into v0.4.0 [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/1060843 (https://phabricator.wikimedia.org/T337928)
[09:30:13] (PS2) JMeybohm: Update simple-cfssl to use wmf packages [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/1060844 (https://phabricator.wikimedia.org/T337928)
[09:33:21] (CR) Marostegui: [C:+2] installserver: Do not reimge db2235 [puppet] - https://gerrit.wikimedia.org/r/1062360 (owner: Marostegui)
[09:39:21] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#10060929 (dcaro) I got this when trying to set the fqdn (checked others that have the fqdn set on the ipv6, and they don't have the role set, maybe a new...
[09:40:27] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Reimaging clouddb1016 T365424
[09:40:29] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424
[09:40:40] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Reimaging clouddb1016 T365424
[09:41:17] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet,service=s5
[09:41:20] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet,service=s8
[09:46:54] !log fnegri@cumin1002 START - Cookbook sre.hosts.reimage for host clouddb1016.eqiad.wmnet with OS bookworm
[09:52:50] (PS1) Ayounsi: ipaddress validator: rename device_role to role [software/netbox-extras] - https://gerrit.wikimedia.org/r/1062365
[09:54:54] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2010.codfw.wmnet with OS bullseye
[09:55:00] ops-codfw, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10060991 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2010.codfw.wmnet with OS bullseye executed with error...
[09:55:36] (CR) Ayounsi: [C:+2] ipaddress validator: rename device_role to role [software/netbox-extras] - https://gerrit.wikimedia.org/r/1062365 (owner: Ayounsi)
[09:57:20] (Merged) jenkins-bot: ipaddress validator: rename device_role to role [software/netbox-extras] - https://gerrit.wikimedia.org/r/1062365 (owner: Ayounsi)
[09:58:14] (CR) Wargo: [C:+1] [sysop_plwiki] Change the logo/icon and the favicon [mediawiki-config] - https://gerrit.wikimedia.org/r/1051757 (https://phabricator.wikimedia.org/T368712) (owner: Superpes15)
[09:58:35] (PS3) JMeybohm: Merge upstream v0.4.0 commit 'a15c162' into v0.4.0 [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/1060843 (https://phabricator.wikimedia.org/T337928)
[09:58:35] (PS3) JMeybohm: Update simple-cfssl to use wmf packages [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/1060844 (https://phabricator.wikimedia.org/T337928)
[09:59:09] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[09:59:27] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb1016.eqiad.wmnet with reason: host reimage
[09:59:41] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T1000)
[10:00:11] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[10:01:46] (PS4) JMeybohm: Merge upstream v0.4.0 commit 'a15c162' into v0.4.0 [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/1060843 (https://phabricator.wikimedia.org/T337928)
[10:01:47] (PS4) JMeybohm: Update simple-cfssl to use wmf packages [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/1060844 (https://phabricator.wikimedia.org/T337928)
[10:02:07] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1016.eqiad.wmnet with reason: host reimage
[10:02:50] (CR) Stevemunene: [C:+2] Temporarily disable gobblin timers to upgrade Airflow [puppet] - https://gerrit.wikimedia.org/r/1062031 (https://phabricator.wikimedia.org/T365449) (owner: Stevemunene)
[10:05:38] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-airflow1007.eqiad.wmnet
[10:05:51] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host an-airflow1007.eqiad.wmnet
[10:06:49] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[10:07:23] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-airflow1007.eqiad.wmnet
[10:09:10] (PS2) Klausman: api-gw/liftwing: add missing trailing `/` to path trim [deployment-charts] - https://gerrit.wikimedia.org/r/1062368 (https://phabricator.wikimedia.org/T371465)
[10:09:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:10:56] !log dcaro@cumin1002 START - Cookbook sre.dns.netbox
[10:11:36] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-airflow1007.eqiad.wmnet
[10:12:15] (CR) Stevemunene: [C:+2] Upgrade airflow wmde instance version to v2.9.3 [puppet] - https://gerrit.wikimedia.org/r/1062018 (https://phabricator.wikimedia.org/T365449) (owner: Stevemunene)
[10:13:28] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1016.eqiad.wmnet,service=s5
[10:13:31] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1016.eqiad.wmnet,service=s8
[10:15:51] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet,service=s8
[10:16:52] (PS5) JMeybohm: Merge upstream v0.4.0 commit 'a15c162' into v0.4.0 [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/1060843 (https://phabricator.wikimedia.org/T337928)
[10:16:52] (PS5) JMeybohm: Update simple-cfssl to use wmf packages [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/1060844 (https://phabricator.wikimedia.org/T337928)
[10:18:13] (CR) JMeybohm: "Sorry for the back and forth @swfrench@wikimedia.org - I messed up the initial version of this as it would not cleanly merge into main. I " [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/1060843 (https://phabricator.wikimedia.org/T337928) (owner: JMeybohm)
[10:26:15] (CR) Filippo Giunchedi: [C:+1] "I apologize for the late review, LGTM! Maybe merge early next week due to (most of) europe short week this week" [puppet] - https://gerrit.wikimedia.org/r/1055213 (https://phabricator.wikimedia.org/T369607) (owner: JMeybohm)
[10:26:54] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb1016.eqiad.wmnet with OS bookworm
[10:27:12] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=s5
[10:27:17] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=s8
[10:29:11] (CR) Filippo Giunchedi: Prometheus: Add recording rules computing commonly used envoy histograms (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1055432 (https://phabricator.wikimedia.org/T369607) (owner: JMeybohm)
[10:32:42] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-airflow1002.eqiad.wmnet
[10:36:08] (PS1) JMeybohm: Don't reuse partitions for initial reimage of new kafka nodes [puppet] - https://gerrit.wikimedia.org/r/1062370 (https://phabricator.wikimedia.org/T371423)
[10:38:38] !log stevemunene@cumin1002 END (PASS) - Cookbook
sre.hosts.reboot-single (exit_code=0) for host an-airflow1002.eqiad.wmnet [10:38:45] !log dcaro@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Added ipv6 entry for cloudcephosd1039 - dcaro@cumin1002" [10:38:49] !log dcaro@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Added ipv6 entry for cloudcephosd1039 - dcaro@cumin1002" [10:38:49] !log dcaro@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:38:55] (03CR) 10JMeybohm: [C:03+2] Don't reuse partitions for initial reimage of new kafka nodes [puppet] - 10https://gerrit.wikimedia.org/r/1062370 (https://phabricator.wikimedia.org/T371423) (owner: 10JMeybohm) [10:39:15] (03CR) 10Stevemunene: [C:03+2] Upgrade airflow research instance version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062019 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene) [10:49:11] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-airflow1004.eqiad.wmnet [10:53:04] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-airflow1004.eqiad.wmnet [10:53:27] (03CR) 10Stevemunene: [C:03+2] Upgrade airflow platform_eng instance version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062020 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene) [10:57:33] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-airflow1006.eqiad.wmnet [10:57:46] question for the backport window later, especially if rzl is around: is there currently a recommended way to run a maintenance script with mwscript-k8s and dump the output in a file? 
[10:58:30] options I can think of: a) --attach > outfile; b) kubectl [as printed by mwscript-k8s] > outfile; c) don’t use mwscript-k8s for this yet ^^ [11:01:57] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-airflow1006.eqiad.wmnet [11:03:22] (03CR) 10Stevemunene: [C:03+2] Upgrade airflow analytics_product instance version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062021 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene) [11:06:08] (if I don’t hear from anyone I’ll probably go with option c ^^) [11:07:58] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-airflow1005.eqiad.wmnet [11:11:52] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-airflow1005.eqiad.wmnet [11:17:58] !log deploy pfw policy update 1723510554 - T372367 [11:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:56] (03CR) 10Stevemunene: [C:03+2] Upgrade airflow search instance version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062022 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene) [11:28:22] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2010.codfw.wmnet with OS bullseye [11:28:38] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10061225 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2010.codfw.wmnet with OS bu... 
[11:29:08] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-launcher1002.eqiad.wmnet [11:35:33] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-launcher1002.eqiad.wmnet [11:37:10] (03CR) 10Stevemunene: [C:03+2] Upgrade airflow analytics instance version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062023 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene) [11:38:57] (03PS2) 10Stevemunene: Upgrade the default airflow version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062024 (https://phabricator.wikimedia.org/T365449) [11:40:49] (03CR) 10Jdlrobson: [C:03+1] "Hi Jeena: this needs to be merged before we roll out the train." [skins/Vector] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1062284 (https://phabricator.wikimedia.org/T372370) (owner: 10Jdlrobson) [11:42:15] (03PS5) 10Arnaudb: mariadb: observability - adds shard information on recording rule [puppet] - 10https://gerrit.wikimedia.org/r/1054884 (https://phabricator.wikimedia.org/T367283) [11:42:44] (03PS6) 10Arnaudb: mariadb: observability - adds shard information on recording rule [puppet] - 10https://gerrit.wikimedia.org/r/1054884 (https://phabricator.wikimedia.org/T367283) [11:43:01] (03CR) 10Arnaudb: mariadb: observability - adds shard information on recording rule (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1054884 (https://phabricator.wikimedia.org/T367283) (owner: 10Arnaudb) [11:43:19] (03PS3) 10Stevemunene: Upgrade the default airflow version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062024 (https://phabricator.wikimedia.org/T365449) [11:43:23] (03CR) 10Stevemunene: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1062024 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene) [11:45:26] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:46:26] (03CR) 10CI reject: [V:04-1] Upgrade the default airflow version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062024 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene) [11:48:59] (03PS4) 10Stevemunene: Upgrade the default airflow version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062024 (https://phabricator.wikimedia.org/T365449) [11:51:35] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#10061264 (10dcaro) 05Open→03Resolved Done :) [11:52:01] (03CR) 10Stevemunene: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1062024 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T1200) [12:06:17] (03CR) 10Brouberol: "Actually, I think we should drop this and consider moving the deployment of PG to the airflow helmfile instead. 
The reason for this is men" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062030 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [12:06:25] (03PS1) 10Marostegui: mariadb: Change s3 candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1062379 (https://phabricator.wikimedia.org/T371361) [12:07:41] (03CR) 10Marostegui: [C:03+2] mariadb: Change s3 candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1062379 (https://phabricator.wikimedia.org/T371361) (owner: 10Marostegui) [12:11:07] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1189 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1062381 (https://phabricator.wikimedia.org/T372393) [12:11:12] (03PS1) 10Gerrit maintenance bot: wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1062382 (https://phabricator.wikimedia.org/T372393) [12:13:11] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10061410 (10VRiley-WMF) My apologies! Disregard the drive replaced comment as I meant that for a different ticket. I will be updating the firmware on this device to see if this resolves... 
[12:13:49] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10061411 (10Marostegui) @VRiley-WMF let us know when you'd like to do this, as we need to switch of MySQL [12:14:22] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10061414 (10Marostegui) [12:16:08] (03CR) 10Hnowlan: [C:03+1] api-gw/liftwing: add missing trailing `/` to path trim [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062368 (https://phabricator.wikimedia.org/T371465) (owner: 10Klausman) [12:18:36] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2010.codfw.wmnet with OS bullseye [12:18:44] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10061430 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2010.codfw.wmnet with OS bullseye executed with error... 
[12:19:25] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:20:38] (03PS1) 10Stevemunene: Revert "Temporarily disable gobblin timers to upgrade Airflow" [puppet] - 10https://gerrit.wikimedia.org/r/1062387 [12:23:08] (03CR) 10Andrew Bogott: [C:03+2] git-sync-upstream: use sudo for puppetserver-deploy-code [puppet] - 10https://gerrit.wikimedia.org/r/1062067 (owner: 10Andrew Bogott) [12:23:48] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:27:47] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:29:50] (03Abandoned) 10Brouberol: airflow: add conditional dependency to cloudnative-pg-cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062030 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [12:30:52] (03CR) 10Ssingh: sre.dns.admin: add cookbook for GeoDNS pool/depool (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1060914 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [12:31:10] (03CR) 10Stevemunene: [C:03+2] Revert "Temporarily disable gobblin timers to upgrade Airflow" [puppet] - 10https://gerrit.wikimedia.org/r/1062387 (owner: 10Stevemunene) [12:36:50] (03PS5) 10Stevemunene: Upgrade the default airflow version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062024 (https://phabricator.wikimedia.org/T365449) [12:37:46] 
!log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2009.codfw.wmnet with OS bullseye [12:37:58] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10061457 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2009.codfw.wmnet with OS bullseye [12:40:05] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10061463 (10JMeybohm) @Jhancock.wm could you please check kafka-main2010 again? After trying to re-image I now only see 5 disks in iDRAC. [12:40:26] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:41:12] (03CR) 10Filippo Giunchedi: [C:03+2] mediawiki: bump limit/request for statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061856 (https://phabricator.wikimedia.org/T371885) (owner: 10Filippo Giunchedi) [12:42:47] jouncebot: now and next [12:42:47] For the next 0 hour(s) and 17 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T1200) [12:43:46] !log filippo@deploy1003 Started scap sync-world: new statsd-exporter limits [12:46:42] (03CR) 10Elukey: [C:03+1] sre.dns.admin: add cookbook for GeoDNS pool/depool (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1060914 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [12:47:06] !log filippo@deploy1003 Finished scap: new statsd-exporter limits (duration: 03m 52s) [12:48:14] (03CR) 10Elukey: [C:03+1] Update wheels for pynetbox and paramiko updates [software/homer/deploy] - 
10https://gerrit.wikimedia.org/r/1062354 (https://phabricator.wikimedia.org/T371890) (owner: 10Ayounsi) [12:49:11] (03CR) 10Elukey: [C:03+1] "Fine to me, have you ran the script manually to verify that it works?" [puppet] - 10https://gerrit.wikimedia.org/r/1062358 (https://phabricator.wikimedia.org/T371890) (owner: 10Ayounsi) [12:53:38] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544#10061502 (10cmooney) [12:56:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [12:57:19] (03PS1) 10Filippo Giunchedi: prometheus: add oauth2-proxy for OIDC authentication [puppet] - 10https://gerrit.wikimedia.org/r/1062393 (https://phabricator.wikimedia.org/T326657) [12:57:44] (03CR) 10CI reject: [V:04-1] prometheus: add oauth2-proxy for OIDC authentication [puppet] - 10https://gerrit.wikimedia.org/r/1062393 (https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi) [12:58:08] 10SRE-tools, 06Infrastructure-Foundations: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#10061531 (10elukey) 05Open→03Resolved [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T1300). [13:00:05] MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:08] (03PS2) 10Filippo Giunchedi: prometheus: add oauth2-proxy for OIDC authentication [puppet] - 10https://gerrit.wikimedia.org/r/1062393 (https://phabricator.wikimedia.org/T326657) [13:00:18] hi [13:00:34] anyone wants to run a maintenance script for me? :) it should take a few minutes, no more than an hour [13:00:46] o/ [13:00:48] I can :) [13:01:46] !log START lucaswerkmeister-wmde@mwmaint1002:~$ mwscript maintenance/cleanupTitles.php --wiki=hewikisource --prefix=T314733 2>&1 | tee ~/T314733.log [13:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:49] T314733: Cleanup leftover pages in deleted namespaces on hewikisource - https://phabricator.wikimedia.org/T314733 [13:02:20] that sure is printing a lot of output :CatJam: [13:02:55] (03CR) 10CI reject: [V:04-1] prometheus: add oauth2-proxy for OIDC authentication [puppet] - 10https://gerrit.wikimedia.org/r/1062393 (https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi) [13:03:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:04:07] > all 42,960 of them [13:04:08] I see [13:04:19] (also, “we were on the verge of greatness, we were this close” yadda yadda) [13:04:24] heh [13:04:58] !log FINISHED lucaswerkmeister-wmde@mwmaint1002:~$ mwscript maintenance/cleanupTitles.php --wiki=hewikisource --prefix=T314733 2>&1 | tee ~/T314733.log [13:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:17] !log `apt-get install python3-conftool python3-conftool-requestctl` on all puppetserver nodes - upgrade to 3.2.2 [13:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:50] MatmaRex: just to double-check, you want 102k lines of output? ^^ [13:05:52] (I’ll put them in a paste) [13:06:42] Lucas_WMDE: sure. 
i guess i should have suggested the script option that stops it from printing progress bars [13:06:59] eh, the progress bars were nice when they weren’t buried between all the other output ^^ [13:07:10] but i want to have a record of the page titles it changed, since they're not logged otherwise [13:07:16] > File size is too large. See https://www.mediawiki.org/wiki/Phabricator/Help#Uploading_file_attachments [13:07:18] boo [13:07:23] (03PS1) 10Jelto: sre.gitlab.upgrade: also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) [13:07:35] gzip should do it [13:07:41] * Lucas_WMDE checks if the same applies when uploading as a file instead of paste [13:07:59] files can take arbitrary sizes [13:08:03] yup, still too large [13:08:05] let’s gzip it then [13:08:17] well, it can take many megabytes at least 😅 [13:09:15] well, less than ~11M apparently [13:09:37] https://phabricator.wikimedia.org/T314733#10061560 [13:09:42] phab file size limit is 4 MB [13:09:54] thanks Lucas_WMDE [13:11:00] np [13:11:16] (03CR) 10Ssingh: [C:03+2] P:conftool: add schema for geodns [puppet] - 10https://gerrit.wikimedia.org/r/1053323 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [13:11:47] anything else to deploy? [13:12:23] (03PS1) 10Btullis: Increase analytics postgres max_connections from 100 to 200 [puppet] - 10https://gerrit.wikimedia.org/r/1062395 (https://phabricator.wikimedia.org/T365449) [13:12:38] wondering if I have anything to backport but I can’t think of something right now [13:12:48] (03CR) 10CI reject: [V:04-1] Increase analytics postgres max_connections from 100 to 200 [puppet] - 10https://gerrit.wikimedia.org/r/1062395 (https://phabricator.wikimedia.org/T365449) (owner: 10Btullis) [13:19:05] Jdlrobson: how about we backport https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/1062284 now? 
[13:19:25] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:19:49] (03PS1) 10Brouberol: airflow: deploy postgresql cluster before airflow itself [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062397 (https://phabricator.wikimedia.org/T372286) [13:19:51] (03PS1) 10Brouberol: airflow: fetch PG connection URI from the cloudnative PG cluster secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062398 (https://phabricator.wikimedia.org/T372286) [13:20:34] (03PS2) 10Btullis: Increase analytics postgres max_connections from 100 to 200 [puppet] - 10https://gerrit.wikimedia.org/r/1062395 (https://phabricator.wikimedia.org/T365449) [13:20:46] (03CR) 10CI reject: [V:04-1] airflow: fetch PG connection URI from the cloudnative PG cluster secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062398 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [13:21:07] (03CR) 10CI reject: [V:04-1] sre.gitlab.upgrade: also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto) [13:21:40] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3616/co" [puppet] - 10https://gerrit.wikimedia.org/r/1062395 (https://phabricator.wikimedia.org/T365449) (owner: 10Btullis) [13:22:59] (03PS3) 10Btullis: Increase analytics postgres max_connections from 100 to 200 [puppet] - 10https://gerrit.wikimedia.org/r/1062395 (https://phabricator.wikimedia.org/T365449) [13:23:41] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3617/co" 
[puppet] - 10https://gerrit.wikimedia.org/r/1062395 (https://phabricator.wikimedia.org/T365449) (owner: 10Btullis) [13:23:42] (03PS2) 10Brouberol: airflow: fetch PG connection URI from the cloudnative PG cluster secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062398 (https://phabricator.wikimedia.org/T372286) [13:25:42] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2009.codfw.wmnet with OS bullseye [13:25:50] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10061619 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with error... [13:26:26] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2009.codfw.wmnet with OS bullseye [13:26:32] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10061621 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2009.codfw.wmnet with OS bullseye [13:29:41] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:33:58] (03PS2) 10Jelto: sre.gitlab.upgrade: also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) [13:35:56] !log jayme@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-main2009.codfw.wmnet with OS bullseye [13:36:05] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10061680 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with error... [13:36:34] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2009.codfw.wmnet with OS bullseye [13:38:24] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10061682 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2009.codfw.wmnet with OS bullseye [13:39:55] (03PS1) 10Btullis: Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) [13:40:12] !log update homer wheels - T371890 [13:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:15] T371890: pynetbox incompatibility with Netbox >= 4.0.6 - https://phabricator.wikimedia.org/T371890 [13:40:38] (03CR) 10Marostegui: "Btullis, thanks for working on this part. Next time please wait for any of us to review just in case." 
[puppet] - 10https://gerrit.wikimedia.org/r/1048390 (https://phabricator.wikimedia.org/T365453) (owner: 10Btullis) [13:40:59] (03CR) 10Ayounsi: [C:03+2] Update wheels for pynetbox and paramiko updates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1062354 (https://phabricator.wikimedia.org/T371890) (owner: 10Ayounsi) [13:41:48] !log ayounsi@cumin1002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: update wheels - ayounsi@cumin1002 [13:46:45] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: update wheels - ayounsi@cumin1002 [13:47:05] (03CR) 10CI reject: [V:04-1] sre.gitlab.upgrade: also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto) [13:47:48] (03CR) 10Ayounsi: "Partially yep, I'll properly test it once I951fda89d553731e7c9fa07fd5214278f69028e9 is deployed." [puppet] - 10https://gerrit.wikimedia.org/r/1062358 (https://phabricator.wikimedia.org/T371890) (owner: 10Ayounsi) [13:48:25] (03CR) 10Brouberol: [C:03+1] Increase analytics postgres max_connections from 100 to 200 [puppet] - 10https://gerrit.wikimedia.org/r/1062395 (https://phabricator.wikimedia.org/T365449) (owner: 10Btullis) [13:48:25] !log xcollazo@deploy1003 Started deploy [airflow-dags/analytics@109c99e]: Airflow upgrade to v 2.9.3 for analytics instance. T365449. [13:48:28] T365449: Upgrade Airflow to 2.9.3 - https://phabricator.wikimedia.org/T365449 [13:49:06] !log xcollazo@deploy1003 Finished deploy [airflow-dags/analytics@109c99e]: Airflow upgrade to v 2.9.3 for analytics instance. T365449. 
(duration: 00m 40s) [13:49:15] (03CR) 10Brouberol: [C:03+1] Upgrade the default airflow version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062024 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene) [13:49:33] (03CR) 10Btullis: [V:03+1 C:03+2] Increase analytics postgres max_connections from 100 to 200 [puppet] - 10https://gerrit.wikimedia.org/r/1062395 (https://phabricator.wikimedia.org/T365449) (owner: 10Btullis) [13:49:38] (03CR) 10Brouberol: [C:03+1] dns: provision airflow-test-k8s temp domain [dns] - 10https://gerrit.wikimedia.org/r/1062048 (https://phabricator.wikimedia.org/T368760) (owner: 10Stevemunene) [13:51:20] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: /dev/sdg failed on thanos-be2002 - https://phabricator.wikimedia.org/T372406 (10MatthewVernon) 03NEW [13:51:26] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: /dev/sdg failed on thanos-be2002 - https://phabricator.wikimedia.org/T372406#10061765 (10MatthewVernon) p:05Triage→03High [13:52:15] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10061766 (10Jhancock.wm) a:03Jhancock.wm [13:54:59] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd200[1-4] - https://phabricator.wikimedia.org/T370545#10061768 (10Jhancock.wm) a:03Jhancock.wm [13:57:15] !log UTC backport+config window done (since ~13:10, really) [13:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:01] (I’m still up for deploying that train blocker backport fwiw) [13:59:50] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10061783 (10Jhancock.wm) a:03Jhancock.wm [14:02:07] (03PS1) 10Btullis: Include the tuning.conf file in the postgresql configuration [puppet] - 10https://gerrit.wikimedia.org/r/1062410 (https://phabricator.wikimedia.org/T365449) 
[14:02:58] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3619/co" [puppet] - 10https://gerrit.wikimedia.org/r/1062410 (https://phabricator.wikimedia.org/T365449) (owner: 10Btullis) [14:04:12] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on 7 hosts with reason: prep JunOS upgrade cloudsw1-d5-eqiad [14:04:30] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on 7 hosts with reason: prep JunOS upgrade cloudsw1-d5-eqiad [14:04:56] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 30 hosts with reason: JunOS upgrade cloudsw1-d5-eqiad [14:05:20] (03PS2) 10Btullis: Include the tuning.conf file in the postgresql configuration [puppet] - 10https://gerrit.wikimedia.org/r/1062410 (https://phabricator.wikimedia.org/T365449) [14:05:22] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 30 hosts with reason: JunOS upgrade cloudsw1-d5-eqiad [14:06:07] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3620/co" [puppet] - 10https://gerrit.wikimedia.org/r/1062410 (https://phabricator.wikimedia.org/T365449) (owner: 10Btullis) [14:06:30] (03PS3) 10Jelto: sre.gitlab.upgrade: also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) [14:06:35] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:06:41] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:06:47] (03CR) 10Stevemunene: [C:03+1] Include the tuning.conf file in the postgresql configuration [puppet] - 10https://gerrit.wikimedia.org/r/1062410 (https://phabricator.wikimedia.org/T365449) 
(owner: 10Btullis) [14:08:02] !log rebooting cloudsw1-d5-eqiad to clear errors and upgrade JunOS T371878 [14:08:21] (03CR) 10Btullis: [V:03+1 C:03+2] Include the tuning.conf file in the postgresql configuration [puppet] - 10https://gerrit.wikimedia.org/r/1062410 (https://phabricator.wikimedia.org/T365449) (owner: 10Btullis) [14:09:09] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10061804 (10Jhancock.wm) I located the missing disk and reseated it. it's showing as having a size of 0.94 GB. Not sure if it's bad or needs to be reformatted. lmk and... [14:10:09] (03PS87) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [14:10:14] (03CR) 10AOkoth: prometheus: puppetise sql_exporter (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [14:11:52] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-db1001.eqiad.wmnet [14:16:47] (03PS3) 10Filippo Giunchedi: prometheus: add oauth2-proxy for OIDC authentication [puppet] - 10https://gerrit.wikimedia.org/r/1062393 (https://phabricator.wikimedia.org/T326657) [14:17:49] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-db1001.eqiad.wmnet [14:18:40] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:18:45] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:21:10] !log btullis@deploy1003 Started deploy [airflow-dags/analytics_test@109c99e]: (no justification provided) [14:21:20] !log btullis@deploy1003 Finished deploy [airflow-dags/analytics_test@109c99e]: (no justification provided) (duration: 00m 09s) [14:21:23] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - 
https://phabricator.wikimedia.org/T372160#10061844 (10Jhancock.wm) I found the device with the idrac shut off. tried to start it back up. looks like it tries to boot and then crashes. tried to reset it manually with the i button. I've gotten the idrac to at l... [14:21:42] !log btullis@deploy1003 Started deploy [airflow-dags/search@109c99e]: (no justification provided) [14:22:02] !log btullis@deploy1003 Finished deploy [airflow-dags/search@109c99e]: (no justification provided) (duration: 00m 19s) [14:22:07] (03CR) 10Filippo Giunchedi: "Live in Pontoon at https://prometheus-eqiad.o11y.wmcloud.org" [puppet] - 10https://gerrit.wikimedia.org/r/1062393 (https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi) [14:22:34] (03PS1) 10Sergio Gimeno: EventStreamConfig and stream registration for homepage modules analytics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062416 (https://phabricator.wikimedia.org/T370907) [14:22:41] !log btullis@deploy1003 Started deploy [airflow-dags/research@109c99e]: (no justification provided) [14:22:52] !log btullis@deploy1003 Finished deploy [airflow-dags/research@109c99e]: (no justification provided) (duration: 00m 11s) [14:23:14] !log btullis@deploy1003 Started deploy [airflow-dags/platform_eng@109c99e]: (no justification provided) [14:23:26] (03PS2) 10Sergio Gimeno: EventStreamConfig and stream registration for homepage modules analytics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062416 (https://phabricator.wikimedia.org/T370907) [14:23:38] !log btullis@deploy1003 Finished deploy [airflow-dags/platform_eng@109c99e]: (no justification provided) (duration: 00m 24s) [14:23:53] !log btullis@deploy1003 Started deploy [airflow-dags/analytics_product@109c99e]: (no justification provided) [14:24:02] !log btullis@deploy1003 Finished deploy [airflow-dags/analytics_product@109c99e]: (no justification provided) (duration: 00m 09s) [14:24:10] !log btullis@deploy1003 Started deploy [airflow-dags/wmde@109c99e]: (no 
justification provided) [14:24:18] !log btullis@deploy1003 Finished deploy [airflow-dags/wmde@109c99e]: (no justification provided) (duration: 00m 08s) [14:25:12] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2009.codfw.wmnet with OS bullseye [14:25:21] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10061858 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with error... [14:26:23] (03CR) 10Stevemunene: [C:03+2] Upgrade the default airflow version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062024 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene) [14:27:35] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10061861 (10JMeybohm) >>! In T371423#10061804, @Jhancock.wm wrote: > I located the missing disk and reseated it. it's showing as having a size of 0.94 GB. Not sure if i... 
[14:28:36] (03CR) 10Filippo Giunchedi: [C:03+2] "For the record: this did not yield the expected result, will try again" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061856 (https://phabricator.wikimedia.org/T371885) (owner: 10Filippo Giunchedi) [14:29:25] (03PS2) 10Btullis: Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) [14:30:44] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3621/co" [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [14:35:26] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:36:23] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10061922 (10JMeybohm) Oh, wait. 0.9GB - I totally misread. That is obviously not okay :D Tried rescanning the drives without luck. I would assume it's broken. 
[14:36:41] (03PS3) 10Ssingh: sre.dns.admin: add cookbook for GeoDNS pool/depool [cookbooks] - 10https://gerrit.wikimedia.org/r/1060914 (https://phabricator.wikimedia.org/T369366) [15:05:34] (03PS1) 10Gmodena: config: remove eventbus instrumentation setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062430 (https://phabricator.wikimedia.org/T363587) [15:33:34] (03CR) 10Papaul: [C:03+2] Add new Frack nodes to DNS files [dns] - 10https://gerrit.wikimedia.org/r/1062433 (owner: 10Papaul) [15:33:41] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: /dev/sdg failed on thanos-be2002 - https://phabricator.wikimedia.org/T372406#10062135 (10MatthewVernon) Yes, that looks good to me. Thanks for the quick fix :) [15:34:05] (03PS1) 10Dzahn: gerrit: fix typo in hiera key name for throttling [puppet] - 10https://gerrit.wikimedia.org/r/1062434 (https://phabricator.wikimedia.org/T365259) [15:34:34] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install frlog2002 - https://phabricator.wikimedia.org/T369935#10062140 (10Papaul) [15:34:36] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1062434 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [15:34:40] (03CR) 10Dzahn: [C:03+2] gerrit: fix typo in hiera key name for throttling [puppet] - 10https://gerrit.wikimedia.org/r/1062434 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [15:35:55] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10062143 (10Papaul) [15:36:29] (03PS1) 10Arnaudb: mariadb: exclude translate_message_group_subscriptions from replication [puppet] - 10https://gerrit.wikimedia.org/r/1062436 (https://phabricator.wikimedia.org/T372287) [15:37:18] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install frlog2002 - https://phabricator.wikimedia.org/T369935#10062141 (10Papaul) 
a:05Papaul→03Dwisehaupt @Dwisehaupt this is ready for you [15:37:51] (03CR) 10BCornwall: [C:03+2] ncmonitor: Set ignored domains configuration (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1060891 (https://phabricator.wikimedia.org/T372076) (owner: 10BCornwall) [15:38:07] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-main2006.codfw.wmnet with OS bookworm [15:39:18] (03PS1) 10EoghanGaffney: admin: Add sarai-wmf to data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1062437 (https://phabricator.wikimedia.org/T372290) [15:39:50] !log gerrit - starting to drop packets from abusive sources (T365259) [15:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:13] (03PS1) 10Ebernhardson: flink chart: Create a debug sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062438 [15:43:06] (03CR) 10Elukey: dhcp: allow empty distro for DHCPConfMac and DHCPConfOpt82 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060854 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [15:43:12] (03CR) 10Elukey: [C:03+2] dhcp: allow empty distro for DHCPConfMac and DHCPConfOpt82 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060854 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [15:43:31] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10062163 (10Papaul) a:05Papaul→03Dwisehaupt @Dwisehaupt this it is ready for you [15:44:03] (03PS1) 10Dzahn: gerrit: revert dropping packets from abusive source [puppet] - 10https://gerrit.wikimedia.org/r/1062440 (https://phabricator.wikimedia.org/T365259) [15:44:16] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install civi2002, frpig2002, frpm2002 - https://phabricator.wikimedia.org/T369937#10062167 (10Papaul) [15:44:28] 10ops-codfw, 06SRE, 06DC-Ops, 
10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install civi2002, frpig2002, frpm2002 - https://phabricator.wikimedia.org/T369937#10062168 (10Papaul) ` papaul@fasw-c-codfw# show | compare [edit interfaces interface-range disabled] - member ge-0/0/34; - mem... [15:44:45] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10062169 (10elukey) Encountered an issue with the BMC's network config: ` supermicro_mgmt_network_changes = { "... [15:44:55] (03CR) 10Dzahn: [C:03+2] gerrit: revert dropping packets from abusive source [puppet] - 10https://gerrit.wikimedia.org/r/1062440 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [15:47:02] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install civi2002, frpig2002, frpm2002 - https://phabricator.wikimedia.org/T369937#10062170 (10Papaul) a:05Papaul→03Dwisehaupt @Dwisehaupt this is ready for you [15:50:10] (03CR) 10Marostegui: [C:03+1] "Looks good, remember you need to restart all sanitarium instances in both eqiad and codfw for every section" [puppet] - 10https://gerrit.wikimedia.org/r/1062436 (https://phabricator.wikimedia.org/T372287) (owner: 10Arnaudb) [15:55:48] (03CR) 10Scott French: "My apologies, Hugh - I meant to review this yesterday, but apparently I just left the tab open =/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062055 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [16:00:05] jhathaway and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:03:28] (03CR) 10CI reject: [V:04-1] dhcp: allow empty distro for DHCPConfMac and DHCPConfOpt82 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060854 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [16:15:52] 06SRE, 10SRE-Access-Requests: Requesting access to ldap/wmf for divec - https://phabricator.wikimedia.org/T372369#10062253 (10Dzahn) Hi! The request mentions "staff access rights" but the email address isn't a WMF address. To clarify, is this really FTE staff or a contractor? Thanks! [16:17:01] (03PS2) 10BCornwall: varnish: Set Cache-Control: no-transform header [puppet] - 10https://gerrit.wikimedia.org/r/917954 (https://phabricator.wikimedia.org/T218618) [16:21:10] (03CR) 10Bking: [C:03+1] flink chart: Create a debug sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062438 (owner: 10Ebernhardson) [16:22:49] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on 7 hosts with reason: prep for replacement of cloudsw1-d5-eqiad [16:23:08] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 7 hosts with reason: prep for replacement of cloudsw1-d5-eqiad [16:23:33] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: /dev/sdg failed on thanos-be2002 - https://phabricator.wikimedia.org/T372406#10062277 (10Jhancock.wm) 05Open→03Resolved np! 
[16:23:54] (03CR) 10Ebernhardson: [C:03+2] flink chart: Create a debug sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062438 (owner: 10Ebernhardson) [16:24:18] (03CR) 10Dzahn: [C:03+1] "lgtm, confirmed in BetterWorks" [puppet] - 10https://gerrit.wikimedia.org/r/1062437 (https://phabricator.wikimedia.org/T372290) (owner: 10EoghanGaffney) [16:25:01] (03Merged) 10jenkins-bot: flink chart: Create a debug sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062438 (owner: 10Ebernhardson) [16:56:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [16:56:52] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [16:57:05] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:57:12] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmf for SaraiSan WMF - https://phabricator.wikimedia.org/T372290#10062382 (10eoghan) a:03eoghan Confirmed that the account was all correct as per [[ https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests | the wiki ]] , and hav... [17:00:05] swfrench-wmf and jeena: gettimeofday() says it's time for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T1700) [17:00:46] jeena: I'm here and ready when you are [17:02:39] FYI, folks - we're planning to release a new version of scap that requires a coordinated puppet change. 
please check here before using scap, until noted otherwise :) [17:05:40] (03PS2) 10Dwisehaupt: Remove entries for payments2001 and payments2002 [dns] - 10https://gerrit.wikimedia.org/r/1062155 (https://phabricator.wikimedia.org/T371630) [17:06:22] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore2*: Apply openjdk upgrade — T371874 - eevans@cumin1002 [17:14:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 15 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062037 (owner: 10Isabelle Hurbain-Palatin) [17:19:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:20:03] swfrench-wmf: Almost ready, just waiting for ci jobs to finish [17:20:24] swfrench-wmf: okay, ready now! [17:21:05] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372336#10062414 (10phaultfinder) [17:21:22] jeena: great! 
if you want to go ahead and upgrade scap, I'll merge d.ancy's change and run puppet on the deployment host [17:21:33] I'll follow up here when it's safe to test [17:21:53] (03CR) 10Scott French: [C:03+2] scap.cfg.erb: Update release_repo_build_and_push_images_cmd [puppet] - 10https://gerrit.wikimedia.org/r/1060505 (https://phabricator.wikimedia.org/T371904) (owner: 10Ahmon Dancy) [17:21:55] 👍 [17:24:12] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore2*: Apply openjdk upgrade — T371874 - eevans@cumin1002 [17:24:53] !log jhuneidi@deploy1003 Installing scap version "latest" for 211 hosts [17:25:37] !log jhuneidi@deploy1003 Installation of scap version "latest" completed for 211 hosts [17:26:11] !log run-puppet-agent on deploy1003 to pick up scap.cfg change for T371904 [17:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:23] T371904: Rewrite remaining make-container-image code in Python - https://phabricator.wikimedia.org/T371904 [17:26:31] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: security update - bking@cumin2002 - T371874 [17:27:25] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore1*: Apply openjdk upgrade — T371874 - eevans@cumin1002 [17:28:17] jeena: scap.cfg is updated now, so I think you should be good to test [17:28:27] thanks! 
I'll test now [17:28:52] !log jhuneidi@deploy1003 Started scap sync-world: testing T371904 [17:29:41] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:32:36] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10062452 (10Dwisehaupt) @papaul This host is still listed as `frdc2004` in netbox instead of `frdb2004` thus has an incorrect mgmt dns setup. I could rename it and the mgmt in... [17:39:23] !log jhuneidi@deploy1003 Finished scap sync-world: testing T371904 (duration: 10m 31s) [17:39:28] T371904: Rewrite remaining make-container-image code in Python - https://phabricator.wikimedia.org/T371904 [17:39:44] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: security update - bking@cumin2002 - T371874 [17:40:15] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_eqiad: security update - bking@cumin2002 - T371874 [17:40:55] swfrench-wmf: all seems well [17:42:44] jeena: great, thanks for driving this :) [17:43:56] thank you! 
[17:45:51] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore1*: Apply openjdk upgrade — T371874 - eevans@cumin1002 [17:48:28] 06SRE, 10SRE-Access-Requests: Requesting access to ldap/wmf for divec - https://phabricator.wikimedia.org/T372369#10062486 (10Jdforrester-WMF) [17:49:25] (03Abandoned) 10Ssingh: Remove admin_state from repository (managed via confd) [dns] - 10https://gerrit.wikimedia.org/r/1062429 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [17:49:31] 06SRE, 10SRE-Access-Requests: Requesting access to ldap/wmf for divec - https://phabricator.wikimedia.org/T372369#10062487 (10Dzahn) @Jdforrester-WMF I see your edit. Thanks, got it!:) [17:53:48] 06SRE, 10SRE-Access-Requests: Requesting access to ldap/wmf for divec - https://phabricator.wikimedia.org/T372369#10062510 (10Aklapper) @dchan: Unrelated but could you please also [link your LDAP account](https://phabricator.wikimedia.org/settings/panel/external/) to be listed on https://phabricator.wikimedia.... [17:56:47] (03PS1) 10CDanis: tunnelencabulator: add gitlab and idm [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1062448 [18:00:04] jeena and jnuche: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T1800). 
[18:00:34] (03PS1) 10BCornwall: ncmonitor: Remove duplicate sysuser creation [puppet] - 10https://gerrit.wikimedia.org/r/1062449 [18:01:06] o/ [18:01:49] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3625/co" [puppet] - 10https://gerrit.wikimedia.org/r/1062449 (owner: 10BCornwall) [18:02:30] (03CR) 10JHathaway: [C:03+1] tunnelencabulator: add gitlab and idm [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1062448 (owner: 10CDanis) [18:02:54] (03CR) 10CDanis: [V:03+2 C:03+2] tunnelencabulator: add gitlab and idm [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1062448 (owner: 10CDanis) [18:03:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 7.345% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:03:22] (03PS2) 10BCornwall: ncmonitor: Remove duplicate sysuser creation [puppet] - 10https://gerrit.wikimedia.org/r/1062449 [18:04:08] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3626/co" [puppet] - 10https://gerrit.wikimedia.org/r/1062449 (owner: 10BCornwall) [18:05:22] backporting https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/1062284 before train [18:05:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1003 using scap backport" [skins/Vector] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1062284 (https://phabricator.wikimedia.org/T372370) (owner: 10Jdlrobson) [18:05:57] FIRING: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:06:36] ^ Here. [18:07:58] !incidents [18:07:58] 4964 (ACKED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad) [18:08:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:08:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 5.876s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:09:06] /me too [18:09:34] I will pause backport/train until I get the all-clear. 
Nothing has deployed yet [18:09:51] (because of the db overload issues) [18:10:57] FIRING: [2x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:11:14] !incidents [18:11:14] 4964 (ACKED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad) [18:11:15] 4965 (ACKED) db1199 (paged)/MariaDB Replica Lag: s4 (paged) [18:11:15] 4966 (ACKED) db1248 (paged)/MariaDB Replica Lag: s4 (paged) [18:11:15] 4967 (ACKED) db1238 (paged)/MariaDB Replica Lag: s4 (paged) [18:11:15] 4968 (UNACKED) db1221 (paged)/MariaDB Replica Lag: s4 (paged) [18:11:15] 4969 (ACKED) db1249 (paged)/MariaDB Replica Lag: s4 (paged) [18:11:16] 4970 (UNACKED) db1242 (paged)/MariaDB Replica Lag: s4 (paged) [18:11:16] 4971 (UNACKED) db1247 (paged)/MariaDB Replica Lag: s4 (paged) [18:11:16] 4972 (ACKED) db1244 (paged)/MariaDB Replica Lag: s4 (paged) [18:11:17] 4973 (UNACKED) db1241 (paged)/MariaDB Replica Lag: s4 (paged) [18:11:17] 4974 (ACKED) db1243 (paged)/MariaDB Replica Lag: s4 (paged) [18:11:18] 4975 (ACKED) db1190 (paged)/MariaDB Replica Lag: s4 (paged) [18:11:32] !ack 4968 [18:11:33] 4968 (ACKED) db1221 (paged)/MariaDB Replica Lag: s4 (paged) [18:11:43] FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [18:11:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [18:11:51] FIRING: [3x] 
SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:12:37] this seems...bad [18:12:43] db1160 does appear to be struggling since ~ 18:00 [18:12:47] urandom: Should we depool the replica? [18:12:55] it's the master =/ [18:12:57] db1160 is the s4 primary [18:13:15] FIRING: [4x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:13:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int (k8s) 45.11s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:13:17] okay, can someone take on posting to the status page? [18:13:25] cdanis: On it. 
[18:14:25] FIRING: [16x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:14:26] this rhymes pretty closely to what we were seeing on db1238 before it was switched out - IIRC there's a paste somewhere with diagnostic commands [18:14:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:15:25] I'm catching up on https://phabricator.wikimedia.org/T370304#10001110 [18:15:42] FIRING: [21x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:15:47] Posted: https://www.wikimediastatus.net/incidents/jhq8qcw6bwz7 [18:15:57] RESOLVED: [5x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:15:59] thanks denisse, just going to make some small edits :) [18:16:12] Thank you! [18:16:38] I become the IC. 
[18:16:43] RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [18:16:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [18:16:51] RESOLVED: [8x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:17:20] Document link: https://docs.google.com/document/d/1lscZB565H5z610ECTpit0lzke-rS3au0Cokn-MAy9xw/edit?usp=sharing [18:18:15] RESOLVED: [4x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 1.25% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:18:15] RESOLVED: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int (k8s) 2.546s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:19:12] mysql exporter metrics are available again for db1160 as of ~ 18:14, which is consistent with things (e.g., replica lag) starting to recover [18:19:25] RESOLVED: [21x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:19:35] swfrench-wmf: this time, at least, host metrics were available the whole time [18:19:46] 
https://grafana-rw.wikimedia.org/d/000000377/host-overview?forceLogin&from=1723571107208&orgId=1&to=1723573146824&var-cluster=mysql&var-datasource=thanos&var-server=db1160&viewPanel=13 [18:19:49] this tells part of a story [18:20:19] so does this https://logstash.wikimedia.org/goto/d61dec1cd6c00c43f6454b9ac94a6f7f [18:22:40] cdanis: ah, I didn't realize we even had trouble with host metrics before! but yeah, that makes sense, and is consistent with the jump in mysql-reported connection threads just before things ground to a halt [18:22:59] yeah [18:23:12] https://logstash.wikimedia.org/goto/95c46c61a3894c8a6600065b24b61338 this is the vast majority of the slowlogs at the time [18:23:39] my read of that stack trace is that it's fetching Commons metadata via making an RPC back into mediawiki [18:24:39] I unfortunately did not log into the host soon enough to get a socket dump at the time [18:24:45] RESOLVED: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:25:46] cdanis: agreed, yeah [18:36:20] (03Merged) 10jenkins-bot: Revert "Prevent dark-mode styles from affecting print media" [skins/Vector] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1062284 (https://phabricator.wikimedia.org/T372370) (owner: 10Jdlrobson) [18:37:14] denisse: so, I think we can consider the status page incident closed for now, and we should make some more notes on T370304 [18:37:14] T370304: Exception caught inside exception handler: Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. - https://phabricator.wikimedia.org/T370304 [18:37:48] cdanis: Thanks, I'll close it and add my notes to that task and to the incident document. [18:38:02] thanks! 
[18:40:41] !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:40:46] !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:41:38] !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:41:43] !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:41:58] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [18:42:04] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:43:14] continuing with train [18:43:58] !log jhuneidi@deploy1003 Started scap sync-world: Backport for [[gerrit:1062284|Revert "Prevent dark-mode styles from affecting print media" (T372370)]] [18:44:05] T372370: Regression: Icons not visible in dark mode - https://phabricator.wikimedia.org/T372370 [18:46:20] !log jhuneidi@deploy1003 jdlrobson, jhuneidi: Backport for [[gerrit:1062284|Revert "Prevent dark-mode styles from affecting print media" (T372370)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:47:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 20.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:50:31] !log jhuneidi@deploy1003 jdlrobson, jhuneidi: Continuing with sync [18:52:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 20.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All 
- https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:54:56] !log jhuneidi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1062284|Revert "Prevent dark-mode styles from affecting print media" (T372370)]] (duration: 10m 58s) [18:55:05] T372370: Regression: Icons not visible in dark mode - https://phabricator.wikimedia.org/T372370 [18:58:17] (03PS1) 10TrainBranchBot: group0 to 1.43.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062455 (https://phabricator.wikimedia.org/T366963) [18:58:19] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.43.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062455 (https://phabricator.wikimedia.org/T366963) (owner: 10TrainBranchBot) [18:59:01] (03Merged) 10jenkins-bot: group0 to 1.43.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062455 (https://phabricator.wikimedia.org/T366963) (owner: 10TrainBranchBot) [19:03:25] FIRING: SystemdUnitFailed: push_cross_cluster_settings_9600.service on elastic1073:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:04:57] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372336#10062662 (10phaultfinder) [19:05:47] !log jhuneidi@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.43.0-wmf.18 refs T366963 [19:05:51] T366963: 1.43.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T366963 [19:09:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:13:37] (03CR) 10Jeena Huneidi: "Thanks Jdlrobson!" 
[skins/Vector] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1062284 (https://phabricator.wikimedia.org/T372370) (owner: 10Jdlrobson) [19:18:25] RESOLVED: SystemdUnitFailed: push_cross_cluster_settings_9600.service on elastic1073:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:24:07] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:24:31] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:25:10] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:25:15] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:25:19] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:25:28] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:26:13] (03PS1) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) [19:26:54] (03CR) 10CI reject: [V:04-1] prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [19:27:28] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T370754, transfer fresh wdqs-main journal to codfw host) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling neither afterwards [19:27:31] T370754: Import WDQS subgraphs to production nodes - https://phabricator.wikimedia.org/T370754 [19:28:05] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3628/console" [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [19:29:31] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3629/console" [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [19:29:53] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T370754, transfer fresh wdqs-main journal to codfw host) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling neither afterwards [19:32:18] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_eqiad: security update - bking@cumin2002 - T371874 [19:41:02] 10ops-eqiad, 06DC-Ops, 06Machine-Learning-Team: Q#:rack/setup/install X - https://phabricator.wikimedia.org/T372432 (10RobH) 03NEW [19:41:23] 10ops-eqiad, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10062785 (10RobH) [19:43:23] 10ops-eqiad, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10062790 (10RobH) [19:43:59] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:44:04] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:44:33] 10ops-eqiad, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - 
https://phabricator.wikimedia.org/T372432#10062796 (10RobH) a:03klausman @klausman: Would you, or someone on your team, please update the puppet repo for these new h... [19:57:47] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T370754, transfer fresh wdqs-main journal to codfw host) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet w/ force delete existing files, repooling neither afterwards [19:57:50] T370754: Import WDQS subgraphs to production nodes - https://phabricator.wikimedia.org/T370754 [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. [20:02:20] (03PS7) 10Ryan Kemper: wdqs: store metadata about graph split type [cookbooks] - 10https://gerrit.wikimedia.org/r/1053205 (https://phabricator.wikimedia.org/T364077) [20:04:36] (03Abandoned) 10Ryan Kemper: Revert "wdqs: enable throttling only for reqs from the CDN" [puppet] - 10https://gerrit.wikimedia.org/r/1054392 (owner: 10Ryan Kemper) [20:06:18] (03CR) 10Ryan Kemper: [C:03+2] elastic: run puppet in correct place (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/845086 (owner: 10Ryan Kemper) [20:06:55] (03CR) 10JHathaway: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3630/console" [puppet] - 10https://gerrit.wikimedia.org/r/1057967 (owner: 10JHathaway) [20:07:15] (03CR) 10Ryan Kemper: [C:03+2] snapshot: Remove absented cirrus dump job [puppet] - 10https://gerrit.wikimedia.org/r/856655 (https://phabricator.wikimedia.org/T265056) (owner: 10Ebernhardson) [20:09:06] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on es1029 - https://phabricator.wikimedia.org/T372208#10062854 (10VRiley-WMF) I am currently ready 
for this activity today, or tomorrow. I have the replacement HDD ready [20:10:29] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdc) failed on ms-be1058 - https://phabricator.wikimedia.org/T372207#10062860 (10VRiley-WMF) Hey @MatthewVernon Thanks. You are correct. As of this moment, we don't have any spare HDDs for this type of device. If this is planned to be in production l... [20:15:31] (03CR) 10Ryan Kemper: [C:03+2] wdqs: remove old nginx-level bans [puppet] - 10https://gerrit.wikimedia.org/r/976308 (owner: 10Ryan Kemper) [20:16:12] (03CR) 10Bking: [C:03+1] wdqs: remove old nginx-level bans [puppet] - 10https://gerrit.wikimedia.org/r/976308 (owner: 10Ryan Kemper) [20:20:10] (03PS2) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) [20:21:14] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1057967 (owner: 10JHathaway) [20:22:03] !log Update ncmonitor to 1.2.0 via apt1002 [20:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:48] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:25:40] (03PS3) 10JHathaway: WIP: test pcc do not merge [puppet] - 10https://gerrit.wikimedia.org/r/1057967 [20:25:53] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1057967 (owner: 10JHathaway) [20:27:47] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:32:18] (03CR) 10JHathaway: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1057967 (owner: 10JHathaway) [20:42:22] 06SRE-OnFire, 10Incident Tooling: Harden corto systemd service - https://phabricator.wikimedia.org/T372437 (10BCornwall) 03NEW [20:44:59] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1057967 (owner: 10JHathaway) [20:51:26] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T370754, transfer fresh wdqs-main journal to codfw host) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet w/ force delete existing files, repooling neither afterwards [20:51:29] T370754: Import WDQS subgraphs to production nodes - https://phabricator.wikimedia.org/T370754 [20:55:23] (03PS12) 10BCornwall: Create corto deployment/configuration [puppet] - 10https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) [20:56:11] (03CR) 10BCornwall: Create corto deployment/configuration (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: 10BCornwall) [20:56:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:01:01] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: 10BCornwall) [21:01:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:02:57] FIRING: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:04:14] Is there an issue or recent change to the API? My scripts that I use for quick actions on user accounts aren't working right now... [21:04:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 7.767s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:04:25] Oh nevermind - [741b6205-4d70-4e8e-a8ff-67f13dc52577] 2024-08-13 21:03:54: Fatal exception of type "Wikimedia\Rdbms\DBQueryError" [21:04:32] There must be... [21:04:34] :-) [21:05:41] This looks like the same issue we experienced earlier. 
[21:05:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [21:06:15] FIRING: [5x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:06:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:06:45] FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [21:07:02] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [21:07:08] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:08:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:09:25] FIRING: ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-int:4446 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:09:42] !log ebernhardson@deploy1003 helmfile [staging] START 
helmfile.d/services/cirrus-streaming-updater: apply [21:09:50] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:10:29] denisse: same as before... [21:11:15] RESOLVED: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:11:15] FIRING: [4x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:12:47] found the paste this time: https://phabricator.wikimedia.org/P67012 [21:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:13:51] FIRING: [6x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:14:07] yep, db1160 is stuck [21:14:15] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 8.393s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:14:25] FIRING: [2x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:14:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:14:34] I reckon we should update status, what is the impact of this? [21:14:39] is it limited to commons? [21:15:11] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [21:15:15] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:15:25] I'd guess not. successful wiki edits are way below normal [21:15:34] yeah, just noticed the same [21:15:48] swfrench-wmf: good find on the paste - are you gathering that data? [21:15:48] no, this is a full editing outage [21:15:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [21:15:52] I'm gathering a perf [21:16:15] FIRING: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:16:16] the mysqld process is spinning at 100% of one core [21:16:30] FIRING: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:16:34] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [21:16:39] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:16:45] RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [21:16:57] cwhite: am doing, but I may be too
late [21:17:11] cdanis: awesome, thank you! [21:17:12] I have a perf record from the time of the issue [21:17:33] mysqld is now managing to use all of 2 CPUs .... [21:17:57] RESOLVED: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:18:51] RESOLVED: [6x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:19:16] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 8.232s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:19:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:19:25] RESOLVED: [2x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:20:02] denisse: do you want to add this iteration to the previous doc? if so, do you want a hand with the timeline? 
[21:20:11] there are still many more sockets than usual in use on db1160 [21:20:18] I have a bunch of timestamped files in my homedir as well [21:20:26] I'll attempt resolving them to pod IPs [21:20:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [21:20:51] FIRING: [8x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [21:21:15] FIRING: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 17.42% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:21:25] does someone want to update the status page please [21:21:30] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:23:33] cdanis: yes, working on it [21:23:37] i'm a little slow [21:23:50] should I use the text from earlier, or is there more we can say about impact? [21:24:56] text from earlier is fine [21:25:21] ok, done [21:25:42] It looks like it's happening again [21:25:51] RESOLVED: [8x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [21:26:03] seen 4m 39s ago [21:26:19] kamila_: Yes, thank you! 
[21:26:28] ack denisse [21:26:30] RESOLVED: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:27:21] A bit late to the party, anything I can help with? [21:29:04] (03CR) 10LMata: [C:03+1] Create corto deployment/configuration [puppet] - 10https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: 10BCornwall) [21:29:20] eoghan: if you can figure out what is making mysqld so unhappy on db1160 then we all would be very grateful [21:29:39] 😅 [21:29:54] I'll get right on that, right after I finish perfecting world peace, k? [21:29:57] FIRING: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:30:09] I'll have a look around and see if fresh eyes can help. Reading through the old task now [21:30:32] I'D TAKE A LOOK BUT YOU TOOK AWAY MY ACCESS [21:30:41] BECAUSE OF NDA [21:30:43] :-D [21:30:52] thanks domas [21:31:00] YW! [21:31:15] FIRING: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:31:18] No Domas Access? 
[21:31:48] FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [21:32:25] FIRING: SystemdUnitFailed: user@499.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:32:41] the notable thing I'm seeing is a lot of accept4() returning EAGAIN (Resource temporarily unavailable) [21:32:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mobileapps.svc.codfw.wmnet:4102 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:32:53] I took an strace on the db because it's not like there's useful throughput happening anyway :) [21:33:14] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [21:34:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 27.76s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:34:25] FIRING: [4x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:34:37] the accept4 calls would correlate with 
the huge bump in socket utilization [21:34:57] FIRING: [2x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:35:42] FIRING: [6x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:36:15] jhathaway: do you know what it means when futex returns EAGAIN [21:36:15] FIRING: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:36:30] FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:36:36] no not again, I would assume try again? [21:36:49] no not offhand is what i meant [21:36:55] FIRING: [2x] SystemdUnitFailed: user@499.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:37:36] I was able to capture a processlist. `show engine innodb status` hangs [21:37:51] FIRING: [11x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:38:14] that's great re: processlist [21:38:17] cdanis: vague recollection is that it's when wait "loses" the CAS (i.e., the value has changed) [21:38:32] interesting [21:38:42] that sounds right swfrench-wmf from reading the manpage [21:38:43] cwhite: where's the processlist? 
[21:38:45] FIRING: [5x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:38:46] cwhite: awesome, I have a couple as well, spaced a bit apart [21:39:16] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 21.53s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:39:16] Note: on Linux, the symbolic names EAGAIN and EWOULDBLOCK (both of which appear in different parts of the kernel futex code) have the same value. [21:39:19] swfrench-wmf: can I see one of yours? [21:39:20] cdanis: homedir on cumin2002 [21:39:22] thanks [21:39:25] FIRING: [11x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:39:30] ^ from manpage [21:39:36] FIRING: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from wikifeeds_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [21:39:48] cdanis: same :) [21:39:57] FIRING: [2x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:40:07] (cumin2002 home dir that is) [21:40:29] are all these Sleep processes normal? 
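Editorial aside on the EAGAIN discussion above: the following is a minimal sketch, not taken from the incident tooling, of why accept4() returning EAGAIN is unremarkable on its own. It assumes a Linux host and uses a throwaway loopback listener (the address and the function name are illustrative only). A non-blocking listener with no pending client makes accept() fail with EAGAIN, which on Linux has the same value as EWOULDBLOCK, as the manpage note quoted above says. It says nothing about what mysqld itself was doing on db1160.

```python
import errno
import socket


def accept_without_pending_client() -> int:
    """Return the errno raised by accept() on an idle non-blocking listener.

    Hypothetical helper for illustration; returns 0 if a client happened
    to connect, which is not expected on a freshly bound loopback port.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
        listener.listen(1)
        listener.setblocking(False)
        try:
            conn, _ = listener.accept()
        except BlockingIOError as exc:
            # Nothing is queued, so the kernel reports "try again later".
            return exc.errno
        conn.close()
        return 0
    finally:
        listener.close()


if __name__ == "__main__":
    code = accept_without_pending_client()
    print("accept() failed with:", errno.errorcode.get(code, code))
    print("EAGAIN == EWOULDBLOCK:", errno.EAGAIN == errno.EWOULDBLOCK)
```

Seen in an strace, a burst of such returns is consistent with a server polling its listening socket in non-blocking mode; whether that was the case here is not established by the log.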
[21:40:42] RESOLVED: [10x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:41:15] FIRING: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:41:30] RESOLVED: [5x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:41:43] RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [21:41:50] the "it's actually EWOULDBLOCK" thing would explain the sleeping processes... [21:41:55] RESOLVED: [2x] SystemdUnitFailed: user@499.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:42:18] It's resolving. 
🥺 [21:42:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [21:42:51] FIRING: [11x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:42:59] db1160 is actually doing network bandwidth again [21:43:14] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [21:44:03] mysql-reported metrics flowing again [21:44:16] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 6.72s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:44:36] RESOLVED: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from wikifeeds_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [21:44:57] RESOLVED: [2x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:45:10] cdanis: I think that's the lazy collection of old client connections left behind (the Sleep states) [21:46:14] at peak we had more than 7.5K connections open, shouldn't we have about the same number of
lines in the process list, or did we perhaps capture too late? [21:46:15] RESOLVED: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:46:20] why doesn't the mariadb cli client have switchable output formats [21:46:28] heh [21:46:39] jhathaway: 3.6k is still in the neighborhood [21:47:09] true, compared to our normal usage [21:47:25] innodb status query took 6min to run [21:47:28] heh [21:47:32] yikes [21:47:51] RESOLVED: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [21:47:51] RESOLVED: [5x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:47:56] I only caught some of it because buffer [21:49:35] looks like it's back [21:49:48] Should we update the status page? [21:50:47] let's give it a few more minutes [21:50:50] https://i.imgur.com/E6yf51u.png [21:51:32] we're rolling back things on s4? [21:52:58] Not that I'm aware of... [21:53:06] well that's what the perf output makes it look like [21:53:15] is anyone familiar with the mariadb "performance schema" ?
it's apparently already enabled on db1160 [21:54:16] RESOLVED: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 1.173s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:56:20] !log bking@cumin2002 reboot wdqs101[3-5],1018,1020 from DRAC due to unresponsiveness T372442 [21:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:24] T372442: Remediation for unresponsive WDQS hosts wdqs101[3-5],1018,1020 - https://phabricator.wikimedia.org/T372442 [21:56:50] cdanis: That would be transactions that hit the timeout and couldn't be committed, maybe? [21:56:58] hm, maybe. [21:59:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment mw-api-ext.eqiad.canary in mw-api-ext at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [21:59:55] I'm about to give up on digging for now, it's quite late here. cwhite swfrench-wmf can you comment on T370304 with links/paths to any artifacts you collected ? [21:59:56] T370304: Exception caught inside exception handler: Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. - https://phabricator.wikimedia.org/T370304 [22:00:14] cdanis: ack, will do [22:00:21] thanks again for grabbing perf samples [22:00:49] shall we resolve the status page? [22:00:54] I'm out too, I'm not being useful and have an early start tomorrow. Good luck! [22:00:55] +1 [22:01:10] it feels disingenuous, but... [22:02:51] Will do! 
[22:02:53] urandom: I think it'd be worse to leave it open overnight, or even in 'monitoring' [22:03:17] no, I think it should be marked resolved (for now) [22:03:22] yeah, I agree [22:03:35] it is unfortunate that we don't have, well, much of anything of an explanation, but so it goes sometimes [22:03:36] I just meant that, it's not resolved :) [22:03:40] right. [22:04:44] Thanks all for your help! <3 [22:07:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [22:09:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:09:44] RESOLVED: [2x] KubernetesDeploymentUnavailableReplicas: Deployment mw-api-ext.eqiad.canary in mw-api-ext at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [22:57:59] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T372445 (10ecarg) 03NEW [23:01:32] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T372445#10063189 (10ecarg) More context: I'm able to get to this step: ` ecarg@deploy1003:~$ curl https://logs-api.svc.eqiad.wmnet/ { "name" : "logstash1031-production-elk7-eqiad", "c... 
[23:02:36] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T372445#10063190 (10ecarg) [23:20:51] (03PS1) 10Dzahn: firewall: don't throttle untracked connections [puppet] - 10https://gerrit.wikimedia.org/r/1062471 (https://phabricator.wikimedia.org/T365259) [23:38:41] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1062473 [23:38:41] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1062473 (owner: 10TrainBranchBot)