[00:09:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:10:41] <jinxer-wm>	 FIRING: [8x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:11:08] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1062156 (owner: TrainBranchBot)
[00:27:32] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:44:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:56:26] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[00:59:51] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372336#10060266 (phaultfinder)
[01:00:26] <jinxer-wm>	 FIRING: [8x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:08:18] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.43.0-wmf.18 [core] (wmf/1.43.0-wmf.18) - https://gerrit.wikimedia.org/r/1062172 (https://phabricator.wikimedia.org/T366963)
[01:08:19] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.43.0-wmf.18 [core] (wmf/1.43.0-wmf.18) - https://gerrit.wikimedia.org/r/1062172 (https://phabricator.wikimedia.org/T366963) (owner: TrainBranchBot)
[01:37:02] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.43.0-wmf.18 [core] (wmf/1.43.0-wmf.18) - https://gerrit.wikimedia.org/r/1062172 (https://phabricator.wikimedia.org/T366963) (owner: TrainBranchBot)
[02:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T0200)
[02:09:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:39:25] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:50:26] <jinxer-wm>	 FIRING: [7x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:59:25] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:04] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T0300)
[03:01:22] <wikibugs>	 (PS1) TrainBranchBot: testwikis to 1.43.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/1062213 (https://phabricator.wikimedia.org/T366963)
[03:01:23] <wikibugs>	 (CR) TrainBranchBot: [C:+2] testwikis to 1.43.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/1062213 (https://phabricator.wikimedia.org/T366963) (owner: TrainBranchBot)
[03:02:04] <wikibugs>	 (Merged) jenkins-bot: testwikis to 1.43.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/1062213 (https://phabricator.wikimedia.org/T366963) (owner: TrainBranchBot)
[03:02:25] <logmsgbot>	 !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.18  refs T366963
[03:02:28] <stashbot>	 T366963: 1.43.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T366963
[03:25:15] <jinxer-wm>	 FIRING: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[03:30:15] <jinxer-wm>	 RESOLVED: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[03:50:52] <logmsgbot>	 !log mwpresync@deploy1003 Finished scap: testwikis to 1.43.0-wmf.18  refs T366963 (duration: 48m 26s)
[03:50:55] <stashbot>	 T366963: 1.43.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T366963
[03:53:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:00:04] <jouncebot>	 Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T0400)
[04:00:56] <logmsgbot>	 !log mwpresync@deploy1003 Pruned MediaWiki: 1.43.0-wmf.15 (duration: 00m 56s)
[04:27:47] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:33:48] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:44:41] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:55:26] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:56:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[05:24:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T0600)
[06:00:04] <jouncebot>	 marostegui, Amir1, and arnaudb: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T0600).
[06:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:05:26] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:31:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on mw1463:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1463 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[06:48:04] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10060428 (Marostegui) @VRiley-WMF the issue was the disk? There was nothing related to disks on the error log - I am curious to know how was the problem identified.
[06:48:44] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10060432 (Marostegui) The host also needs to be repooled.
[06:49:39] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10060429 (Marostegui) Resolved→Open Reopening only to keep track that we are waiting for an answer on this.
[06:50:26] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:56:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on mw1463:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1463 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:08:26] <wikibugs>	 (CR) Filippo Giunchedi: "Will need more testing" [puppet] - https://gerrit.wikimedia.org/r/1060388 (https://phabricator.wikimedia.org/T368513) (owner: Ayounsi)
[07:10:19] <wikibugs>	 (CR) Filippo Giunchedi: "I don't think I have enough context for review" [puppet] - https://gerrit.wikimedia.org/r/1059042 (owner: Ayounsi)
[07:12:41] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1 master: es1027', diff saved to https://phabricator.wikimedia.org/P67282 and previous config saved to /var/cache/conftool/dbconfig/20240813-071240-arnaudb.json
[07:13:40] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on es1029 - https://phabricator.wikimedia.org/T372208#10060442 (ABran-WMF) @VRiley-WMF please let me know when you're ready, I'll depool the node then
[07:13:40] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to ldap/wmf for divec - https://phabricator.wikimedia.org/T372369 (dchan) NEW
[07:14:03] <wikibugs>	 (CR) Ayounsi: [C:+2] service::uwsgi: add $ensure variable for clean removal [puppet] - https://gerrit.wikimedia.org/r/1060773 (owner: Ayounsi)
[07:20:33] <wikibugs>	 (CR) Ayounsi: [C:+2] Netbox script proxy: set to absent [puppet] - https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052) (owner: Ayounsi)
[07:24:45] <wikibugs>	 (PS1) Jdlrobson: Revert "Prevent dark-mode styles from affecting print media" [skins/Vector] (wmf/1.43.0-wmf.18) - https://gerrit.wikimedia.org/r/1062284
[07:29:03] <wikibugs>	 (PS2) Jdlrobson: Revert "Prevent dark-mode styles from affecting print media" [skins/Vector] (wmf/1.43.0-wmf.18) - https://gerrit.wikimedia.org/r/1062284 (https://phabricator.wikimedia.org/T372370)
[07:32:24] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1060075 (https://phabricator.wikimedia.org/T311052) (owner: Ayounsi)
[07:32:40] <wikibugs>	 (PS8) Ayounsi: Remove profile::netbox::scripts from Puppet [puppet] - https://gerrit.wikimedia.org/r/1060075 (https://phabricator.wikimedia.org/T311052)
[07:39:59] <wikibugs>	 (CR) Ayounsi: [C:+2] Add an-redacteddb to list of hosts that do not get IPv6 records [software/netbox-extras] - https://gerrit.wikimedia.org/r/1056892 (https://phabricator.wikimedia.org/T365453) (owner: Cathal Mooney)
[07:42:40] <wikibugs>	 (Merged) jenkins-bot: Add an-redacteddb to list of hosts that do not get IPv6 records [software/netbox-extras] - https://gerrit.wikimedia.org/r/1056892 (https://phabricator.wikimedia.org/T365453) (owner: Cathal Mooney)
[07:43:32] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: index corruption
[07:43:45] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: index corruption
[07:47:23] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[07:47:36] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[07:53:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:53:53] <wikibugs>	 (CR) Ayounsi: "Adding myself as CC on this so I don't forget to remove scandium from https://github.com/wikimedia/operations-software-netbox-extras/blob/" [puppet] - https://gerrit.wikimedia.org/r/1024402 (https://phabricator.wikimedia.org/T363402) (owner: Alexandros Kosiaris)
[07:56:08] <wikibugs>	 (CR) Fabfur: [C:+1] "lgtm!" [dns] - https://gerrit.wikimedia.org/r/1062155 (https://phabricator.wikimedia.org/T371630) (owner: Dwisehaupt)
[08:00:18] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[08:06:18] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: enwiki-articlequality increase asyncio workers [deployment-charts] - https://gerrit.wikimedia.org/r/1062342
[08:09:55] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#10060574 (ayounsi) Resolved→Open https://netbox.wikimedia.org/extras/scripts/results/78992/ `cloudcephosd1039 (WMF11571)  /dcim/devices/5296/  Pr...
[08:11:44] <wikibugs>	 (CR) Marostegui: [C:+1] backups: adds backup2012 [puppet] - https://gerrit.wikimedia.org/r/1061961 (https://phabricator.wikimedia.org/T371984) (owner: Arnaudb)
[08:15:06] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/1061973 (https://phabricator.wikimedia.org/T366078) (owner: AOkoth)
[08:18:05] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[08:19:49] <XioNoX>	 !log upgrade postgresql on netboxdb hosts
[08:19:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:26] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:21:43] <wikibugs>	 (PS1) Kevin Bazira: ml-services: prod config for modernized rec-api [deployment-charts] - https://gerrit.wikimedia.org/r/1062348 (https://phabricator.wikimedia.org/T371465)
[08:22:21] <wikibugs>	 (CR) AOkoth: [C:+2] vtrs: add confirmation prompt [cookbooks] - https://gerrit.wikimedia.org/r/1061973 (https://phabricator.wikimedia.org/T366078) (owner: AOkoth)
[08:25:38] <wikibugs>	 (CR) Kevin Bazira: [C:+1] ml-services: enwiki-articlequality increase asyncio workers [deployment-charts] - https://gerrit.wikimedia.org/r/1062342 (owner: Ilias Sarantopoulos)
[08:25:47] <wikibugs>	 (CR) AOkoth: [C:+2] Revert "vrts: change root mail alias" [puppet] - https://gerrit.wikimedia.org/r/1061956 (owner: AOkoth)
[08:26:00] <wikibugs>	 (Abandoned) Ayounsi: Prometheus SSH probe: ignore network devices - try 2 [puppet] - https://gerrit.wikimedia.org/r/1060388 (https://phabricator.wikimedia.org/T368513) (owner: Ayounsi)
[08:26:57] <wikibugs>	 (PS1) Arnaudb: mariadb: swap es1 master from es1029 to es1027 [dns] - https://gerrit.wikimedia.org/r/1062349
[08:27:47] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:27:51] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+2] ml-services: enwiki-articlequality increase asyncio workers [deployment-charts] - https://gerrit.wikimedia.org/r/1062342 (owner: Ilias Sarantopoulos)
[08:30:26] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:33:25] <wikibugs>	 (Merged) jenkins-bot: ml-services: enwiki-articlequality increase asyncio workers [deployment-charts] - https://gerrit.wikimedia.org/r/1062342 (owner: Ilias Sarantopoulos)
[08:34:13] <wikibugs>	 (CR) Marostegui: [C:+1] mariadb: swap es1 master from es1029 to es1027 [dns] - https://gerrit.wikimedia.org/r/1062349 (owner: Arnaudb)
[08:35:06] <wikibugs>	 (CR) Arnaudb: [C:+2] mariadb: swap es1 master from es1029 to es1027 [dns] - https://gerrit.wikimedia.org/r/1062349 (owner: Arnaudb)
[08:35:28] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Install (2) 960GB SSDs each in kafka-main10[06-10] - https://phabricator.wikimedia.org/T371422#10060634 (JMeybohm) Correct. Anything that is at least as big as the ~900G of the four currently installed SSDs will be fine. Thanks!
[08:36:15] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2002.codfw.wmnet
[08:38:36] <wikibugs>	 (CR) JMeybohm: [C:+1] "Thanks!" [alerts] - https://gerrit.wikimedia.org/r/1060761 (owner: Filippo Giunchedi)
[08:42:50] <wikibugs>	 (PS2) Stevemunene: dns: provision airflow-test-k8s temp domain [dns] - https://gerrit.wikimedia.org/r/1062048 (https://phabricator.wikimedia.org/T368760)
[08:43:01] <wikibugs>	 (CR) Klausman: [C:+1] ml-services: prod config for modernized rec-api [deployment-charts] - https://gerrit.wikimedia.org/r/1062348 (https://phabricator.wikimedia.org/T371465) (owner: Kevin Bazira)
[08:43:34] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: ms-be1078 has no connectivity - https://phabricator.wikimedia.org/T372289#10060653 (MatthewVernon)
[08:43:51] <wikibugs>	 (CR) Stevemunene: dns: provision airflow-test-k8s temp domain (1 comment) [dns] - https://gerrit.wikimedia.org/r/1062048 (https://phabricator.wikimedia.org/T368760) (owner: Stevemunene)
[08:46:08] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Degraded RAID on ms-be1058 - https://phabricator.wikimedia.org/T372207#10060683 (MatthewVernon)
[08:46:16] <wikibugs>	 (CR) JMeybohm: [C:+2] Add reuse-raid10-6dev profile to be used by new kafka-main nodes [puppet] - https://gerrit.wikimedia.org/r/1062033 (https://phabricator.wikimedia.org/T371423) (owner: JMeybohm)
[08:48:04] <logmsgbot>	 !log isaranto@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[08:48:48] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:49:37] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host thanos-be2002.codfw.wmnet
[08:51:43] <logmsgbot>	 !log isaranto@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[08:52:34] <elukey>	 !log upgrade conftool python packages on puppetserver1001 to 3.2.2
[08:52:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:41] <wikibugs>	 (CR) Kevin Bazira: [C:+2] ml-services: prod config for modernized rec-api [deployment-charts] - https://gerrit.wikimedia.org/r/1062348 (https://phabricator.wikimedia.org/T371465) (owner: Kevin Bazira)
[08:54:45] <wikibugs>	 (Merged) jenkins-bot: ml-services: prod config for modernized rec-api [deployment-charts] - https://gerrit.wikimedia.org/r/1062348 (https://phabricator.wikimedia.org/T371465) (owner: Kevin Bazira)
[08:56:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[08:59:02] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[09:00:26] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: systemd-timedated.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:01:18] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[09:02:51] <wikibugs>	 (CR) Btullis: [C:+1] Temporarily disable gobblin timers to upgrade Airflow [puppet] - https://gerrit.wikimedia.org/r/1062031 (https://phabricator.wikimedia.org/T365449) (owner: Stevemunene)
[09:03:07] <wikibugs>	 (CR) Btullis: [C:+1] Upgrade airflow analytics instance version to v2.9.3 [puppet] - https://gerrit.wikimedia.org/r/1062023 (https://phabricator.wikimedia.org/T365449) (owner: Stevemunene)
[09:03:14] <wikibugs>	 (CR) Btullis: [C:+1] Upgrade airflow search instance version to v2.9.3 [puppet] - https://gerrit.wikimedia.org/r/1062022 (https://phabricator.wikimedia.org/T365449) (owner: Stevemunene)
[09:03:41] <wikibugs>	 (CR) Btullis: [C:+1] Upgrade airflow analytics_product instance version to v2.9.3 [puppet] - https://gerrit.wikimedia.org/r/1062021 (https://phabricator.wikimedia.org/T365449) (owner: Stevemunene)
[09:03:44] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2010.codfw.wmnet with OS bullseye
[09:03:56] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops, Patch-For-Review: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10060810 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2010.codfw.wmnet with OS bu...
[09:04:34] <wikibugs>	 (PS1) Ayounsi: Update wheels for pynetbox and paramiko updates [software/homer/deploy] - https://gerrit.wikimedia.org/r/1062354 (https://phabricator.wikimedia.org/T371890)
[09:08:42] <wikibugs>	 (CR) Btullis: [C:+1] Upgrade airflow platform_eng instance version to v2.9.3 [puppet] - https://gerrit.wikimedia.org/r/1062020 (https://phabricator.wikimedia.org/T365449) (owner: Stevemunene)
[09:09:02] <wikibugs>	 (CR) Btullis: [C:+1] Upgrade airflow research instance version to v2.9.3 [puppet] - https://gerrit.wikimedia.org/r/1062019 (https://phabricator.wikimedia.org/T365449) (owner: Stevemunene)
[09:09:09] <wikibugs>	 (CR) Btullis: [C:+1] Upgrade airflow wmde instance version to v2.9.3 [puppet] - https://gerrit.wikimedia.org/r/1062018 (https://phabricator.wikimedia.org/T365449) (owner: Stevemunene)
[09:10:26] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:12:32] <wikibugs>	 (PS1) MVernon: swift: mark ms-be1058 / sdc1 failed [puppet] - https://gerrit.wikimedia.org/r/1062355 (https://phabricator.wikimedia.org/T372207)
[09:12:41] <wikibugs>	 (PS1) Ayounsi: Update wheels to pickup new pynetbox version [software/netbox-deploy] - https://gerrit.wikimedia.org/r/1062356 (https://phabricator.wikimedia.org/T371890)
[09:16:47] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops, Patch-For-Review: Disk (sdc) failed on ms-be1058 - https://phabricator.wikimedia.org/T372207#10060823 (MatthewVernon)
[09:18:11] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops, Patch-For-Review: Disk (sdc) failed on ms-be1058 - https://phabricator.wikimedia.org/T372207#10060832 (MatthewVernon) @VRiley-WMF can you confirm my understanding of the state of (lack of) spare drives is correct, please?
[09:19:24] <wikibugs>	 (CR) Marostegui: [C:+1] "Checked that sdc is the broken one" [puppet] - https://gerrit.wikimedia.org/r/1062355 (https://phabricator.wikimedia.org/T372207) (owner: MVernon)
[09:20:01] <wikibugs>	 (CR) Marostegui: [C:+1] dbproxy: mirrors hieradata [puppet] - https://gerrit.wikimedia.org/r/1055428 (https://phabricator.wikimedia.org/T368874) (owner: Arnaudb)
[09:22:09] <wikibugs>	 (PS1) Ayounsi: check_netbox_report.py: use venv's python [puppet] - https://gerrit.wikimedia.org/r/1062358 (https://phabricator.wikimedia.org/T371890)
[09:23:51] <elukey>	 !log manual run of dump_cloud_ip_ranges.service on puppetserver1001 (failed earlier on)
[09:23:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:41] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:24:52] <wikibugs>	 (CR) Arnaudb: [C:+2] dbproxy: mirrors hieradata [puppet] - https://gerrit.wikimedia.org/r/1055428 (https://phabricator.wikimedia.org/T368874) (owner: Arnaudb)
[09:25:01] <wikibugs>	 (CR) MVernon: [C:+2] swift: mark ms-be1058 / sdc1 failed [puppet] - https://gerrit.wikimedia.org/r/1062355 (https://phabricator.wikimedia.org/T372207) (owner: MVernon)
[09:25:23] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] "Sure np, thank you for the review!" [alerts] - https://gerrit.wikimedia.org/r/1060761 (owner: Filippo Giunchedi)
[09:28:07] <wikibugs>	 (PS1) Marostegui: installserver: Do not reimge db2235 [puppet] - https://gerrit.wikimedia.org/r/1062360
[09:29:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:29:44] <wikibugs>	 (CR) JMeybohm: Merge upstream v0.4.0 commit 'a15c162' into v0.4.0 (4 comments) [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/1060843 (https://phabricator.wikimedia.org/T337928) (owner: JMeybohm)
[09:30:13] <wikibugs>	 (PS2) JMeybohm: Merge upstream v0.4.0 commit 'a15c162' into v0.4.0 [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/1060843 (https://phabricator.wikimedia.org/T337928)
[09:30:13] <wikibugs>	 (PS2) JMeybohm: Update simple-cfssl to use wmf packages [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/1060844 (https://phabricator.wikimedia.org/T337928)
[09:33:21] <wikibugs>	 (CR) Marostegui: [C:+2] installserver: Do not reimge db2235 [puppet] - https://gerrit.wikimedia.org/r/1062360 (owner: Marostegui)
[09:39:21] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#10060929 (dcaro) I got this when trying to set the fqdn (checked others that have the fqdn set on the ipv6, and they don't have the role set, maybe a new...
[09:40:27] <logmsgbot>	 !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Reimaging clouddb1016 T365424
[09:40:29] <stashbot>	 T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424
[09:40:40] <logmsgbot>	 !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Reimaging clouddb1016 T365424
[09:41:17] <logmsgbot>	 !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet,service=s5
[09:41:20] <logmsgbot>	 !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet,service=s8
[09:46:54] <logmsgbot>	 !log fnegri@cumin1002 START - Cookbook sre.hosts.reimage for host clouddb1016.eqiad.wmnet with OS bookworm
[09:52:50] <wikibugs>	 (03PS1) 10Ayounsi: ipaddress validator: rename device_role to role [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1062365
[09:54:54] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2010.codfw.wmnet with OS bullseye
[09:55:00] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10060991 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2010.codfw.wmnet with OS bullseye executed with error...
[09:55:36] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] ipaddress validator: rename device_role to role [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1062365 (owner: 10Ayounsi)
[09:57:20] <wikibugs>	 (03Merged) 10jenkins-bot: ipaddress validator: rename device_role to role [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1062365 (owner: 10Ayounsi)
[09:58:14] <wikibugs>	 (03CR) 10Wargo: [C:03+1] [sysop_plwiki] Change the logo/icon and the favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051757 (https://phabricator.wikimedia.org/T368712) (owner: 10Superpes15)
[09:58:35] <wikibugs>	 (03PS3) 10JMeybohm: Merge upstream v0.4.0 commit 'a15c162' into v0.4.0 [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/1060843 (https://phabricator.wikimedia.org/T337928)
[09:58:35] <wikibugs>	 (03PS3) 10JMeybohm: Update simple-cfssl to use wmf packages [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/1060844 (https://phabricator.wikimedia.org/T337928)
[09:59:09] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[09:59:27] <logmsgbot>	 !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb1016.eqiad.wmnet with reason: host reimage
[09:59:41] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T1000)
[10:00:11] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[10:01:46] <wikibugs>	 (03PS4) 10JMeybohm: Merge upstream v0.4.0 commit 'a15c162' into v0.4.0 [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/1060843 (https://phabricator.wikimedia.org/T337928)
[10:01:47] <wikibugs>	 (03PS4) 10JMeybohm: Update simple-cfssl to use wmf packages [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/1060844 (https://phabricator.wikimedia.org/T337928)
[10:02:07] <logmsgbot>	 !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1016.eqiad.wmnet with reason: host reimage
[10:02:50] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] Temporarily disable gobblin timers to upgrade Airflow [puppet] - 10https://gerrit.wikimedia.org/r/1062031 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene)
[10:05:38] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-airflow1007.eqiad.wmnet
[10:05:51] <logmsgbot>	 !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host an-airflow1007.eqiad.wmnet
[10:06:49] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[10:07:23] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-airflow1007.eqiad.wmnet
[10:09:10] <wikibugs>	 (03PS2) 10Klausman: api-gw/liftwing: add missing trailing `/` to path trim [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062368 (https://phabricator.wikimedia.org/T371465)
[10:09:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:10:56] <logmsgbot>	 !log dcaro@cumin1002 START - Cookbook sre.dns.netbox
[10:11:36] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-airflow1007.eqiad.wmnet
[10:12:15] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] Upgrade airflow wmde instance version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062018 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene)
[10:13:28] <logmsgbot>	 !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1016.eqiad.wmnet,service=s5
[10:13:31] <logmsgbot>	 !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1016.eqiad.wmnet,service=s8
[10:15:51] <logmsgbot>	 !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet,service=s8
[10:16:52] <wikibugs>	 (03PS5) 10JMeybohm: Merge upstream v0.4.0 commit 'a15c162' into v0.4.0 [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/1060843 (https://phabricator.wikimedia.org/T337928)
[10:16:52] <wikibugs>	 (03PS5) 10JMeybohm: Update simple-cfssl to use wmf packages [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/1060844 (https://phabricator.wikimedia.org/T337928)
[10:18:13] <wikibugs>	 (03CR) 10JMeybohm: "Sorry for the back and forth @swfrench@wikimedia.org - I messed up the initial version of this as it would not cleanly merge into main. I " [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/1060843 (https://phabricator.wikimedia.org/T337928) (owner: 10JMeybohm)
[10:26:15] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "I apologize for the late review, LGTM! Maybe merge early next week due to (most of) europe short week this week" [puppet] - 10https://gerrit.wikimedia.org/r/1055213 (https://phabricator.wikimedia.org/T369607) (owner: 10JMeybohm)
[10:26:54] <logmsgbot>	 !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb1016.eqiad.wmnet with OS bookworm
[10:27:12] <logmsgbot>	 !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=s5
[10:27:17] <logmsgbot>	 !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=s8
[10:29:11] <wikibugs>	 (CR) Filippo Giunchedi: Prometheus: Add recording rules computing commonly used envoy histograms (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1055432 (https://phabricator.wikimedia.org/T369607) (owner: JMeybohm)
[10:32:42] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-airflow1002.eqiad.wmnet
[10:36:08] <wikibugs>	 (03PS1) 10JMeybohm: Don't reuse partitions for initial reimage of new kafka nodes [puppet] - 10https://gerrit.wikimedia.org/r/1062370 (https://phabricator.wikimedia.org/T371423)
[10:38:38] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-airflow1002.eqiad.wmnet
[10:38:45] <logmsgbot>	 !log dcaro@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Added ipv6 entry for cloudcephosd1039 - dcaro@cumin1002"
[10:38:49] <logmsgbot>	 !log dcaro@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Added ipv6 entry for cloudcephosd1039 - dcaro@cumin1002"
[10:38:49] <logmsgbot>	 !log dcaro@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:38:55] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Don't reuse partitions for initial reimage of new kafka nodes [puppet] - 10https://gerrit.wikimedia.org/r/1062370 (https://phabricator.wikimedia.org/T371423) (owner: 10JMeybohm)
[10:39:15] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] Upgrade airflow research instance version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062019 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene)
[10:49:11] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-airflow1004.eqiad.wmnet
[10:53:04] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-airflow1004.eqiad.wmnet
[10:53:27] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] Upgrade airflow platform_eng instance version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062020 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene)
[10:57:33] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-airflow1006.eqiad.wmnet
[10:57:46] <Lucas_WMDE>	 question for the backport window later, especially if rzl is around: is there currently a recommended way to run a maintenance script with mwscript-k8s and dump the output in a file?
[10:58:30] <Lucas_WMDE>	 options I can think of: a) --attach > outfile; b) kubectl [as printed by mwscript-k8s] > outfile; c) don’t use mwscript-k8s for this yet ^^
[11:01:57] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-airflow1006.eqiad.wmnet
[11:03:22] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] Upgrade airflow analytics_product instance version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062021 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene)
[11:06:08] <Lucas_WMDE>	 (if I don’t hear from anyone I’ll probably go with option c ^^)
[11:07:58] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-airflow1005.eqiad.wmnet
[11:11:52] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-airflow1005.eqiad.wmnet
[11:17:58] <XioNoX>	 !log deploy pfw policy update 1723510554 - T372367
[11:17:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:56] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] Upgrade airflow search instance version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062022 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene)
[11:28:22] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2010.codfw.wmnet with OS bullseye
[11:28:38] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10061225 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2010.codfw.wmnet with OS bu...
[11:29:08] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-launcher1002.eqiad.wmnet
[11:35:33] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-launcher1002.eqiad.wmnet
[11:37:10] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] Upgrade airflow analytics instance version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062023 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene)
[11:38:57] <wikibugs>	 (03PS2) 10Stevemunene: Upgrade the default airflow version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062024 (https://phabricator.wikimedia.org/T365449)
[11:40:49] <wikibugs>	 (03CR) 10Jdlrobson: [C:03+1] "Hi Jeena: this needs to be merged before we roll out the train." [skins/Vector] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1062284 (https://phabricator.wikimedia.org/T372370) (owner: 10Jdlrobson)
[11:42:15] <wikibugs>	 (03PS5) 10Arnaudb: mariadb: observability - adds shard information on recording rule [puppet] - 10https://gerrit.wikimedia.org/r/1054884 (https://phabricator.wikimedia.org/T367283)
[11:42:44] <wikibugs>	 (03PS6) 10Arnaudb: mariadb: observability - adds shard information on recording rule [puppet] - 10https://gerrit.wikimedia.org/r/1054884 (https://phabricator.wikimedia.org/T367283)
[11:43:01] <wikibugs>	 (CR) Arnaudb: mariadb: observability - adds shard information on recording rule (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1054884 (https://phabricator.wikimedia.org/T367283) (owner: Arnaudb)
[11:43:19] <wikibugs>	 (03PS3) 10Stevemunene: Upgrade the default airflow version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062024 (https://phabricator.wikimedia.org/T365449)
[11:43:23] <wikibugs>	 (03CR) 10Stevemunene: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1062024 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene)
[11:45:26] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:46:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Upgrade the default airflow version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062024 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene)
[11:48:59] <wikibugs>	 (03PS4) 10Stevemunene: Upgrade the default airflow version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062024 (https://phabricator.wikimedia.org/T365449)
[11:51:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#10061264 (dcaro) Open→Resolved Done :)
[11:52:01] <wikibugs>	 (03CR) 10Stevemunene: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1062024 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene)
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T1200)
[12:06:17] <wikibugs>	 (03CR) 10Brouberol: "Actually, I think we should drop this and consider moving the deployment of PG to the airflow helmfile instead. The reason for this is men" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062030 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol)
[12:06:25] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Change s3 candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1062379 (https://phabricator.wikimedia.org/T371361)
[12:07:41] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Change s3 candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1062379 (https://phabricator.wikimedia.org/T371361) (owner: 10Marostegui)
[12:11:07] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1189 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1062381 (https://phabricator.wikimedia.org/T372393)
[12:11:12] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1062382 (https://phabricator.wikimedia.org/T372393)
[12:13:11] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10061410 (10VRiley-WMF) My apologies! Disregard the drive replaced comment as I meant that for a different ticket. I will be updating the firmware on this device to see if this resolves...
[12:13:49] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10061411 (10Marostegui) @VRiley-WMF let us know when you'd like to do this, as we need to switch of MySQL
[12:14:22] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10061414 (10Marostegui)
[12:16:08] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] api-gw/liftwing: add missing trailing `/` to path trim [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062368 (https://phabricator.wikimedia.org/T371465) (owner: 10Klausman)
[12:18:36] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2010.codfw.wmnet with OS bullseye
[12:18:44] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10061430 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2010.codfw.wmnet with OS bullseye executed with error...
[12:19:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:20:38] <wikibugs>	 (03PS1) 10Stevemunene: Revert "Temporarily disable gobblin timers to upgrade Airflow" [puppet] - 10https://gerrit.wikimedia.org/r/1062387
[12:23:08] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] git-sync-upstream: use sudo for puppetserver-deploy-code [puppet] - 10https://gerrit.wikimedia.org/r/1062067 (owner: 10Andrew Bogott)
[12:23:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:27:47] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:29:50] <wikibugs>	 (03Abandoned) 10Brouberol: airflow: add conditional dependency to cloudnative-pg-cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062030 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol)
[12:30:52] <wikibugs>	 (CR) Ssingh: sre.dns.admin: add cookbook for GeoDNS pool/depool (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1060914 (https://phabricator.wikimedia.org/T369366) (owner: Ssingh)
[12:31:10] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] Revert "Temporarily disable gobblin timers to upgrade Airflow" [puppet] - 10https://gerrit.wikimedia.org/r/1062387 (owner: 10Stevemunene)
[12:36:50] <wikibugs>	 (03PS5) 10Stevemunene: Upgrade the default airflow version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062024 (https://phabricator.wikimedia.org/T365449)
[12:37:46] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2009.codfw.wmnet with OS bullseye
[12:37:58] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10061457 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2009.codfw.wmnet with OS bullseye
[12:40:05] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10061463 (10JMeybohm) @Jhancock.wm could you please check kafka-main2010 again? After trying to re-image I now only see 5 disks in iDRAC.
[12:40:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:41:12] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] mediawiki: bump limit/request for statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061856 (https://phabricator.wikimedia.org/T371885) (owner: 10Filippo Giunchedi)
[12:42:47] <godog>	 jouncebot: now and next
[12:42:47] <jouncebot>	 For the next 0 hour(s) and 17 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T1200)
[12:43:46] <logmsgbot>	 !log filippo@deploy1003 Started scap sync-world: new statsd-exporter limits
[12:46:42] <wikibugs>	 (CR) Elukey: [C:+1] sre.dns.admin: add cookbook for GeoDNS pool/depool (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1060914 (https://phabricator.wikimedia.org/T369366) (owner: Ssingh)
[12:47:06] <logmsgbot>	 !log filippo@deploy1003 Finished scap: new statsd-exporter limits (duration: 03m 52s)
[12:48:14] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Update wheels for pynetbox and paramiko updates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1062354 (https://phabricator.wikimedia.org/T371890) (owner: 10Ayounsi)
[12:49:11] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "Fine to me, have you ran the script manually to verify that it works?" [puppet] - 10https://gerrit.wikimedia.org/r/1062358 (https://phabricator.wikimedia.org/T371890) (owner: 10Ayounsi)
[12:53:38] <wikibugs>	 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544#10061502 (10cmooney)
[12:56:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[12:57:19] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: add oauth2-proxy for OIDC authentication [puppet] - 10https://gerrit.wikimedia.org/r/1062393 (https://phabricator.wikimedia.org/T326657)
[12:57:44] <wikibugs>	 (03CR) 10CI reject: [V:04-1] prometheus: add oauth2-proxy for OIDC authentication [puppet] - 10https://gerrit.wikimedia.org/r/1062393 (https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi)
[12:58:08] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#10061531 (elukey) Open→Resolved
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T1300).
[13:00:05] <jouncebot>	 MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:08] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: add oauth2-proxy for OIDC authentication [puppet] - 10https://gerrit.wikimedia.org/r/1062393 (https://phabricator.wikimedia.org/T326657)
[13:00:18] <MatmaRex>	 hi
[13:00:34] <MatmaRex>	 anyone wants to run a maintenance script for me? :) it should take a few minutes, no more than an hour
[13:00:46] <Lucas_WMDE>	 o/
[13:00:48] <Lucas_WMDE>	 I can :)
[13:01:46] <Lucas_WMDE>	 !log START lucaswerkmeister-wmde@mwmaint1002:~$ mwscript maintenance/cleanupTitles.php --wiki=hewikisource --prefix=T314733 2>&1 | tee ~/T314733.log
[13:01:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:49] <stashbot>	 T314733: Cleanup leftover pages in deleted namespaces on hewikisource - https://phabricator.wikimedia.org/T314733
[13:02:20] <Lucas_WMDE>	 that sure is printing a lot of output :CatJam:
[13:02:55] <wikibugs>	 (03CR) 10CI reject: [V:04-1] prometheus: add oauth2-proxy for OIDC authentication [puppet] - 10https://gerrit.wikimedia.org/r/1062393 (https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi)
[13:03:48] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:04:07] <Lucas_WMDE>	 > all 42,960 of them
[13:04:08] <Lucas_WMDE>	 I see
[13:04:19] <Lucas_WMDE>	 (also, “we were on the verge of greatness, we were this close” yadda yadda)
[13:04:24] <MatmaRex>	 heh
[13:04:58] <Lucas_WMDE>	 !log FINISHED lucaswerkmeister-wmde@mwmaint1002:~$ mwscript maintenance/cleanupTitles.php --wiki=hewikisource --prefix=T314733 2>&1 | tee ~/T314733.log
[13:05:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:17] <elukey>	 !log `apt-get install python3-conftool python3-conftool-requestctl` on all puppetserver nodes - upgrade to 3.2.2
[13:05:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:50] <Lucas_WMDE>	 MatmaRex: just to double-check, you want 102k lines of output? ^^
[13:05:52] <Lucas_WMDE>	 (I’ll put them in a paste)
[13:06:42] <MatmaRex>	 Lucas_WMDE: sure. i guess i should have suggested the script option that stops it from printing progress bars
[13:06:59] <Lucas_WMDE>	 eh, the progress bars were nice when they weren’t buried between all the other output ^^
[13:07:10] <MatmaRex>	 but i want to have a record of the page titles it changed, since they're not logged otherwise
[13:07:16] <Lucas_WMDE>	 > File size is too large. See https://www.mediawiki.org/wiki/Phabricator/Help#Uploading_file_attachments
[13:07:18] <Lucas_WMDE>	 boo
[13:07:23] <wikibugs>	 (03PS1) 10Jelto: sre.gitlab.upgrade:  also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564)
[13:07:35] <MatmaRex>	 gzip should do it
[13:07:41] * Lucas_WMDE checks if the same applies when uploading as a file instead of paste
[13:07:59] <cdanis>	 files can take arbitrary sizes
[13:08:03] <Lucas_WMDE>	 yup, still too large
[13:08:05] <Lucas_WMDE>	 let’s gzip it then
[13:08:17] <cdanis>	 well, it can take many megabytes at least 😅
[13:09:15] <Lucas_WMDE>	 well, less than ~11M apparently
[13:09:37] <Lucas_WMDE>	 https://phabricator.wikimedia.org/T314733#10061560
[13:09:42] <MatmaRex>	 phab file size limit is 4 MB
[13:09:54] <MatmaRex>	 thanks Lucas_WMDE
[13:11:00] <Lucas_WMDE>	 np
[13:11:16] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] P:conftool: add schema for geodns [puppet] - 10https://gerrit.wikimedia.org/r/1053323 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh)
[13:11:47] <Lucas_WMDE>	 anything else to deploy?
[13:12:23] <wikibugs>	 (03PS1) 10Btullis: Increase analytics postgres max_connections from 100 to 200 [puppet] - 10https://gerrit.wikimedia.org/r/1062395 (https://phabricator.wikimedia.org/T365449)
[13:12:38] <Lucas_WMDE>	 wondering if I have anything to backport but I can’t think of something right now
[13:12:48] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Increase analytics postgres max_connections from 100 to 200 [puppet] - 10https://gerrit.wikimedia.org/r/1062395 (https://phabricator.wikimedia.org/T365449) (owner: 10Btullis)
[13:19:05] <Lucas_WMDE>	 Jdlrobson: how about we backport https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/1062284 now?
[13:19:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:19:49] <wikibugs>	 (03PS1) 10Brouberol: airflow: deploy postgresql cluster before airflow itself [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062397 (https://phabricator.wikimedia.org/T372286)
[13:19:51] <wikibugs>	 (03PS1) 10Brouberol: airflow: fetch PG connection URI from the cloudnative PG cluster secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062398 (https://phabricator.wikimedia.org/T372286)
[13:20:34] <wikibugs>	 (03PS2) 10Btullis: Increase analytics postgres max_connections from 100 to 200 [puppet] - 10https://gerrit.wikimedia.org/r/1062395 (https://phabricator.wikimedia.org/T365449)
[13:20:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] airflow: fetch PG connection URI from the cloudnative PG cluster secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062398 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol)
[13:21:07] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.gitlab.upgrade:  also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto)
[13:21:40] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3616/co" [puppet] - 10https://gerrit.wikimedia.org/r/1062395 (https://phabricator.wikimedia.org/T365449) (owner: 10Btullis)
[13:22:59] <wikibugs>	 (03PS3) 10Btullis: Increase analytics postgres max_connections from 100 to 200 [puppet] - 10https://gerrit.wikimedia.org/r/1062395 (https://phabricator.wikimedia.org/T365449)
[13:23:41] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3617/co" [puppet] - 10https://gerrit.wikimedia.org/r/1062395 (https://phabricator.wikimedia.org/T365449) (owner: 10Btullis)
[13:23:42] <wikibugs>	 (03PS2) 10Brouberol: airflow: fetch PG connection URI from the cloudnative PG cluster secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062398 (https://phabricator.wikimedia.org/T372286)
[13:25:42] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2009.codfw.wmnet with OS bullseye
[13:25:50] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10061619 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with error...
[13:26:26] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2009.codfw.wmnet with OS bullseye
[13:26:32] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10061621 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2009.codfw.wmnet with OS bullseye
[13:29:41] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:33:58] <wikibugs>	 (03PS2) 10Jelto: sre.gitlab.upgrade:  also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564)
[13:35:56] <logmsgbot>	 !log jayme@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-main2009.codfw.wmnet with OS bullseye
[13:36:05] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10061680 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with error...
[13:36:34] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2009.codfw.wmnet with OS bullseye
[13:38:24] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10061682 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2009.codfw.wmnet with OS bullseye
[13:39:55] <wikibugs>	 (03PS1) 10Btullis: Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203)
[13:40:12] <XioNoX>	 !log update homer wheels - T371890
[13:40:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:15] <stashbot>	 T371890: pynetbox incompatibility with Netbox >= 4.0.6 - https://phabricator.wikimedia.org/T371890
[13:40:38] <wikibugs>	 (03CR) 10Marostegui: "Btullis, thanks for working on this part. Next time please wait for any of us to review just in case." [puppet] - 10https://gerrit.wikimedia.org/r/1048390 (https://phabricator.wikimedia.org/T365453) (owner: 10Btullis)
[13:40:59] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Update wheels for pynetbox and paramiko updates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1062354 (https://phabricator.wikimedia.org/T371890) (owner: 10Ayounsi)
[13:41:48] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: update wheels - ayounsi@cumin1002
[13:46:45] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: update wheels - ayounsi@cumin1002
[13:47:05] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.gitlab.upgrade:  also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto)
[13:47:48] <wikibugs>	 (03CR) 10Ayounsi: "Partially yep, I'll properly test it once I951fda89d553731e7c9fa07fd5214278f69028e9 is deployed." [puppet] - 10https://gerrit.wikimedia.org/r/1062358 (https://phabricator.wikimedia.org/T371890) (owner: 10Ayounsi)
[13:48:25] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Increase analytics postgres max_connections from 100 to 200 [puppet] - 10https://gerrit.wikimedia.org/r/1062395 (https://phabricator.wikimedia.org/T365449) (owner: 10Btullis)
[13:48:25] <logmsgbot>	 !log xcollazo@deploy1003 Started deploy [airflow-dags/analytics@109c99e]: Airflow upgrade to v 2.9.3 for analytics instance. T365449.
[13:48:28] <stashbot>	 T365449: Upgrade Airflow to 2.9.3 - https://phabricator.wikimedia.org/T365449
[13:49:06] <logmsgbot>	 !log xcollazo@deploy1003 Finished deploy [airflow-dags/analytics@109c99e]: Airflow upgrade to v 2.9.3 for analytics instance. T365449. (duration: 00m 40s)
[13:49:15] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Upgrade the default airflow version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062024 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene)
[13:49:33] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Increase analytics postgres max_connections from 100 to 200 [puppet] - 10https://gerrit.wikimedia.org/r/1062395 (https://phabricator.wikimedia.org/T365449) (owner: 10Btullis)
[13:49:38] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] dns: provision airflow-test-k8s temp domain [dns] - 10https://gerrit.wikimedia.org/r/1062048 (https://phabricator.wikimedia.org/T368760) (owner: 10Stevemunene)
[13:51:20] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: /dev/sdg failed on thanos-be2002 - https://phabricator.wikimedia.org/T372406 (10MatthewVernon) 03NEW
[13:51:26] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: /dev/sdg failed on thanos-be2002 - https://phabricator.wikimedia.org/T372406#10061765 (10MatthewVernon) p:05Triage→03High
[13:52:15] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10061766 (10Jhancock.wm) a:03Jhancock.wm
[13:54:59] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd200[1-4] - https://phabricator.wikimedia.org/T370545#10061768 (10Jhancock.wm) a:03Jhancock.wm
[13:57:15] <Lucas_WMDE>	 !log UTC backport+config window done (since ~13:10, really)
[13:57:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:01] <Lucas_WMDE>	 (I’m still up for deploying that train blocker backport fwiw)
[13:59:50] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10061783 (10Jhancock.wm) a:03Jhancock.wm
[14:02:07] <wikibugs>	 (03PS1) 10Btullis: Include the tuning.conf file in the postgresql configuration [puppet] - 10https://gerrit.wikimedia.org/r/1062410 (https://phabricator.wikimedia.org/T365449)
[14:02:58] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3619/co" [puppet] - 10https://gerrit.wikimedia.org/r/1062410 (https://phabricator.wikimedia.org/T365449) (owner: 10Btullis)
[14:04:12] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on 7 hosts with reason: prep JunOS upgrade cloudsw1-d5-eqiad
[14:04:30] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on 7 hosts with reason: prep JunOS upgrade cloudsw1-d5-eqiad
[14:04:56] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 30 hosts with reason: JunOS upgrade cloudsw1-d5-eqiad
[14:05:20] <wikibugs>	 (03PS2) 10Btullis: Include the tuning.conf file in the postgresql configuration [puppet] - 10https://gerrit.wikimedia.org/r/1062410 (https://phabricator.wikimedia.org/T365449)
[14:05:22] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 30 hosts with reason: JunOS upgrade cloudsw1-d5-eqiad
[14:06:07] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3620/co" [puppet] - 10https://gerrit.wikimedia.org/r/1062410 (https://phabricator.wikimedia.org/T365449) (owner: 10Btullis)
[14:06:30] <wikibugs>	 (03PS3) 10Jelto: sre.gitlab.upgrade:  also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564)
[14:06:35] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[14:06:41] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:06:47] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] Include the tuning.conf file in the postgresql configuration [puppet] - 10https://gerrit.wikimedia.org/r/1062410 (https://phabricator.wikimedia.org/T365449) (owner: 10Btullis)
[14:08:02] <topranks>	 !log rebooting cloudsw1-d5-eqiad to clear errors and upgrade JunOS T371878
[14:08:21] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Include the tuning.conf file in the postgresql configuration [puppet] - 10https://gerrit.wikimedia.org/r/1062410 (https://phabricator.wikimedia.org/T365449) (owner: 10Btullis)
[14:09:09] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10061804 (10Jhancock.wm) I located the missing disk and reseated it. it's showing as having a size of 0.94 GB. Not sure if it's bad or needs to be reformatted. lmk and...
[14:10:09] <wikibugs>	 (03PS87) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822)
[14:10:14] <wikibugs>	 (03CR) 10AOkoth: prometheus: puppetise sql_exporter (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth)
[14:11:52] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-db1001.eqiad.wmnet
[14:16:47] <wikibugs>	 (03PS3) 10Filippo Giunchedi: prometheus: add oauth2-proxy for OIDC authentication [puppet] - 10https://gerrit.wikimedia.org/r/1062393 (https://phabricator.wikimedia.org/T326657)
[14:17:49] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-db1001.eqiad.wmnet
[14:18:40] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[14:18:45] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:21:10] <logmsgbot>	 !log btullis@deploy1003 Started deploy [airflow-dags/analytics_test@109c99e]: (no justification provided)
[14:21:20] <logmsgbot>	 !log btullis@deploy1003 Finished deploy [airflow-dags/analytics_test@109c99e]: (no justification provided) (duration: 00m 09s)
[14:21:23] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T372160#10061844 (10Jhancock.wm) I found the device with the idrac shut off. tried to start it back up. looks like it tries to boot and then crashes. tried to reset it manually with the i button. I've gotten the idrac to at l...
[14:21:42] <logmsgbot>	 !log btullis@deploy1003 Started deploy [airflow-dags/search@109c99e]: (no justification provided)
[14:22:02] <logmsgbot>	 !log btullis@deploy1003 Finished deploy [airflow-dags/search@109c99e]: (no justification provided) (duration: 00m 19s)
[14:22:07] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Live in Pontoon at https://prometheus-eqiad.o11y.wmcloud.org" [puppet] - 10https://gerrit.wikimedia.org/r/1062393 (https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi)
[14:22:34] <wikibugs>	 (03PS1) 10Sergio Gimeno: EventStreamConfig and stream registration for homepage modules analytics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062416 (https://phabricator.wikimedia.org/T370907)
[14:22:41] <logmsgbot>	 !log btullis@deploy1003 Started deploy [airflow-dags/research@109c99e]: (no justification provided)
[14:22:52] <logmsgbot>	 !log btullis@deploy1003 Finished deploy [airflow-dags/research@109c99e]: (no justification provided) (duration: 00m 11s)
[14:23:14] <logmsgbot>	 !log btullis@deploy1003 Started deploy [airflow-dags/platform_eng@109c99e]: (no justification provided)
[14:23:26] <wikibugs>	 (03PS2) 10Sergio Gimeno: EventStreamConfig and stream registration for homepage modules analytics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062416 (https://phabricator.wikimedia.org/T370907)
[14:23:38] <logmsgbot>	 !log btullis@deploy1003 Finished deploy [airflow-dags/platform_eng@109c99e]: (no justification provided) (duration: 00m 24s)
[14:23:53] <logmsgbot>	 !log btullis@deploy1003 Started deploy [airflow-dags/analytics_product@109c99e]: (no justification provided)
[14:24:02] <logmsgbot>	 !log btullis@deploy1003 Finished deploy [airflow-dags/analytics_product@109c99e]: (no justification provided) (duration: 00m 09s)
[14:24:10] <logmsgbot>	 !log btullis@deploy1003 Started deploy [airflow-dags/wmde@109c99e]: (no justification provided)
[14:24:18] <logmsgbot>	 !log btullis@deploy1003 Finished deploy [airflow-dags/wmde@109c99e]: (no justification provided) (duration: 00m 08s)
[14:25:12] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2009.codfw.wmnet with OS bullseye
[14:25:21] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10061858 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with error...
[14:26:23] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] Upgrade the default airflow version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062024 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene)
[14:27:35] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10061861 (10JMeybohm) >>! In T371423#10061804, @Jhancock.wm wrote: > I located the missing disk and reseated it. it's showing as having a size of 0.94 GB. Not sure if i...
[14:28:36] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] "For the record: this did not yield the expected result, will try again" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061856 (https://phabricator.wikimedia.org/T371885) (owner: 10Filippo Giunchedi)
[14:29:25] <wikibugs>	 (03PS2) 10Btullis: Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203)
[14:30:44] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3621/co" [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis)
[14:35:26] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:36:23] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10061922 (10JMeybohm) Oh, wait. 0.9GB - I totally misread. That is obviously not okay :D Tried rescanning the drives without luck. I would assume it's broken.
[14:36:41] <wikibugs>	 (03PS3) 10Ssingh: sre.dns.admin: add cookbook for GeoDNS pool/depool [cookbooks] - 10https://gerrit.wikimedia.org/r/1060914 (https://phabricator.wikimedia.org/T369366)
[15:05:34] <wikibugs>	 (03PS1) 10Gmodena: config: remove eventbus instrumentation setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062430 (https://phabricator.wikimedia.org/T363587)
[15:33:34] <wikibugs>	 (03CR) 10Papaul: [C:03+2] Add new Frack nodes to DNS files [dns] - 10https://gerrit.wikimedia.org/r/1062433 (owner: 10Papaul)
[15:33:41] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: /dev/sdg failed on thanos-be2002 - https://phabricator.wikimedia.org/T372406#10062135 (10MatthewVernon) Yes, that looks good to me. Thanks for the quick fix :)
[15:34:05] <wikibugs>	 (03PS1) 10Dzahn: gerrit: fix typo in hiera key name for throttling [puppet] - 10https://gerrit.wikimedia.org/r/1062434 (https://phabricator.wikimedia.org/T365259)
[15:34:34] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install frlog2002 - https://phabricator.wikimedia.org/T369935#10062140 (10Papaul)
[15:34:36] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1062434 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn)
[15:34:40] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] gerrit: fix typo in hiera key name for throttling [puppet] - 10https://gerrit.wikimedia.org/r/1062434 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn)
[15:35:55] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10062143 (10Papaul)
[15:36:29] <wikibugs>	 (03PS1) 10Arnaudb: mariadb: exclude translate_message_group_subscriptions from replication [puppet] - 10https://gerrit.wikimedia.org/r/1062436 (https://phabricator.wikimedia.org/T372287)
[15:37:18] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install frlog2002 - https://phabricator.wikimedia.org/T369935#10062141 (10Papaul) a:05Papaul→03Dwisehaupt @Dwisehaupt this is ready for you
[15:37:51] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] ncmonitor: Set ignored domains configuration (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1060891 (https://phabricator.wikimedia.org/T372076) (owner: 10BCornwall)
[15:38:07] <logmsgbot>	 !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-main2006.codfw.wmnet with OS bookworm
[15:39:18] <wikibugs>	 (03PS1) 10EoghanGaffney: admin: Add sarai-wmf to data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1062437 (https://phabricator.wikimedia.org/T372290)
[15:39:50] <mutante>	 !log gerrit - starting to drop packets from abusive sources (T365259)
[15:39:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:13] <wikibugs>	 (03PS1) 10Ebernhardson: flink chart: Create a debug sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062438
[15:43:06] <wikibugs>	 (03CR) 10Elukey: dhcp: allow empty distro for DHCPConfMac and DHCPConfOpt82 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060854 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey)
[15:43:12] <wikibugs>	 (03CR) 10Elukey: [C:03+2] dhcp: allow empty distro for DHCPConfMac and DHCPConfOpt82 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060854 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey)
[15:43:31] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10062163 (10Papaul) a:05Papaul→03Dwisehaupt @Dwisehaupt this is ready for you
[15:44:03] <wikibugs>	 (03PS1) 10Dzahn: gerrit: revert dropping packets from abusive source [puppet] - 10https://gerrit.wikimedia.org/r/1062440 (https://phabricator.wikimedia.org/T365259)
[15:44:16] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install civi2002, frpig2002, frpm2002 - https://phabricator.wikimedia.org/T369937#10062167 (10Papaul)
[15:44:28] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install civi2002, frpig2002, frpm2002 - https://phabricator.wikimedia.org/T369937#10062168 (10Papaul) ` papaul@fasw-c-codfw# show | compare    [edit interfaces interface-range disabled] -    member ge-0/0/34; -    mem...
[15:44:45] <wikibugs>	 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10062169 (10elukey) Encountered an issue with the BMC's network config:  ` supermicro_mgmt_network_changes = {     "...
[15:44:55] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] gerrit: revert dropping packets from abusive source [puppet] - 10https://gerrit.wikimedia.org/r/1062440 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn)
[15:47:02] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install civi2002, frpig2002, frpm2002 - https://phabricator.wikimedia.org/T369937#10062170 (10Papaul) a:05Papaul→03Dwisehaupt @Dwisehaupt this is ready for you
[15:50:10] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] "Looks good, remember you need to restart all sanitarium instances in both eqiad and codfw for every section" [puppet] - 10https://gerrit.wikimedia.org/r/1062436 (https://phabricator.wikimedia.org/T372287) (owner: 10Arnaudb)
[15:55:48] <wikibugs>	 (03CR) 10Scott French: "My apologies, Hugh - I meant to review this yesterday, but apparently I just left the tab open =/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062055 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan)
[16:00:05] <jouncebot>	 jhathaway and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:03:28] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dhcp: allow empty distro for DHCPConfMac and DHCPConfOpt82 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060854 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey)
[16:15:52] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to ldap/wmf for divec - https://phabricator.wikimedia.org/T372369#10062253 (10Dzahn) Hi! The request mentions "staff access rights"  but the email address isn't a WMF address. To clarify, is this really FTE staff or a contractor? Thanks!
[16:17:01] <wikibugs>	 (03PS2) 10BCornwall: varnish: Set Cache-Control: no-transform header [puppet] - 10https://gerrit.wikimedia.org/r/917954 (https://phabricator.wikimedia.org/T218618)
[16:21:10] <wikibugs>	 (03CR) 10Bking: [C:03+1] flink chart: Create a debug sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062438 (owner: 10Ebernhardson)
[16:22:49] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on 7 hosts with reason: prep for replacement of cloudsw1-d5-eqiad
[16:23:08] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 7 hosts with reason: prep for replacement of cloudsw1-d5-eqiad
[16:23:33] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: /dev/sdg failed on thanos-be2002 - https://phabricator.wikimedia.org/T372406#10062277 (10Jhancock.wm) 05Open→03Resolved np!
[16:23:54] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+2] flink chart: Create a debug sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062438 (owner: 10Ebernhardson)
[16:24:18] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "lgtm, confirmed in BetterWorks" [puppet] - 10https://gerrit.wikimedia.org/r/1062437 (https://phabricator.wikimedia.org/T372290) (owner: 10EoghanGaffney)
[16:25:01] <wikibugs>	 (03Merged) 10jenkins-bot: flink chart: Create a debug sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062438 (owner: 10Ebernhardson)
[16:56:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[16:56:52] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[16:57:05] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:57:12] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmf for SaraiSan WMF - https://phabricator.wikimedia.org/T372290#10062382 (10eoghan) a:03eoghan Confirmed that the account was all correct as per [[ https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests | the wiki ]] , and hav...
[17:00:05] <jouncebot>	 swfrench-wmf and jeena: gettimeofday() says it's time for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T1700)
[17:00:46] <swfrench-wmf>	 jeena: I'm here and ready when you are
[17:02:39] <swfrench-wmf>	 FYI, folks - we're planning to release a new version of scap that requires a coordinated puppet change. please check here before using scap, until noted otherwise :)
[17:05:40] <wikibugs>	 (03PS2) 10Dwisehaupt: Remove entries for payments2001 and payments2002 [dns] - 10https://gerrit.wikimedia.org/r/1062155 (https://phabricator.wikimedia.org/T371630)
[17:06:22] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore2*: Apply openjdk upgrade — T371874 - eevans@cumin1002
[17:14:38] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 15 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062037 (owner: 10Isabelle Hurbain-Palatin)
[17:19:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:20:03] <jeena>	 swfrench-wmf: Almost ready, just waiting for ci jobs to finish
[17:20:24] <jeena>	 swfrench-wmf: okay, ready now!
[17:21:05] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372336#10062414 (10phaultfinder)
[17:21:22] <swfrench-wmf>	 jeena: great! if you want to go ahead and upgrade scap, I'll merge d.ancy's change and run puppet on the deployment host
[17:21:33] <swfrench-wmf>	 I'll follow up here when it's safe to test
[17:21:53] <wikibugs>	 (03CR) 10Scott French: [C:03+2] scap.cfg.erb: Update release_repo_build_and_push_images_cmd [puppet] - 10https://gerrit.wikimedia.org/r/1060505 (https://phabricator.wikimedia.org/T371904) (owner: 10Ahmon Dancy)
[17:21:55] <jeena>	 👍
[17:24:12] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore2*: Apply openjdk upgrade — T371874 - eevans@cumin1002
[17:24:53] <logmsgbot>	 !log jhuneidi@deploy1003 Installing scap version "latest" for 211 hosts
[17:25:37] <logmsgbot>	 !log jhuneidi@deploy1003 Installation of scap version "latest" completed for 211 hosts
[17:26:11] <swfrench-wmf>	 !log run-puppet-agent on deploy1003 to pick up scap.cfg change for T371904
[17:26:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:23] <stashbot>	 T371904: Rewrite remaining make-container-image code in Python - https://phabricator.wikimedia.org/T371904
[17:26:31] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: security update - bking@cumin2002 - T371874
[17:27:25] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore1*: Apply openjdk upgrade — T371874 - eevans@cumin1002
[17:28:17] <swfrench-wmf>	 jeena: scap.cfg is updated now, so I think you should be good to test
[17:28:27] <jeena>	 thanks! I'll test now
[17:28:52] <logmsgbot>	 !log jhuneidi@deploy1003 Started scap sync-world: testing T371904
[17:29:41] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:32:36] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10062452 (10Dwisehaupt) @papaul This host is still listed as `frdc2004` in netbox instead of `frdb2004` thus has an incorrect mgmt dns setup. I could rename it and the mgmt in...
[17:39:23] <logmsgbot>	 !log jhuneidi@deploy1003 Finished scap sync-world: testing T371904 (duration: 10m 31s)
[17:39:28] <stashbot>	 T371904: Rewrite remaining make-container-image code in Python - https://phabricator.wikimedia.org/T371904
[17:39:44] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: security update - bking@cumin2002 - T371874
[17:40:15] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_eqiad: security update - bking@cumin2002 - T371874
[17:40:55] <jeena>	 swfrench-wmf: all seems well
[17:42:44] <swfrench-wmf>	 jeena: great, thanks for driving this :)
[17:43:56] <jeena>	 thank you!
[17:45:51] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore1*: Apply openjdk upgrade — T371874 - eevans@cumin1002
[17:48:28] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to ldap/wmf for divec - https://phabricator.wikimedia.org/T372369#10062486 (10Jdforrester-WMF)
[17:49:25] <wikibugs>	 (03Abandoned) 10Ssingh: Remove admin_state from repository (managed via confd) [dns] - 10https://gerrit.wikimedia.org/r/1062429 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh)
[17:49:31] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to ldap/wmf for divec - https://phabricator.wikimedia.org/T372369#10062487 (10Dzahn) @Jdforrester-WMF I see your edit. Thanks, got it!:)
[17:53:48] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to ldap/wmf for divec - https://phabricator.wikimedia.org/T372369#10062510 (10Aklapper) @dchan: Unrelated but could you please also [link your LDAP account](https://phabricator.wikimedia.org/settings/panel/external/) to be listed on https://phabricator.wikimedia....
[17:56:47] <wikibugs>	 (03PS1) 10CDanis: tunnelencabulator: add gitlab and idm [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1062448
[18:00:04] <jouncebot>	 jeena and jnuche: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T1800).
[18:00:34] <wikibugs>	 (03PS1) 10BCornwall: ncmonitor: Remove duplicate sysuser creation [puppet] - 10https://gerrit.wikimedia.org/r/1062449
[18:01:06] <jeena>	 o/
[18:01:49] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3625/co" [puppet] - 10https://gerrit.wikimedia.org/r/1062449 (owner: 10BCornwall)
[18:02:30] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] tunnelencabulator: add gitlab and idm [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1062448 (owner: 10CDanis)
[18:02:54] <wikibugs>	 (03CR) 10CDanis: [V:03+2 C:03+2] tunnelencabulator: add gitlab and idm [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1062448 (owner: 10CDanis)
[18:03:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 7.345% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:03:22] <wikibugs>	 (03PS2) 10BCornwall: ncmonitor: Remove duplicate sysuser creation [puppet] - 10https://gerrit.wikimedia.org/r/1062449
[18:04:08] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3626/co" [puppet] - 10https://gerrit.wikimedia.org/r/1062449 (owner: 10BCornwall)
[18:05:22] <jeena>	 backporting https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/1062284 before train
[18:05:37] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jhuneidi@deploy1003 using scap backport" [skins/Vector] (wmf/1.43.0-wmf.18) - https://gerrit.wikimedia.org/r/1062284 (https://phabricator.wikimedia.org/T372370) (owner: Jdlrobson)
[18:05:57] <jinxer-wm>	 FIRING: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:06:36] <denisse>	 ^ Here.
[18:07:58] <denisse>	 !incidents
[18:07:58] <sirenbot>	 4964 (ACKED)  ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad)
[18:08:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[18:08:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 5.876s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[18:09:06] <urandom>	  /me too
[18:09:34] <jeena>	 I will pause backport/train until I get the all-clear. Nothing has deployed yet
[18:09:51] <jeena>	 (because of the db overload issues)
[18:10:57] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:11:14] <denisse>	 !incidents
[18:11:14] <sirenbot>	 4964 (ACKED)  ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad)
[18:11:15] <sirenbot>	 4965 (ACKED)  db1199 (paged)/MariaDB Replica Lag: s4 (paged)
[18:11:15] <sirenbot>	 4966 (ACKED)  db1248 (paged)/MariaDB Replica Lag: s4 (paged)
[18:11:15] <sirenbot>	 4967 (ACKED)  db1238 (paged)/MariaDB Replica Lag: s4 (paged)
[18:11:15] <sirenbot>	 4968 (UNACKED)  db1221 (paged)/MariaDB Replica Lag: s4 (paged)
[18:11:15] <sirenbot>	 4969 (ACKED)  db1249 (paged)/MariaDB Replica Lag: s4 (paged)
[18:11:16] <sirenbot>	 4970 (UNACKED)  db1242 (paged)/MariaDB Replica Lag: s4 (paged)
[18:11:16] <sirenbot>	 4971 (UNACKED)  db1247 (paged)/MariaDB Replica Lag: s4 (paged)
[18:11:16] <sirenbot>	 4972 (ACKED)  db1244 (paged)/MariaDB Replica Lag: s4 (paged)
[18:11:17] <sirenbot>	 4973 (UNACKED)  db1241 (paged)/MariaDB Replica Lag: s4 (paged)
[18:11:17] <sirenbot>	 4974 (ACKED)  db1243 (paged)/MariaDB Replica Lag: s4 (paged)
[18:11:18] <sirenbot>	 4975 (ACKED)  db1190 (paged)/MariaDB Replica Lag: s4 (paged)
[18:11:32] <denisse>	 !ack 4968
[18:11:33] <sirenbot>	 4968 (ACKED)  db1221 (paged)/MariaDB Replica Lag: s4 (paged)
[18:11:43] <jinxer-wm>	 FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[18:11:44] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[18:11:51] <jinxer-wm>	 FIRING: [3x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[18:12:37] <urandom>	 this seems...bad
[18:12:43] <swfrench-wmf>	 db1160 does appear to be struggling since ~ 18:00
[18:12:47] <denisse>	 urandom: Should we depool the replica?
[18:12:55] <swfrench-wmf>	 it's the master =/
[18:12:57] <cdanis>	 db1160 is the s4 primary
[18:13:15] <jinxer-wm>	 FIRING: [4x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:13:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int (k8s) 45.11s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[18:13:17] <cdanis>	 okay, can someone take on posting to the status page?
[18:13:25] <denisse>	 cdanis: On it.
[18:14:25] <jinxer-wm>	 FIRING: [16x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:14:26] <swfrench-wmf>	 this rhymes pretty closely to what we were seeing on db1238 before it was switched out - IIRC there's a paste somewhere with diagnostic commands
[18:14:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:15:25] <cdanis>	 I'm catching up on https://phabricator.wikimedia.org/T370304#10001110
[18:15:42] <jinxer-wm>	 FIRING: [21x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:15:47] <denisse>	 Posted: https://www.wikimediastatus.net/incidents/jhq8qcw6bwz7
[18:15:57] <jinxer-wm>	 RESOLVED: [5x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:15:59] <cdanis>	 thanks denisse, just going to make some small edits :)
[18:16:12] <denisse>	 Thank you!
[18:16:38] <denisse>	 I'm taking over as IC.
[18:16:43] <jinxer-wm>	 RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[18:16:44] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[18:16:51] <jinxer-wm>	 RESOLVED: [8x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[18:17:20] <denisse>	 Document link: https://docs.google.com/document/d/1lscZB565H5z610ECTpit0lzke-rS3au0Cokn-MAy9xw/edit?usp=sharing
[18:18:15] <jinxer-wm>	 RESOLVED: [4x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 1.25% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:18:15] <jinxer-wm>	 RESOLVED: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int (k8s) 2.546s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[18:19:12] <swfrench-wmf>	 mysql exporter metrics are available again for db1160 as of ~ 18:14, which is consistent with things (e.g., replica lag) starting to recover
[18:19:25] <jinxer-wm>	 RESOLVED: [21x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:19:35] <cdanis>	 swfrench-wmf: this time, at least, host metrics were available the whole time
[18:19:46] <cdanis>	 https://grafana-rw.wikimedia.org/d/000000377/host-overview?forceLogin&from=1723571107208&orgId=1&to=1723573146824&var-cluster=mysql&var-datasource=thanos&var-server=db1160&viewPanel=13
[18:19:49] <cdanis>	 this tells part of a story
[18:20:19] <cdanis>	 so does this https://logstash.wikimedia.org/goto/d61dec1cd6c00c43f6454b9ac94a6f7f
[18:22:40] <swfrench-wmf>	 cdanis: ah, I didn't realize we even had trouble with host metrics before! but yeah, that makes sense, and is consistent with the jump in mysql-reported connection threads just before things ground to a halt
[18:22:59] <cdanis>	 yeah
[18:23:12] <cdanis>	 https://logstash.wikimedia.org/goto/95c46c61a3894c8a6600065b24b61338 this is the vast majority of the slowlogs at the time
[18:23:39] <cdanis>	 my read of that stack trace is that it's fetching Commons metadata via making an RPC back into mediawiki
[18:24:39] <cdanis>	 I unfortunately did not log into the host soon enough to get a socket dump at the time
[18:24:45] <jinxer-wm>	 RESOLVED: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[18:25:46] <swfrench-wmf>	 cdanis: agreed, yeah
[18:36:20] <wikibugs>	 (Merged) jenkins-bot: Revert "Prevent dark-mode styles from affecting print media" [skins/Vector] (wmf/1.43.0-wmf.18) - https://gerrit.wikimedia.org/r/1062284 (https://phabricator.wikimedia.org/T372370) (owner: Jdlrobson)
[18:37:14] <cdanis>	 denisse: so, I think we can consider the status page incident closed for now, and we should make some more notes on T370304
[18:37:14] <stashbot>	 T370304: Exception caught inside exception handler: Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. - https://phabricator.wikimedia.org/T370304
[18:37:48] <denisse>	 cdanis: Thanks, I'll close it and add my notes to that task and to the incident document.
[18:38:02] <cdanis>	 thanks!
[18:40:41] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[18:40:46] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:41:38] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[18:41:43] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:41:58] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[18:42:04] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:43:14] <jeena>	 continuing with train
[18:43:58] <logmsgbot>	 !log jhuneidi@deploy1003 Started scap sync-world: Backport for [[gerrit:1062284|Revert "Prevent dark-mode styles from affecting print media" (T372370)]]
[18:44:05] <stashbot>	 T372370: Regression: Icons not visible in dark mode - https://phabricator.wikimedia.org/T372370
[18:46:20] <logmsgbot>	 !log jhuneidi@deploy1003 jdlrobson, jhuneidi: Backport for [[gerrit:1062284|Revert "Prevent dark-mode styles from affecting print media" (T372370)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[18:47:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 20.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:50:31] <logmsgbot>	 !log jhuneidi@deploy1003 jdlrobson, jhuneidi: Continuing with sync
[18:52:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 20.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:54:56] <logmsgbot>	 !log jhuneidi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1062284|Revert "Prevent dark-mode styles from affecting print media" (T372370)]] (duration: 10m 58s)
[18:55:05] <stashbot>	 T372370: Regression: Icons not visible in dark mode - https://phabricator.wikimedia.org/T372370
[18:58:17] <wikibugs>	 (PS1) TrainBranchBot: group0 to 1.43.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/1062455 (https://phabricator.wikimedia.org/T366963)
[18:58:19] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group0 to 1.43.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/1062455 (https://phabricator.wikimedia.org/T366963) (owner: TrainBranchBot)
[18:59:01] <wikibugs>	 (Merged) jenkins-bot: group0 to 1.43.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/1062455 (https://phabricator.wikimedia.org/T366963) (owner: TrainBranchBot)
[19:03:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: push_cross_cluster_settings_9600.service on elastic1073:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:04:57] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372336#10062662 (phaultfinder)
[19:05:47] <logmsgbot>	 !log jhuneidi@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.43.0-wmf.18  refs T366963
[19:05:51] <stashbot>	 T366963: 1.43.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T366963
[19:09:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:13:37] <wikibugs>	 (CR) Jeena Huneidi: "Thanks Jdlrobson!" [skins/Vector] (wmf/1.43.0-wmf.18) - https://gerrit.wikimedia.org/r/1062284 (https://phabricator.wikimedia.org/T372370) (owner: Jdlrobson)
[19:18:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: push_cross_cluster_settings_9600.service on elastic1073:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:24:07] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[19:24:31] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[19:25:10] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[19:25:15] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:25:19] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[19:25:28] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:26:13] <wikibugs>	 (PS1) CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204)
[19:26:54] <wikibugs>	 (CR) CI reject: [V:-1] prometheus: add script to check TCP MSS clamping value [puppet] - https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: CDobbins)
[19:27:28] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T370754, transfer fresh wdqs-main journal to codfw host) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling neither afterwards
[19:27:31] <stashbot>	 T370754: Import WDQS subgraphs to production nodes - https://phabricator.wikimedia.org/T370754
[19:28:05] <wikibugs>	 (CR) CDobbins: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3628/console" [puppet] - https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: CDobbins)
[19:29:31] <wikibugs>	 (CR) CDobbins: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3629/console" [puppet] - https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: CDobbins)
[19:29:53] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T370754, transfer fresh wdqs-main journal to codfw host) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling neither afterwards
[19:32:18] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_eqiad: security update - bking@cumin2002 - T371874
[19:41:02] <wikibugs>	 ops-eqiad, DC-Ops, Machine-Learning-Team: Q#:rack/setup/install X - https://phabricator.wikimedia.org/T372432 (RobH) NEW
[19:41:23] <wikibugs>	 ops-eqiad, DC-Ops, Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10062785 (RobH)
[19:43:23] <wikibugs>	 ops-eqiad, DC-Ops, Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10062790 (RobH)
[19:43:59] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[19:44:04] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:44:33] <wikibugs>	 ops-eqiad, DC-Ops, Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10062796 (RobH) a:klausman @klausman: Would you, or someone on your team, please update the puppet repo for these new h...
[19:57:47] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T370754, transfer fresh wdqs-main journal to codfw host) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet w/ force delete existing files, repooling neither afterwards
[19:57:50] <stashbot>	 T370754: Import WDQS subgraphs to production nodes - https://phabricator.wikimedia.org/T370754
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T2000).
[20:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[20:02:20] <wikibugs>	 (PS7) Ryan Kemper: wdqs: store metadata about graph split type [cookbooks] - https://gerrit.wikimedia.org/r/1053205 (https://phabricator.wikimedia.org/T364077)
[20:04:36] <wikibugs>	 (Abandoned) Ryan Kemper: Revert "wdqs: enable throttling only for reqs from the CDN" [puppet] - https://gerrit.wikimedia.org/r/1054392 (owner: Ryan Kemper)
[20:06:18] <wikibugs>	 (CR) Ryan Kemper: [C:+2] elastic: run puppet in correct place (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/845086 (owner: Ryan Kemper)
[20:06:55] <wikibugs>	 (CR) JHathaway: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3630/console" [puppet] - https://gerrit.wikimedia.org/r/1057967 (owner: JHathaway)
[20:07:15] <wikibugs>	 (CR) Ryan Kemper: [C:+2] snapshot: Remove absented cirrus dump job [puppet] - https://gerrit.wikimedia.org/r/856655 (https://phabricator.wikimedia.org/T265056) (owner: Ebernhardson)
[20:09:06] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on es1029 - https://phabricator.wikimedia.org/T372208#10062854 (VRiley-WMF) I am currently ready for this activity today, or tomorrow. I have the replacement HDD ready
[20:10:29] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Disk (sdc) failed on ms-be1058 - https://phabricator.wikimedia.org/T372207#10062860 (VRiley-WMF) Hey @MatthewVernon Thanks. You are correct. As of this moment, we don't have any spare HDDs for this type of device. If this is planned to be in production l...
[20:15:31] <wikibugs>	 (CR) Ryan Kemper: [C:+2] wdqs: remove old nginx-level bans [puppet] - https://gerrit.wikimedia.org/r/976308 (owner: Ryan Kemper)
[20:16:12] <wikibugs>	 (CR) Bking: [C:+1] wdqs: remove old nginx-level bans [puppet] - https://gerrit.wikimedia.org/r/976308 (owner: Ryan Kemper)
[20:20:10] <wikibugs>	 (PS2) CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204)
[20:21:14] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1057967 (owner: JHathaway)
[20:22:03] <brett>	 !log Update ncmonitor to 1.2.0 via apt1002
[20:22:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:25:40] <wikibugs>	 (PS3) JHathaway: WIP: test pcc do not merge [puppet] - https://gerrit.wikimedia.org/r/1057967
[20:25:53] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1057967 (owner: JHathaway)
[20:27:47] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:32:18] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1057967 (owner: JHathaway)
[20:42:22] <wikibugs>	 SRE-OnFire, Incident Tooling: Harden corto systemd service - https://phabricator.wikimedia.org/T372437 (BCornwall) NEW
[20:44:59] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1057967 (owner: JHathaway)
[20:51:26] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T370754, transfer fresh wdqs-main journal to codfw host) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet w/ force delete existing files, repooling neither afterwards
[20:51:29] <stashbot>	 T370754: Import WDQS subgraphs to production nodes - https://phabricator.wikimedia.org/T370754
[20:55:23] <wikibugs>	 (PS12) BCornwall: Create corto deployment/configuration [puppet] - https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789)
[20:56:11] <wikibugs>	 (CR) BCornwall: Create corto deployment/configuration (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: BCornwall)
[20:56:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[21:01:01] <wikibugs>	 (CR) BCornwall: [V:+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: BCornwall)
[21:01:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:02:57] <jinxer-wm>	 FIRING: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:04:14] <Oshwah>	 Is there an issue or recent change to the API? My scripts that I use for quick actions on user accounts aren't working right now...
[21:04:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 7.767s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:04:25] <Oshwah>	 Oh nevermind - [741b6205-4d70-4e8e-a8ff-67f13dc52577] 2024-08-13 21:03:54: Fatal exception of type "Wikimedia\Rdbms\DBQueryError"
[21:04:32] <Oshwah>	 There must be...
[21:04:34] <Oshwah>	 :-)
[21:05:41] <denisse>	 This looks like the same issue we experienced earlier.
[21:05:44] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[21:06:15] <jinxer-wm>	 FIRING: [5x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[21:06:15] <jinxer-wm>	 FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:06:45] <jinxer-wm>	 FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[21:07:02] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[21:07:08] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:08:51] <jinxer-wm>	 FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:09:25] <jinxer-wm>	 FIRING: ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-int:4446 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:09:42] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[21:09:50] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:10:29] <urandom>	 denisse: same as before...
[21:11:15] <jinxer-wm>	 RESOLVED: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[21:11:15] <jinxer-wm>	 FIRING: [4x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:12:47] <swfrench-wmf>	 found the paste this time: https://phabricator.wikimedia.org/P67012
[21:13:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:13:51] <jinxer-wm>	 FIRING: [6x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:14:07] <cwhite>	 yep, db1160 is stuck
[21:14:15] <jinxer-wm>	 FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 8.393s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:14:25] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:14:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:14:34] <urandom>	 I reckon we should update status, what is the impact of this?
[21:14:39] <urandom>	 is it limited to commons?
[21:15:11] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[21:15:15] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:15:25] <cwhite>	 I'd guess not.  successful wiki edits are way below normal
[21:15:34] <urandom>	 yeah, just noticed the same
[21:15:48] <cwhite>	 swfrench-wmf: good find on the paste - are you gathering that data?
[21:15:48] <cdanis>	 no, this is a full editing outage
[21:15:51] <jinxer-wm>	 FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[21:15:52] <cdanis>	 I'm gathering a perf
[21:16:15] <jinxer-wm>	 FIRING: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:16:16] <cdanis>	 the mysqld process is spinning at 100% of one core
[21:16:30] <jinxer-wm>	 FIRING: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[21:16:34] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[21:16:39] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:16:45] <jinxer-wm>	 RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[21:16:57] <swfrench-wmf>	 cwhite: am doing, but I may be too late
[21:17:11] <swfrench-wmf>	 cdanis: awesome, thank you!
[21:17:12] <cdanis>	 I have a perf record from the time of the issue
[21:17:33] <cdanis>	 mysqld is now managing to use all of 2 CPUs .... 
[21:17:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:18:51] <jinxer-wm>	 RESOLVED: [6x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:19:16] <jinxer-wm>	 FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 8.232s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:19:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:19:25] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:20:02] <kamila_>	 denisse: do you want to add this iteration to the previous doc? if so, do you want a hand with the timeline?
[21:20:11] <cdanis>	 there are still many more sockets than usual in use on db1160
[21:20:18] <cdanis>	 I have a bunch of timestamped files in my homedir as well
[21:20:26] <cdanis>	 I'll attempt resolving them to pod IPs
[21:20:44] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[21:20:51] <jinxer-wm>	 FIRING: [8x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[21:21:15] <jinxer-wm>	 FIRING: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 17.42% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:21:25] <cdanis>	 does someone want to update the status page please
[21:21:30] <jinxer-wm>	 FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[21:23:33] <urandom>	 cdanis: yes, working on it
[21:23:37] <urandom>	 i'm a little slow
[21:23:50] <urandom>	 should I use the text from earlier, or is there more we can say about impact?
[21:24:56] <cdanis>	 text from earlier is fine
[21:25:21] <urandom>	 ok, done
[21:25:42] <cwhite>	 It looks like it's happening again
[21:25:51] <jinxer-wm>	 RESOLVED: [8x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[21:26:03] <cwhite>	 seen 4m 39s ago
[21:26:19] <denisse>	 kamila_: Yes, thank you!
[21:26:28] <kamila_>	 ack denisse 
[21:26:30] <jinxer-wm>	 RESOLVED: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[21:27:21] <eoghan>	 A bit late to the party, anything I can help with? 
[21:29:04] <wikibugs>	 (CR) LMata: [C:+1] Create corto deployment/configuration [puppet] - https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: BCornwall)
[21:29:20] <cdanis>	 eoghan: if you can figure out what is making mysqld so unhappy on db1160 then we all would be very grateful
[21:29:39] <cdanis>	 😅
[21:29:54] <eoghan>	 I'll get right on that, right after I finish perfecting world peace, k? 
[21:29:57] <jinxer-wm>	 FIRING: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:30:09] <eoghan>	 I'll have a look around and see if fresh eyes can help. Reading through the old task now
[21:30:32] <domas>	 I'D TAKE A LOOK BUT YOU TOOK AWAY MY ACCESS 
[21:30:41] <domas>	 BECAUSE OF NDA
[21:30:43] <domas>	 :-D
[21:30:52] <cdanis>	 thanks domas 
[21:31:00] <domas>	 YW!
[21:31:15] <jinxer-wm>	 FIRING: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:31:18] <eoghan>	 No Domas Access? 
[21:31:48] <jinxer-wm>	 FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[21:32:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: user@499.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:32:41] <cdanis>	 the notable thing I'm seeing is a lot of accept4() returning EAGAIN (Resource temporarily unavailable)
[21:32:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mobileapps.svc.codfw.wmnet:4102 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:32:53] <cdanis>	 I took an strace on the db because it's not like there's useful throughput happening anyway :)
[21:33:14] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[21:34:16] <jinxer-wm>	 FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 27.76s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:34:25] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:34:37] <jhathaway>	 the accept4 calls would correlate with the huge bump in socket utilization
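
For reference on the accept4() / EAGAIN observation above: on a non-blocking listening socket, EAGAIN from accept only means that no connection was queued at that instant. The following is a minimal Python sketch of that behaviour, not a reproduction of what mysqld does on db1160; the helper name `accept_without_pending_connection` and the loopback listener are illustrative assumptions.

```python
import errno
import socket


def accept_without_pending_connection():
    """Return the errno raised by accept() on an idle non-blocking listener."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 lets the kernel pick any free port on loopback.
        listener.bind(("127.0.0.1", 0))
        listener.listen(8)
        listener.setblocking(False)
        try:
            conn, _ = listener.accept()
        except BlockingIOError as exc:
            # Nothing is waiting in the accept queue, so the call
            # fails immediately instead of blocking.
            return exc.errno
        conn.close()
        return None
    finally:
        listener.close()


result = accept_without_pending_connection()
print(errno.errorcode.get(result))
```

A server that polls its listener in a loop would see this result on every pass that finds the queue empty, so a high count of them in an strace is not, by itself, evidence of an error.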
[21:34:57] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:35:42] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:36:15] <cdanis>	 jhathaway: do you know what it means when futex returns EAGAIN
[21:36:15] <jinxer-wm>	 FIRING: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:36:30] <jinxer-wm>	 FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[21:36:36] <jhathaway>	 no not again, I would assume try again?
[21:36:49] <jhathaway>	 no not offhand is what i meant
[21:36:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: user@499.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:37:36] <cwhite>	 I was able to capture a processlist.  `show engine innodb status` hangs
[21:37:51] <jinxer-wm>	 FIRING: [11x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:38:14] <cdanis>	 that's great re: processlist
[21:38:17] <swfrench-wmf>	 cdanis: vague recollection is that it's when wait "loses" the CAS (i.e., the value has changed)
[21:38:32] <cdanis>	 interesting
[21:38:42] <jhathaway>	 that sounds right swfrench-wmf from reading the manpage
[21:38:43] <cdanis>	 cwhite: where's the processlist?
[21:38:45] <jinxer-wm>	 FIRING: [5x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[21:38:46] <swfrench-wmf>	 cwhite: awesome, I have a couple as well, spaced a bit apart
[21:39:16] <jinxer-wm>	 FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 21.53s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:39:16] <kamila_>	 Note: on Linux, the symbolic names EAGAIN and EWOULDBLOCK (both of which appear in different parts of the kernel futex code) have the same value.
[21:39:19] <cdanis>	 swfrench-wmf: can I see one of yours?
[21:39:20] <cwhite>	 cdanis: homedir on cumin2002
[21:39:22] <cdanis>	 thanks
[21:39:25] <jinxer-wm>	 FIRING: [11x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:39:30] <kamila_>	 ^ from manpage
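
The manpage note quoted above can be checked from Python's `errno` module. This is a small sketch assuming a Linux host; the variable names are illustrative only.

```python
import errno
import os

# On Linux both symbolic names map to the same numeric value.
same_value = errno.EAGAIN == errno.EWOULDBLOCK

# The human-readable text that strace prints next to the error name.
message = os.strerror(errno.EAGAIN)

print(same_value, message)
```

Because the two names share a value, a tool that decodes the number can print either one, so a futex or accept4 call reported as EAGAIN may equally be read as "would block".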
[21:39:36] <jinxer-wm>	 FIRING: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from wikifeeds_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
[21:39:48] <swfrench-wmf>	 cdanis: same :)
[21:39:57] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:40:07] <swfrench-wmf>	 (cumin2002 home dir that is)
[21:40:29] <cdanis>	 are all these Sleep processes normal?
[21:40:42] <jinxer-wm>	 RESOLVED: [10x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:41:15] <jinxer-wm>	 FIRING: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:41:30] <jinxer-wm>	 RESOLVED: [5x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[21:41:43] <jinxer-wm>	 RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[21:41:50] <kamila_>	 the "it's actually EWOULDBLOCK" thing would explain the sleeping processes...
[21:41:55] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: user@499.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:42:18] <denisse>	 It's resolving. 🥺
[21:42:51] <jinxer-wm>	 FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[21:42:51] <jinxer-wm>	 FIRING: [11x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:42:59] <cdanis>	 db1160 is actually doing network bandwidth again
[21:43:14] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[21:44:03] <swfrench-wmf>	 mysql-reported metrics flowing again
[21:44:16] <jinxer-wm>	 FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 6.72s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:44:36] <jinxer-wm>	 RESOLVED: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from wikifeeds_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
[21:44:57] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:45:10] <swfrench-wmf>	 cdanis: I think that's the lazy collection of old client connections left behind (the Sleep states)
[21:46:14] <jhathaway>	 at peak we had more than 7.5K connections open, shouldn't we have about the same number of lines in the process list, or did we perhaps capture too late?
[21:46:15] <jinxer-wm>	 RESOLVED: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:46:20] <cdanis>	 why doesn't the mariadb cli client have switchable output formats
[21:46:28] <swfrench-wmf>	 heh
[21:46:39] <cdanis>	 jhathaway: 3.6k is still in the neighborhood
[21:47:09] <jhathaway>	 true, compared to our normal usage
[21:47:25] <cwhite>	 innodb status query took 6min to run
[21:47:28] <cdanis>	 heh
[21:47:32] <inflatador>	 yikes
[21:47:51] <jinxer-wm>	 RESOLVED: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[21:47:51] <jinxer-wm>	 RESOLVED: [5x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:47:56] <cwhite>	 I only caught some of it because buffer
[21:49:35] <cwhite>	 looks like it's back
[21:49:48] <denisse>	 Should we update the status page?
[21:50:47] <cdanis>	 let's give it a few more minutes
[21:50:50] <cdanis>	 https://i.imgur.com/E6yf51u.png
[21:51:32] <cdanis>	 we're rolling back things on s4?
[21:52:58] <denisse>	 Not that I'm aware of...
[21:53:06] <cdanis>	 well that's what the perf output makes it look like
[21:53:15] <cdanis>	 is anyone familiar with the mariadb "performance schema" ?  it's apparently already enabled on db1160
[21:54:16] <jinxer-wm>	 RESOLVED: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 1.173s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:56:20] <inflatador>	 !log bking@cumin2002 reboot wdqs101[3-5],1018,1020 from DRAC due to unresponsiveness T372442
[21:56:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:24] <stashbot>	 T372442: Remediation for unresponsive WDQS hosts wdqs101[3-5],1018,1020 - https://phabricator.wikimedia.org/T372442
[21:56:50] <eoghan>	 cdanis: That would be transactions that hit the timeout and couldn't be committed, maybe? 
[21:56:58] <cdanis>	 hm, maybe.
[21:59:44] <jinxer-wm>	 FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment mw-api-ext.eqiad.canary in mw-api-ext at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[21:59:55] <cdanis>	 I'm about to give up on digging for now, it's quite late here.  cwhite swfrench-wmf can you comment on T370304 with links/paths to any artifacts you collected ? 
[21:59:56] <stashbot>	 T370304: Exception caught inside exception handler: Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. - https://phabricator.wikimedia.org/T370304
[22:00:14] <swfrench-wmf>	 cdanis: ack, will do
[22:00:21] <swfrench-wmf>	 thanks again for grabbing perf samples
[22:00:49] <urandom>	 shall we resolve the status page?
[22:00:54] <eoghan>	 I'm out too, I'm not being useful and have an early start tomorrow. Good luck! 
[22:00:55] <cdanis>	 +1
[22:01:10] <urandom>	 it feels disingenuous, but...
[22:02:51] <cwhite>	 Will do!
[22:02:53] <cdanis>	 urandom: I think it'd be worse to leave it open overnight, or even in 'monitoring'
[22:03:17] <urandom>	 no, I think it should be marked resolved (for now)
[22:03:22] <cdanis>	 yeah, I agree
[22:03:35] <cdanis>	 it is unfortunate that we don't have, well, much of anything of an explanation, but so it goes sometimes
[22:03:36] <urandom>	 I just meant that, it's not resolved :)
[22:03:40] <urandom>	 right.
[22:04:44] <denisse>	 Thanks all for your help! <3
[22:07:48] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:09:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:09:44] <jinxer-wm>	 RESOLVED: [2x] KubernetesDeploymentUnavailableReplicas: Deployment mw-api-ext.eqiad.canary in mw-api-ext at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[22:57:59] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to <localhost port 9200> for <ecarg> - https://phabricator.wikimedia.org/T372445 (ecarg) NEW
[23:01:32] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to <localhost port 9200> for <ecarg> - https://phabricator.wikimedia.org/T372445#10063189 (ecarg) More context:   I'm able to get to this step:  ` ecarg@deploy1003:~$ curl https://logs-api.svc.eqiad.wmnet/ {   "name" : "logstash1031-production-elk7-eqiad",   "c...
[23:02:36] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to <LogStash server or localhost port 9200> for <ecarg> - https://phabricator.wikimedia.org/T372445#10063190 (ecarg)
[23:20:51] <wikibugs>	 (PS1) Dzahn: firewall: don't throttle untracked connections [puppet] - https://gerrit.wikimedia.org/r/1062471 (https://phabricator.wikimedia.org/T365259)
[23:38:41] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1062473
[23:38:41] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1062473 (owner: TrainBranchBot)