[00:01:11] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1061572 (owner: 10TrainBranchBot) [00:04:24] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:07:32] FIRING: [6x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:07:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:30:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [00:39:24] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:40:42] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:05:25] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:07:32] FIRING: [4x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:12:32] FIRING: [6x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:17:32] FIRING: [6x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:30:42] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:34:24] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:34:52] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - 
https://phabricator.wikimedia.org/T372042#10056457 (10phaultfinder) [02:09:24] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:09:24] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:10:42] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:39:24] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:40:42] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:44:24] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:55:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:59:24] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:25:42] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:29:24] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:29:40] FIRING: SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:30:42] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:34:24] FIRING: [4x] 
ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:39:24] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:40:42] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:44:48] FIRING: PuppetFailure: Puppet has failed on mw1351:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:53:48] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:54:48] RESOLVED: PuppetFailure: Puppet has failed on mw1351:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:30:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [05:00:25] FIRING: 
[5x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:03:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:17:47] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:25:42] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:29:24] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for 
poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:24] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:10:25] FIRING: [6x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:44:24] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:45:42] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:00:04] Amir1 and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240812T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:12:26] (03PS1) 10Filippo Giunchedi: mediawiki: bump limit/request for statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061856 (https://phabricator.wikimedia.org/T371885) [07:17:32] FIRING: [5x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:22:32] FIRING: [4x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:26:13] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: 10BCornwall) [07:27:32] FIRING: [3x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:29:40] FIRING: SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:32:32] FIRING: [3x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:38:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 - s2', diff saved to https://phabricator.wikimedia.org/P67270 and previous config saved to 
/var/cache/conftool/dbconfig/20240812-073846-arnaudb.json [07:39:23] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: index corruption [07:39:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: index corruption [07:50:18] (03PS1) 10Filippo Giunchedi: idp: add prometheus OIDC client [puppet] - 10https://gerrit.wikimedia.org/r/1061944 (https://phabricator.wikimedia.org/T326657) [07:55:58] (03CR) 10AOkoth: [C:03+1] Revert "phabricator: delay pages my 30 minutes to reduce alerting noise" [puppet] - 10https://gerrit.wikimedia.org/r/1059840 (https://phabricator.wikimedia.org/T371418) (owner: 10Jelto) [07:56:48] FIRING: PuppetFailure: Puppet has failed on mwdebug1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:04:25] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:05:42] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:06:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:06:48] FIRING: PuppetFailure: Puppet has failed on registry1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:06:48] RESOLVED: PuppetFailure: Puppet has failed on mwdebug1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:08:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1061101 (https://phabricator.wikimedia.org/T372172) (owner: 10NMW03) [08:08:57] FIRING: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:10:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 4.972s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:10:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - 
https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [08:10:42] FIRING: [5x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:11:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:13:57] RESOLVED: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:15:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 8.246s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:16:48] RESOLVED: PuppetFailure: Puppet has failed on registry1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:18:05] (03CR) 10Ayounsi: [C:03+1] "lgtm but I'll let Simon double check." 
[puppet] - 10https://gerrit.wikimedia.org/r/1061118 (owner: 10Majavah) [08:18:43] jouncebot: nowandnex [08:18:45] jouncebot: nowandnext [08:18:45] No deployments scheduled for the next 1 hour(s) and 41 minute(s) [08:18:45] In 1 hour(s) and 41 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240812T1000) [08:19:04] (03CR) 10Urbanecm: [C:03+2] MenteeOverviewApi: Do not apply undefined/null params [extensions/GrowthExperiments] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1061148 (https://phabricator.wikimedia.org/T372164) (owner: 10Urbanecm) [08:20:11] (03PS1) 10Btullis: Switch the new presto nodes to the insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1061949 (https://phabricator.wikimedia.org/T370543) [08:20:26] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [08:21:55] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: langid from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055145 (https://phabricator.wikimedia.org/T369344) (owner: 10Kevin Bazira) [08:23:45] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 23.68% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:25:15] FIRING: [5x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:30:15] FIRING: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:33:45] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 21.69% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:35:15] RESOLVED: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:35:41] (03PS2) 10Arnaudb: dbproxy: mirrors hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1055428 (https://phabricator.wikimedia.org/T368874) [08:35:41] (03CR) 10Arnaudb: "fixed!" [puppet] - 10https://gerrit.wikimedia.org/r/1055428 (https://phabricator.wikimedia.org/T368874) (owner: 10Arnaudb) [08:37:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1061148 (https://phabricator.wikimedia.org/T372164) (owner: 10Urbanecm) [08:39:25] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:39:25] (03CR) 10Stevemunene: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1061949 (https://phabricator.wikimedia.org/T370543) (owner: 10Btullis) [08:40:42] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:40:45] FIRING: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:41:19] (03PS1) 10Kevin Bazira: ml-services: use cxserver api without trailing slash [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061950 (https://phabricator.wikimedia.org/T371465) [08:42:34] (03CR) 10Kevin Bazira: [C:03+2] ml-services: langid from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055145 (https://phabricator.wikimedia.org/T369344) (owner: 10Kevin Bazira) [08:43:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 22.68% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:43:45] (03Merged) 10jenkins-bot: ml-services: langid from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055145 (https://phabricator.wikimedia.org/T369344) (owner: 10Kevin Bazira) [08:44:47] (03CR) 10Gergő Tisza: "Can you explain in more detail (in the commit summary, or a code comment, or in the task, so it's easy to find in the future) why argon2 d" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1061088 (https://phabricator.wikimedia.org/T112359) (owner: 10Zabe) [08:45:25] FIRING: [9x] 
SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:48:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 22.68% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:53:15] FIRING: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:55:25] FIRING: [10x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:56:59] (03Merged) 10jenkins-bot: MenteeOverviewApi: Do not apply undefined/null params [extensions/GrowthExperiments] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1061148 (https://phabricator.wikimedia.org/T372164) (owner: 10Urbanecm) [08:57:30] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1061148|MenteeOverviewApi: Do not apply undefined/null params (T372164)]] [08:57:32] T372164: Special:MentorDashboard broken - https://phabricator.wikimedia.org/T372164 [08:58:15] RESOLVED: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:59:04] (03CR) 10Vgutierrez: varnish: Add restrictive CSP to upload.wikimedia.org and add tests (033 comments) 
[puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [09:00:25] FIRING: [10x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:05:16] (03PS1) 10Filippo Giunchedi: Revert "grafana: set thanos as default datasource" [puppet] - 10https://gerrit.wikimedia.org/r/1061955 [09:07:32] FIRING: [4x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:07:50] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] Revert "grafana: set thanos as default datasource" [puppet] - 10https://gerrit.wikimedia.org/r/1061955 (owner: 10Filippo Giunchedi) [09:08:29] (03CR) 10Klausman: [C:03+1] ml-services: use cxserver api without trailing slash [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061950 (https://phabricator.wikimedia.org/T371465) (owner: 10Kevin Bazira) [09:09:52] (03PS1) 10AOkoth: Revert "vrts: change root mail alias" [puppet] - 10https://gerrit.wikimedia.org/r/1061956 [09:10:01] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: BGP status (instance cr1-esams) - https://phabricator.wikimedia.org/T372248 (10LSobanski) 03NEW [09:10:16] (03CR) 10Ayounsi: [C:03+1] "+1 one PCC is happy on some random hosts. No need to run it on all the roles." 
[puppet] - 10https://gerrit.wikimedia.org/r/1060458 (owner: 10Cathal Mooney) [09:10:19] (03CR) 10CI reject: [V:04-1] Revert "vrts: change root mail alias" [puppet] - 10https://gerrit.wikimedia.org/r/1061956 (owner: 10AOkoth) [09:10:25] FIRING: [10x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:10:41] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1061148|MenteeOverviewApi: Do not apply undefined/null params (T372164)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:10:49] T372164: Special:MentorDashboard broken - https://phabricator.wikimedia.org/T372164 [09:11:04] !log bounce grafana after https://gerrit.wikimedia.org/r/c/operations/puppet/+/1061955 [09:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:13] !log urbanecm@deploy1003 urbanecm: Continuing with sync [09:12:41] (03CR) 10Ayounsi: [C:03+1] dhcp: allow empty distro for DHCPConfMac and DHCPConfOpt82 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060854 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [09:12:44] (03PS2) 10Klausman: hiera/manifest/partman: Add DSE node with GPU [puppet] - 10https://gerrit.wikimedia.org/r/1057205 (https://phabricator.wikimedia.org/T368978) [09:15:42] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:17:24] !log urbanecm@deploy1003 Finished scap: Backport for [[gerrit:1061148|MenteeOverviewApi: Do not apply undefined/null params (T372164)]] (duration: 19m 54s) [09:17:27] T372164: 
Special:MentorDashboard broken - https://phabricator.wikimedia.org/T372164 [09:19:18] fabfur kamila_ FYI puppetmaster1001 / puppetmaster1003 are failing their probes and indeed I just got this from grafana1002 [09:19:21] Error: Connection to https://puppetserver1001.eqiad.wmnet:8140/puppet/v3 failed, trying next route: Request to https://puppetserver1001.eqiad.wmnet:8140/puppet/v3 timed out connect operation after 60.103 seconds [09:19:25] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:19:25] Wrapped exception: [09:19:27] Net::OpenTimeout [09:19:30] yes, that ^ [09:20:47] (03CR) 10Kevin Bazira: [C:03+2] ml-services: use cxserver api without trailing slash [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061950 (https://phabricator.wikimedia.org/T371465) (owner: 10Kevin Bazira) [09:21:00] huh [09:21:34] (03PS1) 10Brouberol: Revert^2 "cloudnative-pg-cluster: rely on the common_images data to ingfer the PG image tag" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061959 [09:22:11] it's been on for 9h so I doubt things are immediately on fire, but still [09:22:15] (03Merged) 10jenkins-bot: ml-services: use cxserver api without trailing slash [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061950 (https://phabricator.wikimedia.org/T371465) (owner: 10Kevin Bazira) [09:25:25] FIRING: [9x] SystemdUnitFailed: systemd-timedated.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:28:07] godog: where from are you getting the network errors? 
[09:28:36] kamila_: I saw that on a manual puppet run on grafana1002 [09:28:42] thx [09:28:54] sure np, I've launched another run just now [09:33:47] yeah and this was fine kamila_, definitely intermittent [09:33:53] yeah [09:34:00] haven't managed to reproduce [09:35:08] (03PS2) 10AOkoth: Revert "vrts: change root mail alias" [puppet] - 10https://gerrit.wikimedia.org/r/1061956 [09:37:32] FIRING: [4x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:38:51] (03PS1) 10Arnaudb: backups: adds backup2012 [puppet] - 10https://gerrit.wikimedia.org/r/1061961 [09:38:51] (03CR) 10Arnaudb: "same as https://gerrit.wikimedia.org/r/c/operations/puppet/+/1058295" [puppet] - 10https://gerrit.wikimedia.org/r/1061961 (owner: 10Arnaudb) [09:43:33] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 54994 [09:43:52] (03PS2) 10Arnaudb: backups: adds backup2012 [puppet] - 10https://gerrit.wikimedia.org/r/1061961 (https://phabricator.wikimedia.org/T371984) [09:45:53] (03CR) 10Btullis: [C:03+1] Revert^2 "cloudnative-pg-cluster: rely on the common_images data to ingfer the PG image tag" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061959 (owner: 10Brouberol) [09:46:03] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 54994 [09:46:30] (03CR) 10CI reject: [V:04-1] backups: adds backup2012 [puppet] - 10https://gerrit.wikimedia.org/r/1061961 (https://phabricator.wikimedia.org/T371984) (owner: 10Arnaudb) [09:48:01] (03CR) 10Hnowlan: "lgtm bar the existing comments about the package" [puppet] - 10https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: 10BCornwall) [09:48:08] kamila_: not sure if related but puppetmaster1003 puppet log shows 
some 500 errors retrieving facts [09:48:25] (03CR) 10Brouberol: [C:03+2] Revert^2 "cloudnative-pg-cluster: rely on the common_images data to ingfer the PG image tag" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061959 (owner: 10Brouberol) [09:49:25] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:49:38] vgutierrez: thanks, looking [09:49:51] (03PS1) 10Urbanecm: noc: Fix list of databases in db.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1061965 (https://phabricator.wikimedia.org/T372249) [09:50:04] 10ops-codfw, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Q1:rack/setup/install backup2012 - https://phabricator.wikimedia.org/T371984#10057046 (10ABran-WMF) a:05ABran-WMF→03None Thank you for the explanation 🙏 >>! In T371984#10048645, @RobH wrote: > This racking task lists you... 
[09:50:08] kamila_: hmm puppetdb2003 seems to be struggling: https://grafana.wikimedia.org/d/000000477/puppetdb?orgId=1&viewPanel=7 [09:50:25] FIRING: [8x] SystemdUnitFailed: systemd-timedated.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:50:29] (03CR) 10Jelto: [C:03+2] "See https://phabricator.wikimedia.org/T371418#10056999 for more context" [puppet] - 10https://gerrit.wikimedia.org/r/1059840 (https://phabricator.wikimedia.org/T371418) (owner: 10Jelto) [09:50:42] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:50:48] vgutierrez: interesting, but that's new, the network probes have been failing for longer [09:50:48] (03CR) 10Hnowlan: [C:03+1] mediawiki: add wikitech to virtual hosts [puppet] - 10https://gerrit.wikimedia.org/r/1059103 (https://phabricator.wikimedia.org/T371360) (owner: 10Effie Mouzeli) [09:51:44] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: BGP status (instance cr1-esams) - https://phabricator.wikimedia.org/T372248#10057060 (10ayounsi) a:03ayounsi Emailed AS54994 and cleared the errors for the others. [09:51:58] also eqiad vs codfw, so that's probably unrelated? [09:52:41] yeah.. 
it looks like puppetdb2003 has "always" been slow [09:53:53] * kamila_ rebooting puppetmaster1001 [09:53:57] actually [09:54:13] !log rebooting puppetmaster1001 due to intermittent network failures [09:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:50] looks like that didn't help [09:57:09] kamila_ what kind of intermittent network issues? [09:58:23] vgutierrez: https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fcustom&var-module=All&orgId=1&from=now-12h&to=now select puppetmaster1001, and godog reported intermittent puppet run failures above [09:58:46] (ugh sorry for the ping g.odog) [09:59:04] lol no worries kamila_, all good [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240812T1000) [10:01:45] FIRING: Traffic bill over quota: Alert for device cr2-eqsin.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [10:01:46] godog: that probe performs a TLS handshake or a full https request? [10:02:19] vgutierrez: IIRC the latter, I'm double checking [10:03:52] yeah full request vgutierrez [10:04:07] do you know the UA or the path that's using? 
[10:05:05] path seems to be https://[10.64.16.73]:8140/puppet/v3 [10:05:52] yeah what kamila_ said, UA is sth like "blackbox exporter/" IIRC [10:06:24] the path is the one that is used also by a puppet agent run [10:06:49] (plus minus hostname) [10:07:51] godog: looks like the exporter could be just "Go-http-client/1.1" [10:08:02] (03PS1) 10Brouberol: cloudnative-pg-cluster: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061966 (https://phabricator.wikimedia.org/T368240) [10:08:22] those are the requests coming from prometheus hosts [10:08:40] so I wouldn't say that's a network connectivity issue [10:08:45] yes that's possible too for sure [10:08:46] the probe is getting back some 5xx responses [10:08:56] (03CR) 10Brouberol: [C:03+2] cloudnative-pg-cluster: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061966 (https://phabricator.wikimedia.org/T368240) (owner: 10Brouberol) [10:09:17] 2024-08-12T10:07:02 4759895 10.64.16.62 proxy-server/500 531 GET http://puppetmaster1001.eqiad.wmnet/puppet/v3 - text/html - - Go-http-client/1.1 - - - - 10.64.16.62 - - [10:09:22] like that [10:09:22] funny how it starts almost exactly at midnight utc [10:09:26] need to run to lunch, biab [10:09:29] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:10:18] vgutierrez: yeah, I now believe it's not network [10:11:11] (03CR) 10Lucas Werkmeister (WMDE): Move section mapping to separate file (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057034 (owner: 10Zabe) [10:11:35] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1061956 (owner: 10AOkoth) [10:11:51] kamila_: sounds like the issue is on puppetmaster1003:8141 [10:12:08] kamila_: starting at midnight puppetmaster1001 starts to 
complaining about puppetmaster1003:8141 [10:12:27] where do you see that vgutierrez ? [10:12:29] (03CR) 10EoghanGaffney: [C:03+2] apt-staging: Remove log directive from staging distributions file [puppet] - 10https://gerrit.wikimedia.org/r/1056941 (owner: 10EoghanGaffney) [10:12:32] [Mon Aug 12 00:01:30.875853 2024] [proxy_http:error] [pid 1919] (70007)The timeout specified has expired: [client 2620:0:861:118:10:64:20:60:58306] AH01102: error reading status [10:12:32] line from remote server puppetmaster1003.eqiad.wmnet:8141 [10:12:44] hm [10:12:48] kamila_: /var/log/apache2/error.log in puppetmaster1001 [10:12:58] I can try rebooting 1003 :D [10:13:14] unless you have a better idea [10:14:41] kamila_: errr :) [10:14:45] https://www.irccloud.com/pastebin/DNKDMUPi/ [10:14:54] that's the crash on puppetmaster1003 error.log [10:15:45] it looks like at midnight (for logfile rotation purposes) passenger got reloaded/restarted and it failed to properly spawn the puppet master [10:15:52] oh [10:15:55] [ 2024-08-12 00:00:10.7477 7347/7fc4c404f700 age/Cor/App/Implementation.cpp:304 ]: Could not spawn process for application /usr/share/puppet/rack/puppet-master: An error occurre [10:15:55] d while starting up the preloader. [10:16:18] nice find [10:18:59] if e.g. passenger-spawn-server crashed it would be sufficient to restart apache. [10:19:07] !log restarting apache on puppetmaster1003 [10:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:43] ok... 
let's see if that improves things [10:19:55] yeah, thanks vgutierrez <3 [10:20:42] FIRING: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:21:45] RESOLVED: Traffic bill over quota: Alert for device cr2-eqsin.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [10:24:25] RESOLVED: [4x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:24:32] kamila_: error log is looking good now on both puppetmaster[1003,1003].. and recoveries are coming through [10:24:39] *1001,1003 [10:24:49] nice, gg vgutierrez! 
[10:26:30] (03CR) 10Zabe: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1061965 (https://phabricator.wikimedia.org/T372249) (owner: 10Urbanecm) [10:36:15] let's see if it repeats tonight :-D [10:41:09] (03PS1) 10Ilias Sarantopoulos: ml-services: fix wrong 500 errors in ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061970 [10:48:22] (03CR) 10Btullis: [C:03+1] hiera/manifest/partman: Add DSE node with GPU [puppet] - 10https://gerrit.wikimedia.org/r/1057205 (https://phabricator.wikimedia.org/T368978) (owner: 10Klausman) [10:48:36] (03CR) 10Btullis: [C:03+2] Switch the new presto nodes to the insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1061949 (https://phabricator.wikimedia.org/T370543) (owner: 10Btullis) [10:52:15] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10057174 (10BTullis) a:05BTullis→03None >>! In T370543#10054239, @Jclark-ctr wrote: > @BTullis the site.pp roles are incorrect can you update thes... [10:55:31] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1060816 (owner: 10L10n-bot) [10:56:17] (03CR) 10Klausman: [C:03+1] ml-services: fix wrong 500 errors in ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061970 (owner: 10Ilias Sarantopoulos) [11:00:42] (03PS2) 10Hnowlan: php-fpm: make /healthz smarter [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1060867 (owner: 10Giuseppe Lavagetto) [11:00:44] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: fix wrong 500 errors in ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061970 (owner: 10Ilias Sarantopoulos) [11:02:00] (03Merged) 10jenkins-bot: ml-services: fix wrong 500 errors in ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061970 (owner: 10Ilias Sarantopoulos) [11:03:26] !log isaranto@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [11:04:11] !log isaranto@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . [11:04:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:06:24] !log isaranto@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[11:09:26] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:09:56] (03CR) 10JMeybohm: [C:03+1] php-fpm: make /healthz smarter [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1060867 (owner: 10Giuseppe Lavagetto) [11:10:43] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [11:14:21] (03PS3) 10Arnaudb: backups: adds backup2012 [puppet] - 10https://gerrit.wikimedia.org/r/1061961 (https://phabricator.wikimedia.org/T371984) [11:16:26] (03PS1) 10AOkoth: vtrs: add confirmation prompt [cookbooks] - 10https://gerrit.wikimedia.org/r/1061973 (https://phabricator.wikimedia.org/T366078) [11:16:42] (03PS3) 10Hnowlan: php-fpm: make /healthz smarter [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1060867 (owner: 10Giuseppe Lavagetto) [11:22:19] !log ladsgroup@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet,service=s1 [11:24:39] (03CR) 10Hnowlan: [C:03+2] php-fpm: make /healthz smarter [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1060867 (owner: 10Giuseppe Lavagetto) [11:24:41] (03CR) 10Hnowlan: [V:03+2 C:03+2] php-fpm: make /healthz smarter [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1060867 (owner: 10Giuseppe Lavagetto) [11:26:25] !log rebuilding php7.4-fpm and php7.4-fpm-multiversion-base to pick up healthz worker awareness change (r/1060867) [11:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:40] FIRING: SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:38:12] (03PS2) 10AOkoth: vtrs: add confirmation prompt [cookbooks] - 10https://gerrit.wikimedia.org/r/1061973 (https://phabricator.wikimedia.org/T366078) [11:44:53] (03CR) 10Ladsgroup: [C:03+1] backups: adds backup2012 [puppet] - 10https://gerrit.wikimedia.org/r/1061961 (https://phabricator.wikimedia.org/T371984) (owner: 10Arnaudb) [11:45:13] (03CR) 10Arnaudb: [C:03+2] backups: adds backup2012 [puppet] - 10https://gerrit.wikimedia.org/r/1061961 (https://phabricator.wikimedia.org/T371984) (owner: 10Arnaudb) [11:51:46] vgutierrez, kamila_ just got online, thanks a lot for the puppetmaster fix! [11:53:48] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:55:15] I'd need to move the puppet private usage to puppetserver1001 today, I guess it is a good timing [11:55:34] IIUC this was an error with mod_passenger on puppetmaster1003 right? [11:59:49] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [12:02:37] (03PS1) 10Hnowlan: shellbox: bump image version, support for new healthz param [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061978 (https://phabricator.wikimedia.org/T357309) [12:02:53] (03PS1) 10Brouberol: cloudnative-pg-cluster: add a namespaced image catalog [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061979 (https://phabricator.wikimedia.org/T368240) [12:03:41] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . 
[12:03:43] elukey: yeah, mod_passenger didn't survive the logrotate poke at midnight [12:03:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:04:05] !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [12:04:19] (03CR) 10Btullis: [C:03+1] cloudnative-pg-cluster: add a namespaced image catalog [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061979 (https://phabricator.wikimedia.org/T368240) (owner: 10Brouberol) [12:04:37] (03PS2) 10Stevemunene: Upgrade airflow test instance version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1059969 (https://phabricator.wikimedia.org/T365449) [12:04:53] (03PS2) 10Brouberol: cloudnative-pg-cluster: add a namespaced image catalog [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061979 (https://phabricator.wikimedia.org/T368240) [12:05:20] (03CR) 10Btullis: [C:03+1] Upgrade airflow test instance version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1059969 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene) [12:05:58] (03CR) 10JMeybohm: [C:03+1] shellbox: bump image version, support for new healthz param [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061978 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [12:06:43] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061980 [12:09:43] (03CR) 10Jelto: [V:03+1 C:03+2] "This should be fine, the root disk for GitLab is significantly bigger. 
Let's enable the logging and review the logs in tomorrows office ho" [puppet] - 10https://gerrit.wikimedia.org/r/1060131 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [12:11:08] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Openjdk upgrade - elukey@cumin1002 [12:13:16] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10057365 (10JMeybohm) One thing I've noticed is that kafka-main2010 seems to have a different disk then all the others (all others are 1.7T models): ` sde... [12:15:17] (03CR) 10Elukey: [C:03+1] idp: add prometheus OIDC client [puppet] - 10https://gerrit.wikimedia.org/r/1061944 (https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi) [12:15:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [12:16:03] (03CR) 10Filippo Giunchedi: [C:03+2] idp: add prometheus OIDC client [puppet] - 10https://gerrit.wikimedia.org/r/1061944 (https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi) [12:16:55] 06SRE, 06collaboration-services, 06Traffic, 13Patch-For-Review, 10Release-Engineering-Team (Radar): implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#10057373 (10Jelto) I enabled the throttling rule for the production GitLab instance in logg... 
[12:17:32] (03PS3) 10Brouberol: cloudnative-pg-cluster: add a namespaced image catalog [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061979 (https://phabricator.wikimedia.org/T368240) [12:18:13] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061985 [12:21:04] (03PS2) 10Elukey: puppetmaster::gitclone: disarm pre-commit and post-commit hooks [puppet] - 10https://gerrit.wikimedia.org/r/1052261 (https://phabricator.wikimedia.org/T368023) [12:22:16] (03PS1) 10Elukey: Revert^2 "Move the dump_cloud_ip_ranges etcd upload to puppetserver" [puppet] - 10https://gerrit.wikimedia.org/r/1061991 [12:25:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [12:27:00] (03CR) 10Btullis: [C:03+1] cloudnative-pg-cluster: add a namespaced image catalog [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061979 (https://phabricator.wikimedia.org/T368240) (owner: 10Brouberol) [12:27:15] (03CR) 10Brouberol: [C:03+2] cloudnative-pg-cluster: add a namespaced image catalog [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061979 (https://phabricator.wikimedia.org/T368240) (owner: 10Brouberol) [12:32:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance [12:32:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance [12:35:45] !log restart exim4 on list1004 to pick up the new TLS material [12:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:30] !log restart exim4 on list2001 to pick up the new TLS material [12:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:54] 
(03PS1) 10Brouberol: cloudnative-pg-cluster: enable the pooler pod to reach the kube API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062002 (https://phabricator.wikimedia.org/T372256) [12:39:15] (03PS2) 10Brouberol: cloudnative-pg-cluster: enable the pooler pod to reach the kube API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062002 (https://phabricator.wikimedia.org/T372256) [12:44:27] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on es1029 - https://phabricator.wikimedia.org/T372208#10057399 (10ABran-WMF) 05Open→03In progress p:05Triage→03High [12:45:16] (03PS1) 10Elukey: profile::lists: reload exim4 when the TLS material changes [puppet] - 10https://gerrit.wikimedia.org/r/1062003 [12:46:06] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3606/console" [puppet] - 10https://gerrit.wikimedia.org/r/1062003 (owner: 10Elukey) [12:46:11] (03CR) 10Elukey: profile::lists: reload exim4 when the TLS material changes [puppet] - 10https://gerrit.wikimedia.org/r/1062003 (owner: 10Elukey) [12:50:00] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3607/co" [puppet] - 10https://gerrit.wikimedia.org/r/1062003 (owner: 10Elukey) [12:51:30] (03CR) 10Vgutierrez: "patch looks good from the acme_chief perspective, but somebody familiar with lists should say if it's ok to trigger an exim4 reload|restar" [puppet] - 10https://gerrit.wikimedia.org/r/1062003 (owner: 10Elukey) [12:54:30] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062005 [12:55:04] (03PS3) 10Brouberol: cloudnative-pg-cluster: enable the pooler pod to reach the kube API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062002 (https://phabricator.wikimedia.org/T372256) [12:55:26] FIRING: [2x] RoutinatorRsyncErrors: 
Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [12:56:00] (03CR) 10Btullis: [C:03+1] cloudnative-pg-cluster: enable the pooler pod to reach the kube API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062002 (https://phabricator.wikimedia.org/T372256) (owner: 10Brouberol) [12:56:04] (03CR) 10CI reject: [V:04-1] cloudnative-pg-cluster: enable the pooler pod to reach the kube API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062002 (https://phabricator.wikimedia.org/T372256) (owner: 10Brouberol) [12:59:26] (03PS1) 10Klausman: helmfile.d/ml-services: drop NLLB deployments in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062007 [12:59:51] (03PS4) 10Brouberol: cloudnative-pg-cluster: enable the pooler pod to reach the kube API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062002 (https://phabricator.wikimedia.org/T372256) [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240812T1300). [13:00:05] Nemoralis: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:12] o/ [13:00:35] (03CR) 10CI reject: [V:04-1] cloudnative-pg-cluster: enable the pooler pod to reach the kube API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062002 (https://phabricator.wikimedia.org/T372256) (owner: 10Brouberol) [13:01:28] (03PS5) 10Brouberol: cloudnative-pg-cluster: enable the pooler pod to reach the kube API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062002 (https://phabricator.wikimedia.org/T372256) [13:03:27] (03CR) 10Brouberol: [C:03+2] cloudnative-pg-cluster: enable the pooler pod to reach the kube API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062002 (https://phabricator.wikimedia.org/T372256) (owner: 10Brouberol) [13:03:39] (03CR) 10Stevemunene: [C:03+2] Upgrade airflow test instance version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1059969 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene) [13:05:26] RESOLVED: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [13:06:31] (03CR) 10Vgutierrez: [C:03+1] profile::lists: reload exim4 when the TLS material changes [puppet] - 10https://gerrit.wikimedia.org/r/1062003 (owner: 10Elukey) [13:07:01] (03CR) 10JHathaway: [C:03+1] profile::lists: reload exim4 when the TLS material changes [puppet] - 10https://gerrit.wikimedia.org/r/1062003 (owner: 10Elukey) [13:11:19] (03CR) 10Vgutierrez: [C:04-1] ncmonitor: Set ignored domains configuration (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1060891 (https://phabricator.wikimedia.org/T372076) (owner: 10BCornwall) [13:12:34] (03CR) 10Elukey: [V:03+1 C:03+2] profile::lists: reload exim4 when the TLS material changes [puppet] - 10https://gerrit.wikimedia.org/r/1062003 (owner: 10Elukey) [13:15:36] (03CR) 10Ilias Sarantopoulos: [C:03+1] helmfile.d/ml-services: drop NLLB deployments in prod 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1062007 (owner: 10Klausman) [13:15:59] (03CR) 10Klausman: [C:03+2] helmfile.d/ml-services: drop NLLB deployments in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062007 (owner: 10Klausman) [13:16:34] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062005 (owner: 10PipelineBot) [13:16:59] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10057632 (10Jhancock.wm) yes, I have a surplus of 1.7G disks and almost no 1G. so you get a bonus. [13:17:09] (03Merged) 10jenkins-bot: helmfile.d/ml-services: drop NLLB deployments in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062007 (owner: 10Klausman) [13:17:37] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062005 (owner: 10PipelineBot) [13:21:40] (03PS3) 10Ayounsi: provision cookbook add warning for virt hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1007567 (https://phabricator.wikimedia.org/T344342) [13:22:33] anybody to deploy my patch? 
[13:24:10] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [13:24:20] (03PS1) 10CDanis: tracing: tweak samplerates for services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062012 (https://phabricator.wikimedia.org/T320563) [13:24:42] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [13:25:29] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [13:26:42] (03PS2) 10CDanis: tracing: tweak samplerates for services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062012 (https://phabricator.wikimedia.org/T320563) [13:29:48] (03CR) 10Elukey: [C:03+2] Revert^2 "Move the dump_cloud_ip_ranges etcd upload to puppetserver" [puppet] - 10https://gerrit.wikimedia.org/r/1061991 (owner: 10Elukey) [13:33:01] !log klausman@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [13:33:25] !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . 
[13:34:58] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372042#10057714 (10phaultfinder) [13:36:22] (03CR) 10Hnowlan: [C:03+2] shellbox: bump image version, support for new healthz param [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061978 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [13:36:51] (03CR) 10Ayounsi: [C:03+2] provision cookbook add warning for virt hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1007567 (https://phabricator.wikimedia.org/T344342) (owner: 10Ayounsi) [13:37:47] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:38:21] (03Merged) 10jenkins-bot: shellbox: bump image version, support for new healthz param [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061978 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [13:38:50] (03CR) 10Elukey: [C:03+2] puppetmaster::gitclone: disarm pre-commit and post-commit hooks [puppet] - 10https://gerrit.wikimedia.org/r/1052261 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [13:45:04] (03CR) 10JHathaway: [C:03+1] puppetserver-deploy-code: don't use sudo when checking current branch [puppet] - 10https://gerrit.wikimedia.org/r/1060919 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [13:46:15] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [13:46:57] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [13:49:16] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10057819 (10elukey) Move done, 
tested the new "disarmed" pre-commit hook on puppetmaster1001 and a commit on puppetserver1001. [13:50:21] o/ [13:50:25] jouncebot: next [13:50:25] In 1 hour(s) and 39 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240812T1530) [13:50:29] Nemoralis: still around? I could deploy now [13:50:40] FIRING: [7x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:50:50] (03Merged) 10jenkins-bot: provision cookbook add warning for virt hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1007567 (https://phabricator.wikimedia.org/T344342) (owner: 10Ayounsi) [13:51:09] (03PS1) 10Stevemunene: Upgrade airflow wmde instance version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062018 (https://phabricator.wikimedia.org/T365449) [13:51:11] (03PS1) 10Stevemunene: Upgrade airflow research instance version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062019 (https://phabricator.wikimedia.org/T365449) [13:51:13] (03PS1) 10Stevemunene: Upgrade airflow platform_eng instance version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062020 (https://phabricator.wikimedia.org/T365449) [13:51:15] (03PS1) 10Stevemunene: Upgrade airflow analytics_product instance version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062021 (https://phabricator.wikimedia.org/T365449) [13:51:17] (03PS1) 10Stevemunene: Upgrade airflow search instance version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062022 (https://phabricator.wikimedia.org/T365449) [13:51:24] (03PS1) 10Stevemunene: Upgrade airflow analytics instance version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062023 (https://phabricator.wikimedia.org/T365449) [13:51:32] (03PS1) 10Stevemunene: 
Upgrade the default airflow version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1062024 (https://phabricator.wikimedia.org/T365449) [13:54:53] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372042#10057854 (10phaultfinder) [13:56:13] (03CR) 10Andrew Bogott: [C:03+2] puppetserver-deploy-code: don't use sudo when checking current branch [puppet] - 10https://gerrit.wikimedia.org/r/1060919 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [13:57:29] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Set wgAutoConfirmCount to 10 for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1061101 (https://phabricator.wikimedia.org/T372172) (owner: 10NMW03) [13:57:32] FIRING: [3x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:00:32] jouncebot: nowandnext [14:00:32] No deployments scheduled for the next 1 hour(s) and 29 minute(s) [14:00:32] In 1 hour(s) and 29 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240812T1530) [14:00:51] (03CR) 10Zabe: [C:03+2] Further configuration for bdrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1061152 (https://phabricator.wikimedia.org/T371760) (owner: 10Zabe) [14:01:38] (03Merged) 10jenkins-bot: Further configuration for bdrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1061152 (https://phabricator.wikimedia.org/T371760) (owner: 10Zabe) [14:01:52] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1061152|Further configuration for bdrwiki (T371760)]] [14:01:56] T371760: Post-creation work for bdrwiki - https://phabricator.wikimedia.org/T371760 [14:02:32] FIRING: [4x] ProbeDown: Service wdqs1022:443 has failed probes 
(http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:07:36] FIRING: [7x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:08:18] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 5:00:00 on wdqs1022.eqiad.wmnet with reason: noisy alert, will look at later in the day [14:08:34] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on wdqs1022.eqiad.wmnet with reason: noisy alert, will look at later in the day [14:09:24] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:09:50] (03PS2) 10Zabe: Use encrypted PBKDF2 for wrapping B type passwords instead of Argon2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1061088 (https://phabricator.wikimedia.org/T112359) [14:10:21] (03CR) 10Zabe: "tried doing that." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1061088 (https://phabricator.wikimedia.org/T112359) (owner: 10Zabe) [14:12:36] RESOLVED: [4x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:14:27] !log zabe@deploy1003 zabe: Backport for [[gerrit:1061152|Further configuration for bdrwiki (T371760)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:14:51] !log zabe@deploy1003 zabe: Continuing with sync [14:17:54] !log elukey@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: Openjdk upgrade - elukey@cumin1002 [14:19:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:19:56] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372042#10057922 (10phaultfinder) [14:20:41] !incidents [14:20:41] 4960 (UNACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [14:20:42] 4959 (RESOLVED) ProbeDown sre (10.2.2.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 eqiad) [14:20:47] !ack 4960 [14:20:48] 4960 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [14:20:50] !ack 4959 [14:20:51] Attempt to ack incident 4959 failed. 
[14:21:08] !incidents [14:21:08] 4960 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [14:21:08] 4959 (RESOLVED) ProbeDown sre (10.2.2.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 eqiad) [14:21:38] seems to be eqiad only [14:22:41] no big jump in traffic that I can see... [14:23:00] elukey: it's 5xx's [14:23:00] !log zabe@deploy1003 Finished scap: Backport for [[gerrit:1061152|Further configuration for bdrwiki (T371760)]] (duration: 21m 07s) [14:23:06] T371760: Post-creation work for bdrwiki - https://phabricator.wikimedia.org/T371760 [14:23:43] kamila_: yep I see, 503s, I mentioned high traffic to understand if the service was saturated somehow [14:23:48] right [14:24:00] it is ~4 rps so doesn't seem horrible [14:24:14] (thinking out loud, tell me what you think :) [14:26:21] Cc: urandom (for the swift part) [14:27:32] FIRING: [3x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:28:04] (03PS1) 10Stevemunene: Temporarily disable gobblin timers to upgrade Airflow [puppet] - 10https://gerrit.wikimedia.org/r/1062031 (https://phabricator.wikimedia.org/T365449) [14:28:07] (thanks elukey <3) [14:28:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:30:05] I am wondering if it is one faulty frontend proxy [14:31:33] I think it's one bad url [14:31:36] so yeah, plausible elukey [14:32:08] I checked https://w.wiki/AtvK and it seems related to multiple frontend proxies [14:32:32] RESOLVED: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:32:40] (03PS2) 10Peter Fischer: EventStreamConfig for mediawiki.cirrussearch.page_weighted_tags_change.rc0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056944 (https://phabricator.wikimedia.org/T366253) [14:33:03] elukey: https://logstash.wikimedia.org/goto/09623019c0025d2e74640a3d8e707716 [14:33:15] several proxies, full of the same path [14:33:19] kamila_: I see a lot of failures for ms-be1078.eqiad.wmnet [14:33:40] I checked the swift proxy logs on ms-fe1011 [14:33:49] maybe one backend is misbehaving? [14:34:14] elukey: where are you looking? [14:34:14] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [14:34:31] kamila_: swift-proxy.service on ms-fe1011 [14:34:38] journalctl -u swift-proxy -f [14:34:46] a lot of horror connect timeouts for the same ip [14:35:10] and I can't ssh to ms-be1078 [14:35:14] checking the console [14:35:25] FIRING: [8x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:35:39] true [14:36:36] kamila_: do you mind to check if the same ip pops up in other swift frontends? 
[14:36:46] ok, on it elukey [14:36:51] super [14:37:07] (03PS7) 10Brouberol: airflow: add conditional dependency to cloudnative-pg-cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062030 (https://phabricator.wikimedia.org/T372286) [14:37:16] (03PS1) 10JMeybohm: Add reuse-raid10-6dev profile to be used by new kafka-main nodes [puppet] - 10https://gerrit.wikimedia.org/r/1062033 (https://phabricator.wikimedia.org/T371423) [14:37:42] yep [14:37:55] elukey: it's popping up everywhere, checked 4 at random [14:38:03] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [14:38:11] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [14:38:26] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:38:26] kamila_: the host somehow doesn't have connectivity, I am logged as root via the mgmt console [14:38:31] interesting [14:38:35] step 1: depool? 
[14:39:25] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:54] kamila_: yes definitely, I am not sure if we can depool a backend though, never done it [14:40:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056944 (https://phabricator.wikimedia.org/T366253) (owner: 10Peter Fischer) [14:40:06] oh, I don't see the host in confctl at all [14:40:15] well that complicates things '^^ [14:40:40] swift is different, the backends are not fungible [14:40:40] elukey: do you think it's worth it to try rebooting the host? [14:40:48] (seems like that's all I'm suggesting today '^^) [14:41:00] cdanis: ok, makes sense, thanks [14:41:18] kamila_: it is an option yes, lemme check if I can find a little more and then I'll reboot [14:41:19] (makes sense given the local storage...) [14:41:26] thanks elukey <3 [14:41:42] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [14:41:48] * kamila_ is going to plop a thingy on statuspage [14:42:19] powercycled [14:42:24] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [14:42:32] kamila_: shouldn't be necessary [14:42:52] cdanis: ok, I'll hold [14:42:55] !log powercycle ms-be1078 - causing frontend errors in swift-eqiad, network link is down (if down/up didn't work, nothing in the dmesg/syslog) [14:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:57] thanks elukey [14:43:02] the ATSBackendErrorsHigh alert is quite sensitive IMO [14:43:24] cdanis: in general I agree, but this looks like it's on our end? 
[14:43:38] kamila_: yeah, but at the same time, swift-proxy should be automatically retrying on other storage nodes [14:43:40] or is your thinking :rather safe than sorry"? [14:43:49] ok [14:43:51] there's no piece of data that's just singly-homed there, and I don't think a few rps of error is worth posting about [14:43:57] * kamila_ doesn't actually know anything about uploads [14:44:17] (03PS1) 10Hnowlan: shellbox, shellbox-video: add support for min_avail_workers, set to 1 for video [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062035 (https://phabricator.wikimedia.org/T357309) [14:44:22] (03CR) 10Btullis: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1057205 (https://phabricator.wikimedia.org/T368978) (owner: 10Klausman) [14:44:37] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: security update - bking@cumin2002 - T371874 [14:44:48] so, I'm not Emperor, but as best I know, next steps for this ms-be node if it doesn't come back up: 1) dcops ticket to get them to check the cables, 2) depending on what's wrong, if it's going to be out of commission for a while, we'll need to edit the rings https://wikitech.wikimedia.org/wiki/Swift/Ring_Management [14:44:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:45:27] (03PS8) 10Brouberol: airflow: add conditional dependency to cloudnative-pg-cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062030 (https://phabricator.wikimedia.org/T372286) [14:45:34] ^ phew :D [14:45:45] !log jgiannelos@deploy1003 helmfile [eqiad] START 
helmfile.d/services/mobileapps: apply [14:46:07] kamila_: interestingly the host is not actually back up :D [14:46:19] uh huh '^^ [14:46:20] yes I was about to say that :D [14:46:30] nvm then '^^ [14:46:37] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [14:47:30] (errors went down below threshold but are not back to base value, right) [14:48:28] (03PS2) 10Hnowlan: shellbox, shellbox-video: add support for min_avail_workers, set to 1 for video [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062035 (https://phabricator.wikimedia.org/T357309) [14:48:32] so ethtool shows a "no link" and I don't see NIC-related errors [14:48:44] maybe it is on the switch side? [14:48:59] (03CR) 10Filippo Giunchedi: [C:03+1] tracing: tweak samplerates for services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062012 (https://phabricator.wikimedia.org/T320563) (owner: 10CDanis) [14:49:24] (03PS9) 10Brouberol: airflow: add conditional dependency to cloudnative-pg-cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062030 (https://phabricator.wikimedia.org/T372286) [14:49:53] elukey: https://librenms.wikimedia.org/device/device=287/tab=port/port=28404/ [14:49:55] ? [14:50:03] it's been offline for almost a week ? [14:50:38] huh [14:51:03] I don't see tasks opened for it.. [14:51:07] me neither [14:51:25] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on es1029 - https://phabricator.wikimedia.org/T372208#10058039 (10wiki_willy) a:03VRiley-WMF ++ @VRiley-WMF - fyi, this one looks like it's high priority [14:51:29] https://i.imgur.com/WMZjhY3.png icinga agrees [14:51:32] might explain the smaller peaks too [14:51:33] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372042#10058048 (10phaultfinder) [14:51:45] is it a red herring then? 
[14:52:03] the errors dropped a bit after the reboot [14:52:18] and the ms-fes are full of connect timeouts [14:52:21] elukey: so what I think caused the errors was that other swift servers were probably temporarily slow to respond [14:52:29] (03CR) 10Brouberol: "We're somehow not seeing it in CI, but I can see that the subchart value override work:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062030 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [14:52:34] I don't know how this fell through the cracks tbh, I thought this was something that DP kept track of? [14:52:45] (03PS1) 10Ilias Sarantopoulos: ml-services: deploy latest outlink version to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062036 (https://phabricator.wikimedia.org/T370408) [14:52:57] cdanis: ok so the theory is that when ms-fe1078 started to be poked more, the errors raised [14:53:15] or it's just noise, swift atsbackend pages fairly often [14:54:50] for now I suggest a dcops ticket to see what's up with the host / its cabling? [14:55:16] it's been down a whole week, I'm not sure if that's more or less of an argument for immediately taking it out of the storage cluster, either [14:55:16] Papaul is already checking on the switch side, will open a task afterwards [14:55:18] ok cool [14:55:28] papaul is? i thought this was an eqiad host :D [14:55:31] 06SRE, 06collaboration-services, 06Traffic, 13Patch-For-Review, 10Release-Engineering-Team (Radar): implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#10058059 (10Dzahn) ACK, sounds like a plan. Same for Gerrit, as you will have seen I upped... 
[14:55:48] cdanis: lol yes i can check that too [14:55:55] I have a theory [14:56:09] there is a spike of requests for a file that lives on that host [14:56:18] lemme see if that's actually true :D [14:56:52] cdanis: Papaul can go everywhere, it is an axiom [14:57:04] kamila_: that would make some sense tbh [14:57:23] sobanski: who is Swift owner while Emperor is out? [14:57:25] kamila_: yep good point [14:58:03] cdanis: probably me, but that doesn't bode well :) [14:58:07] (03CR) 10AOkoth: "Tested with https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Test_before_merging on cumin1002 and works as expected." [cookbooks] - 10https://gerrit.wikimedia.org/r/1061973 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [14:58:14] urandom: hehe [14:58:16] I was about to say that (the first part) [14:58:32] cdanis: switch is showing the link it down [14:58:33] (sorry, been trying to catch up on the backlog, not sure why I missed the notifications) [14:58:40] thanks papaul [14:59:17] (03CR) 10JMeybohm: [C:03+1] mediawiki: bump limit/request for statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1061856 (https://phabricator.wikimedia.org/T371885) (owner: 10Filippo Giunchedi) [14:59:21] so maybe just a cable issue [14:59:41] that'd be good :D [15:00:00] elukey: kamila_: my advice is to get someone onsite to look at the host, and then, if it's going to be longer than today to fix it, open a task and edit the host to `failed` per https://wikitech.wikimedia.org/wiki/Swift/Ring_Management#Removing_a_host [15:00:16] thank you cdanis <3 [15:00:24] +1 yes [15:00:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:01:03] (03CR) 10Klausman: [C:03+1] ml-services: deploy latest outlink version to prod [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1062036 (https://phabricator.wikimedia.org/T370408) (owner: 10Ilias Sarantopoulos) [15:01:30] each object in Swift is hashed to multiple parts of the 'ring' space; the queries for whatever objects are primary on this storage node are ofc retried elsewhere but at the mercy of whatever healthchecking and other stuff is in the swift frontend, which is waiting for it to come back up at any time (and the backends are not yet re-replicating more copies of this data, if the host is indeed never [15:01:32] coming back) [15:02:00] (03PS1) 10Isabelle Hurbain-Palatin: Activates the "compact" Parsoid indicator on all wikivoyage wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062037 [15:02:29] (03CR) 10Kevin Bazira: [C:03+1] ml-services: deploy latest outlink version to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062036 (https://phabricator.wikimedia.org/T370408) (owner: 10Ilias Sarantopoulos) [15:02:55] cdanis: makes sense [15:03:31] (03CR) 10Brouberol: [C:03+1] Temporarily disable gobblin timers to upgrade Airflow [puppet] - 10https://gerrit.wikimedia.org/r/1062031 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene) [15:04:48] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: deploy latest outlink version to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062036 (https://phabricator.wikimedia.org/T370408) (owner: 10Ilias Sarantopoulos) [15:05:24] kamila_: at this point we just need a task to dcops, want me to file it? Or are you doing it? [15:05:34] (03Merged) 10jenkins-bot: ml-services: deploy latest outlink version to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062036 (https://phabricator.wikimedia.org/T370408) (owner: 10Ilias Sarantopoulos) [15:05:53] elukey: if you're up for it, that'd be great, my RSI is kinda bad today [15:06:01] kamila_: doing it! [15:06:04] papaul is already on it, right? 
[15:06:13] ty elukey <3 [15:06:24] !log isaranto@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [15:06:30] yep yep there is some work on the dcops chan, I think somebody in eqiad needs to check the switch [15:06:47] kamila_: was just checking the status on the switch side [15:06:58] ack, thanks [15:07:14] as far as I know papaul can't astral project himself into eqiad [15:07:16] maybe [15:07:20] fair :D [15:07:29] you know I'm not sure actually [15:07:41] 10ops-eqiad, 06DC-Ops: ms-be1078 has no connectivity - https://phabricator.wikimedia.org/T372289 (10elukey) 03NEW [15:07:53] !log isaranto@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [15:07:53] I can check it [15:07:56] ok, I'll take a typing break for a bit and then pop up to see what the status is and whether we need to remove the host [15:08:00] VRiley: thankss!! Opened https://phabricator.wikimedia.org/T372289 [15:08:07] o/ VRiley, thanks <3 [15:09:31] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for SaraiSan WMF - https://phabricator.wikimedia.org/T372290 (10Sarai-WMF) 03NEW [15:11:40] (03CR) 10JMeybohm: [C:03+1] "While this generally looks good to me I wonder if you could programmatically add one worker in case min_available_workers is set >0. 
That " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062035 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [15:13:40] jouncebot: nowandnext [15:13:40] No deployments scheduled for the next 0 hour(s) and 16 minute(s) [15:13:40] In 0 hour(s) and 16 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240812T1530) [15:14:01] (03PS2) 10Urbanecm: noc: Fix list of databases in db.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1061965 (https://phabricator.wikimedia.org/T372249) [15:14:04] (03CR) 10Urbanecm: [C:03+2] noc: Fix list of databases in db.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1061965 (https://phabricator.wikimedia.org/T372249) (owner: 10Urbanecm) [15:14:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1061965 (https://phabricator.wikimedia.org/T372249) (owner: 10Urbanecm) [15:14:47] (03Merged) 10jenkins-bot: noc: Fix list of databases in db.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1061965 (https://phabricator.wikimedia.org/T372249) (owner: 10Urbanecm) [15:15:00] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1061965|noc: Fix list of databases in db.php (T372249)]] [15:15:04] T372249: db.php at noc.wikimedia.org does not include information about what wiki is where - https://phabricator.wikimedia.org/T372249 [15:19:06] (03CR) 10Elukey: [C:03+1] Add reuse-raid10-6dev profile to be used by new kafka-main nodes [puppet] - 10https://gerrit.wikimedia.org/r/1062033 (https://phabricator.wikimedia.org/T371423) (owner: 10JMeybohm) [15:21:32] (03CR) 10Elukey: [C:03+1] Add mtr to standard packages for WMF hosts [puppet] - 10https://gerrit.wikimedia.org/r/1060458 (owner: 10Cathal Mooney) [15:22:09] (03CR) 10CDanis: [C:03+1] Add mtr to standard packages for WMF hosts [puppet] - 10https://gerrit.wikimedia.org/r/1060458 (owner: 10Cathal 
Mooney) [15:22:27] jouncebot: nowandnext [15:22:27] No deployments scheduled for the next 0 hour(s) and 7 minute(s) [15:22:27] In 0 hour(s) and 7 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240812T1530) [15:23:23] !log urbanecm@deploy1003 Finished scap: Backport for [[gerrit:1061965|noc: Fix list of databases in db.php (T372249)]] (duration: 08m 22s) [15:23:26] T372249: db.php at noc.wikimedia.org does not include information about what wiki is where - https://phabricator.wikimedia.org/T372249 [15:23:36] * urbanecm done [15:25:51] (03CR) 10Hnowlan: [C:03+2] "That does seem quite reasonable and a little more intuitive, until the aforementioned resource stuff comes into play and then it's a bit o" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062035 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [15:26:20] (03CR) 10Elukey: [C:03+1] Add an-redacteddb to list of hosts that do not get IPv6 records [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1056892 (https://phabricator.wikimedia.org/T365453) (owner: 10Cathal Mooney) [15:27:03] (03CR) 10Elukey: [C:04-1] cloud-vps puppetservers: remove use of the 'gitpuppet' user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [15:27:18] (03Merged) 10jenkins-bot: shellbox, shellbox-video: add support for min_avail_workers, set to 1 for video [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062035 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [15:27:30] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on es1029 - https://phabricator.wikimedia.org/T372208#10058200 (10VRiley-WMF) On it, looking into it now [15:27:32] FIRING: [3x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:28:06] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060773 (owner: 10Ayounsi) [15:28:36] (03CR) 10Elukey: [C:03+1] Enable validators on Netbox for console(server) and power ports [puppet] - 10https://gerrit.wikimedia.org/r/1060436 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [15:29:41] FIRING: SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:30:02] (03CR) 10Elukey: [C:03+1] Netbox script proxy: set to absent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [15:30:05] jan_drewniak: OwO what's this, a deployment window?? Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240812T1530). 
nyaa~ [15:30:25] FIRING: [9x] SystemdUnitFailed: elasticsearch-disable-readahead.service on elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:31:32] (03CR) 10Elukey: check_netbox_report.py: reports -> scripts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1059042 (owner: 10Ayounsi) [15:32:32] FIRING: [4x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:33:44] (03CR) 10Ayounsi: check_netbox_report.py: reports -> scripts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1059042 (owner: 10Ayounsi) [15:34:14] 10ops-eqiad, 06SRE, 06DC-Ops: ms-be1078 has no connectivity - https://phabricator.wikimedia.org/T372289#10058254 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF Plugged in cable and checked with Papaul, it seems to be up now. [15:34:28] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [15:34:37] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on es1029 - https://phabricator.wikimedia.org/T372208#10058259 (10ABran-WMF) forgot to add: $ sudo megacli -PDList -aALL | grep Slot|sort -k 3 -n Slot Number: 0 Slot Number: 1 Slot Number: 2 Slot Number: 3 Slot Number: 4 Slot Number: 5 Slot Number: 6 Slo... 
[15:34:54] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [15:35:17] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [15:36:07] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [15:37:32] RESOLVED: [4x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:40:02] (03CR) 10Scott French: [C:03+1] mwscript_cleanup: Handle when job.status.conditions is None [puppet] - 10https://gerrit.wikimedia.org/r/1060946 (owner: 10RLazarus) [15:40:25] FIRING: [10x] SystemdUnitFailed: elasticsearch-disable-readahead.service on elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:41:18] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on es1029 - https://phabricator.wikimedia.org/T372208#10058288 (10VRiley-WMF) Checked on serial number. Warranty is expired. We have no spare 2TB drives for this unit. Checking options [15:43:56] kamila_: ms-be1078 is up [15:44:10] cool! [15:46:03] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on es1029 - https://phabricator.wikimedia.org/T372208#10058304 (10VRiley-WMF) Spoke to @ABran-WMF about this, we will swap out the drive tomorrow as per instructed. [15:46:08] (03CR) 10Scott French: [C:03+1] "Awesome, thank you!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1060515 (owner: 10RLazarus) [15:46:52] (03CR) 10Elukey: [C:03+1] service::uwsgi: add $ensure variable for clean removal [puppet] - 10https://gerrit.wikimedia.org/r/1060773 (owner: 10Ayounsi) [15:48:25] jouncebot: next [15:48:25] In 1 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240812T1700) [15:48:25] In 1 hour(s) and 11 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240812T1700) [15:48:45] (03CR) 10CDanis: [C:03+2] tracing: tweak samplerates for services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062012 (https://phabricator.wikimedia.org/T320563) (owner: 10CDanis) [15:49:59] (03Merged) 10jenkins-bot: tracing: tweak samplerates for services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062012 (https://phabricator.wikimedia.org/T320563) (owner: 10CDanis) [15:50:15] !log cdanis@deploy1003 helmfile [codfw] START helmfile.d/services/apertium: apply [15:50:38] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T372168#10058313 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF Checked on this server, and it seems to have been a temporary error. [15:51:14] !log cdanis@deploy1003 helmfile [codfw] DONE helmfile.d/services/apertium: apply [15:51:15] !log cdanis@deploy1003 helmfile [eqiad] START helmfile.d/services/apertium: apply [15:51:24] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372042#10058318 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF Attempted to rebalance power. It should be fine. 
[15:52:08] !log cdanis@deploy1003 helmfile [eqiad] DONE helmfile.d/services/apertium: apply [15:52:09] !log cdanis@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [15:52:24] (03CR) 10Elukey: sre.dns.admin: add cookbook for GeoDNS pool/depool (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1060914 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [15:52:42] !log cdanis@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [15:52:44] !log cdanis@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [15:53:02] (03CR) 10JHathaway: [C:03+1] data.yaml: Offboarding of mcastro [puppet] - 10https://gerrit.wikimedia.org/r/1059743 (owner: 10Slyngshede) [15:53:16] !log cdanis@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [15:53:17] !log cdanis@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply [15:53:35] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852#10058326 (10ABran-WMF) >>! In T370852#10028352, @Marostegui wrote: > @ABran-WMF please coordinate with @cmooney for this. ack, will do! >>! In T3708... [15:53:48] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:53:52] !log cdanis@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [15:53:53] !log cdanis@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply [15:54:05] (03CR) 10Elukey: "Left a comment, but LGTM!" 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1060431 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [15:54:25] !log cdanis@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [15:54:26] !log cdanis@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [15:54:42] (03PS1) 10Ebernhardson: cirrus: Update container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062046 [15:54:53] !log cdanis@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [15:54:54] !log cdanis@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:55:19] (03CR) 10Elukey: [C:03+1] "Sorry for the delay!" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1056132 (owner: 10FNegri) [15:55:25] FIRING: [10x] SystemdUnitFailed: elasticsearch-disable-readahead.service on elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:55:43] !log milimetric@deploy1003 Started deploy [airflow-dags/analytics@416511b]: (no justification provided) [15:56:23] !log milimetric@deploy1003 Finished deploy [airflow-dags/analytics@416511b]: (no justification provided) (duration: 00m 40s) [15:56:32] !log cdanis@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:57:20] (03CR) 10Ebernhardson: [C:03+2] cirrus: Update container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062046 (owner: 10Ebernhardson) [15:58:26] (03Merged) 10jenkins-bot: cirrus: Update container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062046 (owner: 10Ebernhardson) [16:00:23] !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:00:34] !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:00:37] (03CR) 10CDanis: 
[C:03+1] "+1 from me" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1056132 (owner: 10FNegri) [16:01:49] (03PS1) 10CDanis: add grafana-rw to tunnelencabulator hosts [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1062047 [16:02:45] (03PS1) 10Stevemunene: dns: provision airflow-test-k8s temp domain [dns] - 10https://gerrit.wikimedia.org/r/1062048 (https://phabricator.wikimedia.org/T368760) [16:07:14] (03PS4) 10Ayounsi: Add validators for console(server) and power ports [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1060431 (https://phabricator.wikimedia.org/T310590) [16:07:29] (03CR) 10Ayounsi: Add validators for console(server) and power ports (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1060431 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [16:07:32] FIRING: [4x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:08:00] (03PS1) 10DDesouza: miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062050 (https://phabricator.wikimedia.org/T219903) [16:08:47] (03PS9) 10Dzahn: miscweb: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1059418 (https://phabricator.wikimedia.org/T370677) [16:09:45] !log ebernhardson@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [16:10:03] !log ebernhardson@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:10:16] (03CR) 10Ayounsi: [C:03+2] Add validators for console(server) and power ports [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1060431 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [16:10:25] FIRING: [10x] SystemdUnitFailed: elasticsearch-disable-readahead.service 
on elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:10:28] (03CR) 10DDesouza: [C:03+2] miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062050 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [16:11:35] (03Merged) 10jenkins-bot: miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062050 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [16:12:05] (03Merged) 10jenkins-bot: Add validators for console(server) and power ports [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1060431 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [16:12:36] RESOLVED: [5x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:13:25] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [16:13:58] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [16:14:13] (03CR) 10Ayounsi: [C:03+2] Enable validators on Netbox-next for console(server) and power ports [puppet] - 10https://gerrit.wikimedia.org/r/1060435 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [16:14:21] (03CR) 10Dzahn: [V:04-1] "previous issue fixed but new issue: https://puppet-compiler.wmflabs.org/output/1059418/3608/miscweb1003.eqiad.wmnet/change.miscweb1003.eqi" [puppet] - 10https://gerrit.wikimedia.org/r/1059418 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [16:15:44] (03PS10) 10Dzahn: miscweb: switch firewall provider to nftables [puppet] - 
10https://gerrit.wikimedia.org/r/1059418 (https://phabricator.wikimedia.org/T370677) [16:16:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [16:20:07] (03CR) 10Ayounsi: [C:03+2] "Testing successful" [puppet] - 10https://gerrit.wikimedia.org/r/1060436 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [16:20:52] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1059418/3609/miscweb1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1059418 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [16:21:19] (03CR) 10Dzahn: "finally it compiles, always needs the extra line in Hiera with the empty array srange" [puppet] - 10https://gerrit.wikimedia.org/r/1059418 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [16:22:41] (03CR) 10Elukey: [C:03+2] dhcp: allow empty distro for DHCPConfMac and DHCPConfOpt82 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060854 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [16:22:55] (03CR) 10Elukey: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060855 (https://phabricator.wikimedia.org/T367410) (owner: 10Elukey) [16:24:19] (03CR) 10Elukey: "Tested locally in this way:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060855 (https://phabricator.wikimedia.org/T367410) (owner: 10Elukey) [16:26:30] (03CR) 10Elukey: "Still to solve - the build then fails, I need it to progress." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1060855 (https://phabricator.wikimedia.org/T367410) (owner: 10Elukey) [16:27:25] (03PS5) 10Ayounsi: Validators: enforce Trident3 port block consistency [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/985113 (https://phabricator.wikimedia.org/T303529) [16:28:09] (03PS5) 10Ayounsi: Validate IRB interface names correspond to vlan and refactor [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1040154 (https://phabricator.wikimedia.org/T366348) (owner: 10Cathal Mooney) [16:32:19] !log ebernhardson@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [16:32:35] !log ebernhardson@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:33:06] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: security update - bking@cumin2002 - T371874 [16:36:11] (03CR) 10FNegri: [C:03+2] "Thanks for the reviews, I'm gonna merge this, then I'll leave it to you if you want to do a release now, or if you want to bundle this wit" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1056132 (owner: 10FNegri) [16:36:27] !log ebernhardson@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [16:36:36] !log ebernhardson@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:36:40] (03CR) 10FNegri: [V:03+2 C:03+2] Don't use proxy for wikitech-static [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1056132 (owner: 10FNegri) [16:41:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [16:42:32] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes 
(http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:42:58] (03CR) 10CI reject: [V:04-1] dhcp: allow empty distro for DHCPConfMac and DHCPConfOpt82 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060854 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [16:43:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [16:46:59] jouncebot: nowandnext [16:46:59] No deployments scheduled for the next 0 hour(s) and 13 minute(s) [16:46:59] In 0 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240812T1700) [16:46:59] In 0 hour(s) and 13 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240812T1700) [16:47:16] (03PS2) 10Urbanecm: [Growth] dewiki: Enable frontend for Add Link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060148 (https://phabricator.wikimedia.org/T371597) [16:47:19] (03CR) 10Urbanecm: [C:03+2] [Growth] dewiki: Enable frontend for Add Link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060148 (https://phabricator.wikimedia.org/T371597) (owner: 10Urbanecm) [16:47:32] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:47:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060148 (https://phabricator.wikimedia.org/T371597) (owner: 10Urbanecm) [16:48:03] (03Merged) 10jenkins-bot: [Growth] dewiki: Enable frontend for Add Link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060148 (https://phabricator.wikimedia.org/T371597) (owner: 10Urbanecm) [16:48:14] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1060148|[Growth] dewiki: Enable frontend for Add Link (T371597)]] [16:48:18] T371597: Add Link: Release as "turned off" to German Wikipedia - https://phabricator.wikimedia.org/T371597 [16:48:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [16:49:56] (03PS1) 10Hnowlan: shellbox: allow readinessCheck parameters to be passed in values files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062055 (https://phabricator.wikimedia.org/T357309) [16:50:12] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1060148|[Growth] dewiki: Enable frontend for Add Link (T371597)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:51:35] (03CR) 10CI reject: [V:04-1] shellbox: allow readinessCheck parameters to be passed in values files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062055 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [16:51:48] !log urbanecm@deploy1003 Sync cancelled. 
[16:52:06] (03PS1) 10TrainBranchBot: Revert "[Growth] dewiki: Enable frontend for Add Link" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062056 [16:52:06] (03CR) 10TrainBranchBot: "urbanecm@deploy1003 created a revert of this change as I9bb4f11f84ab6af91afe8bdd2b62c8536535cf37" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060148 (https://phabricator.wikimedia.org/T371597) (owner: 10Urbanecm) [16:52:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062056 (owner: 10TrainBranchBot) [16:53:00] (03Merged) 10jenkins-bot: Revert "[Growth] dewiki: Enable frontend for Add Link" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062056 (owner: 10TrainBranchBot) [16:53:09] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1062056|Revert "[Growth] dewiki: Enable frontend for Add Link"]] [16:54:43] (03PS2) 10Hnowlan: shellbox: allow readinessCheck parameters to be passed in values files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062055 (https://phabricator.wikimedia.org/T357309) [16:55:13] !log urbanecm@deploy1003 urbanecm, trainbranchbot: Backport for [[gerrit:1062056|Revert "[Growth] dewiki: Enable frontend for Add Link"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:55:21] !log urbanecm@deploy1003 urbanecm, trainbranchbot: Continuing with sync [16:57:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host testhost2001.codfw.wmnet with OS bookworm [16:57:33] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: Test new hardware candidate for cloudbackup replacement - https://phabricator.wikimedia.org/T353746#10058585 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host testhost2001.codfw.wmnet with OS bookworm [16:59:48] !log urbanecm@deploy1003 Finished scap: Backport for [[gerrit:1062056|Revert "[Growth] dewiki: Enable frontend 
for Add Link"]] (duration: 06m 39s) [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240812T1700) [17:00:05] ryankemper: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240812T1700). [17:13:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on testhost2001.codfw.wmnet with reason: host reimage [17:15:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testhost2001.codfw.wmnet with reason: host reimage [17:16:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [17:27:51] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [17:27:57] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:34:06] (03PS1) 10Andrew Bogott: git-sync-upstream: use sudo for puppetserver-deploy-code [puppet] - 10https://gerrit.wikimedia.org/r/1062067 [17:35:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testhost2001.codfw.wmnet with OS bookworm [17:35:26] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: Test new hardware candidate for cloudbackup replacement - https://phabricator.wikimedia.org/T353746#10058736 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host testhost2001.codfw.wmnet with OS bookworm comple... 
[17:37:50] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062068 [17:39:01] (03PS1) 10Btullis: Enable the CustomReports plugin [puppet] - 10https://gerrit.wikimedia.org/r/1062069 (https://phabricator.wikimedia.org/T370203) [17:40:29] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3610/co" [puppet] - 10https://gerrit.wikimedia.org/r/1062069 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [17:49:25] (03CR) 10Btullis: Enable the CustomReports plugin [puppet] - 10https://gerrit.wikimedia.org/r/1062069 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [17:54:50] (03CR) 10Brouberol: dns: provision airflow-test-k8s temp domain (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1062048 (https://phabricator.wikimedia.org/T368760) (owner: 10Stevemunene) [17:54:52] 10ops-eqiad, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372308 (10phaultfinder) 03NEW [17:55:30] (03CR) 10Brouberol: "The change looks good. 
I'm trusting you on the config semantics" [puppet] - 10https://gerrit.wikimedia.org/r/1062069 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [17:55:31] (03PS1) 10Urbanecm: [Growth] dewiki: Enable frontend for Add Link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062072 (https://phabricator.wikimedia.org/T371597) [17:55:34] (03CR) 10Brouberol: [C:03+1] Enable the CustomReports plugin [puppet] - 10https://gerrit.wikimedia.org/r/1062069 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [17:55:46] (03CR) 10Urbanecm: [C:03+2] [Growth] dewiki: Enable frontend for Add Link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062072 (https://phabricator.wikimedia.org/T371597) (owner: 10Urbanecm) [17:55:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062072 (https://phabricator.wikimedia.org/T371597) (owner: 10Urbanecm) [17:55:55] (03CR) 10Btullis: [C:03+2] Enable the CustomReports plugin [puppet] - 10https://gerrit.wikimedia.org/r/1062069 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [17:56:28] (03Merged) 10jenkins-bot: [Growth] dewiki: Enable frontend for Add Link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062072 (https://phabricator.wikimedia.org/T371597) (owner: 10Urbanecm) [17:56:39] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1062072|[Growth] dewiki: Enable frontend for Add Link (T371597)]] [17:56:42] T371597: Add Link: Release as "turned off" to German Wikipedia - https://phabricator.wikimedia.org/T371597 [17:57:00] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [17:57:13] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:58:43] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1062072|[Growth] dewiki: Enable frontend for Add Link (T371597)]] synced 
to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:02:06] !log urbanecm@deploy1003 urbanecm: Continuing with sync [18:02:07] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [18:02:18] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:03:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:06:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:06:39] !log urbanecm@deploy1003 Finished scap: Backport for [[gerrit:1062072|[Growth] dewiki: Enable frontend for Add Link (T371597)]] (duration: 09m 59s) [18:07:07] T371597: Add Link: Release as "turned off" to German Wikipedia - https://phabricator.wikimedia.org/T371597 [18:09:24] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:11:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:16:26] RESOLVED: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:19:39] (03PS2) 10Jdlrobson: Roll out appearance 
menu and font size change to sister projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059393 (https://phabricator.wikimedia.org/T371020) [18:24:50] 10ops-eqiad, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372308#10058966 (10phaultfinder) [18:25:54] (03CR) 10C. Scott Ananian: [C:03+1] Activates the "compact" Parsoid indicator on all wikivoyage wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062037 (owner: 10Isabelle Hurbain-Palatin) [18:46:24] (03CR) 10JHathaway: [C:03+1] "looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1059850 (owner: 10Slyngshede) [18:46:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P67274 and previous config saved to /var/cache/conftool/dbconfig/20240812-184639-ladsgroup.json [18:47:03] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1174 - https://phabricator.wikimedia.org/T371927#10059098 (10Ladsgroup) 05Open→03Resolved [18:47:09] (03CR) 10JHathaway: [C:03+1] "looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1057862 (owner: 10Slyngshede) [18:48:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool db1238 (T371342)', diff saved to https://phabricator.wikimedia.org/P67275 and previous config saved to /var/cache/conftool/dbconfig/20240812-184830-ladsgroup.json [18:48:35] T371342: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342 [18:49:44] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10059119 (10Ladsgroup) I depooled it, so it should be fine but if you need to shut down or reboot it, please let me know beforehand so I can stop mariadb gracefully. 
[18:58:07] (03CR) 10JHathaway: "I think that would be okay, ideally we would confirm there is no mismatch between the user running the script and the owner of the git dir" [puppet] - 10https://gerrit.wikimedia.org/r/1058675 (https://phabricator.wikimedia.org/T364492) (owner: 10JHathaway) [19:01:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P67276 and previous config saved to /var/cache/conftool/dbconfig/20240812-190145-ladsgroup.json [19:08:32] (03CR) 10JHathaway: [C:03+1] "any thoughts on how an audit trail will be generated?" [software/bitu] - 10https://gerrit.wikimedia.org/r/1060092 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [19:09:01] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: Apply openjdk upgrade - T371874 - eevans@cumin1002 [19:09:15] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10059215 (10VRiley-WMF) 05Open→03Resolved This drive has been replaced. I will now be closing the ticket [19:10:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Install (2) 960GB SSDs each in kafka-main10[06-10] - https://phabricator.wikimedia.org/T371422#10059256 (10VRiley-WMF) Thanks @JMeybohm Currently, at eqiad we don't have many 960 gig SSDs. However, we do have larger sizes. As I understand, t... 
[19:11:08] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372308#10059282 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF Rebalanced some of the power cables [19:12:32] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:12:37] 06SRE, 06serviceops, 10Shellbox, 10Charts (Sprint 3): Figure out how a shellbox instance for the Chart extension would work - https://phabricator.wikimedia.org/T370739#10059265 (10Catrope) 05Open→03Resolved Thank you for weighing in everyone! I think we've gotten enough useful advice here that we c... [19:13:20] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki=dewiki --search-index --verbose # T372333, logs available as P67277 [19:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:23] T372333: de.wikipedia: Add Link unavailable due to a high-number of dangling records - https://phabricator.wikimedia.org/T372333 [19:15:19] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki=dewiki --force --db-table --verbose # T372333, script started [19:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:37] (03CR) 10JHathaway: [C:03+1] add grafana-rw to tunnelencabulator hosts [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1062047 (owner: 10CDanis) [19:16:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P67278 and previous config saved to /var/cache/conftool/dbconfig/20240812-191650-ladsgroup.json [19:21:48] !log 
[urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki=dewiki --force --db-table --verbose # T372333, script finished, logs are (gzipped) at F57269843 [19:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:51] T372333: de.wikipedia: Add Link unavailable due to a high-number of dangling records - https://phabricator.wikimedia.org/T372333 [19:23:06] (03CR) 10CDanis: [V:03+2 C:03+2] add grafana-rw to tunnelencabulator hosts [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1062047 (owner: 10CDanis) [19:24:26] FIRING: [2x] SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:24:56] 10ops-eqiad, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372336 (10phaultfinder) 03NEW [19:29:26] FIRING: [2x] SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:29:42] jouncebot: nowandnext [19:29:42] No deployments scheduled for the next 0 hour(s) and 30 minute(s) [19:29:42] In 0 hour(s) and 30 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240812T2000) [19:29:51] (03CR) 10Zabe: [C:03+2] Use encrypted PBKDF2 for wrapping B type passwords instead of Argon2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1061088 (https://phabricator.wikimedia.org/T112359) (owner: 10Zabe) [19:30:38] (03Merged) 10jenkins-bot: Use encrypted PBKDF2 for wrapping B type passwords instead of Argon2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1061088 (https://phabricator.wikimedia.org/T112359) (owner: 10Zabe) 
[19:30:51] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1061088|Use encrypted PBKDF2 for wrapping B type passwords instead of Argon2 (T112359)]] [19:31:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P67279 and previous config saved to /var/cache/conftool/dbconfig/20240812-193157-ladsgroup.json [19:33:04] !log zabe@deploy1003 zabe: Backport for [[gerrit:1061088|Use encrypted PBKDF2 for wrapping B type passwords instead of Argon2 (T112359)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:33:34] !log zabe@deploy1003 zabe: Continuing with sync [19:38:00] !log zabe@deploy1003 Finished scap: Backport for [[gerrit:1061088|Use encrypted PBKDF2 for wrapping B type passwords instead of Argon2 (T112359)]] (duration: 07m 08s) [19:43:57] (03PS28) 10CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) [19:46:45] !log rolling restart of eventgate-main in codfw - T371767 [19:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:54] T371767: revalidateLinkRecommendations.php fails periodically with JobQueueError: Could not enqueue jobs - https://phabricator.wikimedia.org/T371767 [19:47:04] !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: sync [19:47:28] !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync [19:53:48] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:58:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - 
https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240812T2000). [20:00:05] pfischer and Nemoralis: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:09] o/ [20:01:01] i can deploy [20:01:12] (03PS2) 10NMW03: Set wgAutoConfirmCount to 10 for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1061101 (https://phabricator.wikimedia.org/T372172) [20:01:13] (03CR) 10Zabe: [C:03+2] Set wgAutoConfirmCount to 10 for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1061101 (https://phabricator.wikimedia.org/T372172) (owner: 10NMW03) [20:01:20] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:02:06] (03Merged) 10jenkins-bot: Set wgAutoConfirmCount to 10 for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1061101 (https://phabricator.wikimedia.org/T372172) (owner: 10NMW03) [20:02:20] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1061101|Set wgAutoConfirmCount to 10 for azwiki (T372172)]] [20:02:23] T372172: Set wgAutoConfirmCount to 10 for azwiki - https://phabricator.wikimedia.org/T372172 [20:02:49] (03PS29) 10CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) [20:04:24] (03PS1) 10Clare Ming: Metrics Platform Instrument Configuration: Deploy to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062101 (https://phabricator.wikimedia.org/T368466) [20:04:27] !log zabe@deploy1003 nmw03, zabe: Backport for [[gerrit:1061101|Set 
wgAutoConfirmCount to 10 for azwiki (T372172)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:04:38] Nemoralis: can you test? [20:05:04] sure [20:05:28] if it is really possible [20:05:32] https://az.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=autopromote [20:05:33] it is [20:05:37] oh, til :) [20:05:47] same lol [20:05:59] !log zabe@deploy1003 nmw03, zabe: Continuing with sync [20:06:09] seems to be looking good [20:06:59] (03PS30) 10CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) [20:08:26] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [20:09:21] pfischer: around? [20:09:30] zabe: yes! [20:09:36] (03CR) 10Zabe: [C:03+2] EventStreamConfig for mediawiki.cirrussearch.page_weighted_tags_change.rc0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056944 (https://phabricator.wikimedia.org/T366253) (owner: 10Peter Fischer) [20:09:39] alright:) [20:10:22] !log zabe@deploy1003 Finished scap: Backport for [[gerrit:1061101|Set wgAutoConfirmCount to 10 for azwiki (T372172)]] (duration: 08m 01s) [20:10:25] T372172: Set wgAutoConfirmCount to 10 for azwiki - https://phabricator.wikimedia.org/T372172 [20:10:33] Nemoralis: should be live:) [20:10:41] FIRING: [8x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:10:44] (03Merged) 10jenkins-bot: EventStreamConfig for mediawiki.cirrussearch.page_weighted_tags_change.rc0 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1056944 (https://phabricator.wikimedia.org/T366253) (owner: 10Peter Fischer) [20:10:58] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1056944|EventStreamConfig for mediawiki.cirrussearch.page_weighted_tags_change.rc0 (T366253)]] [20:11:04] zabe: thanks! [20:11:04] T366253: Create a generic stream to populate CirrusSearch weighted_tags - https://phabricator.wikimedia.org/T366253 [20:12:58] !log zabe@deploy1003 pfischer, zabe: Backport for [[gerrit:1056944|EventStreamConfig for mediawiki.cirrussearch.page_weighted_tags_change.rc0 (T366253)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:13:02] pfischer: is your patch testable? [20:13:39] zabe: yes, I’ll look at streamconfig for meta through debug servers [20:13:46] okay [20:14:16] zabe: looks alright. [20:14:19] (03CR) 10Santiago Faci: [C:03+2] Metrics Platform Instrument Configuration: Deploy to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062101 (https://phabricator.wikimedia.org/T368466) (owner: 10Clare Ming) [20:14:25] cool, syncing [20:14:27] !log zabe@deploy1003 pfischer, zabe: Continuing with sync [20:15:13] (03Merged) 10jenkins-bot: Metrics Platform Instrument Configuration: Deploy to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062101 (https://phabricator.wikimedia.org/T368466) (owner: 10Clare Ming) [20:16:50] (03PS1) 10Clare Ming: Metrics Platform Instrument Configuration: Deploy to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062105 (https://phabricator.wikimedia.org/T368466) [20:18:53] !log zabe@deploy1003 Finished scap: Backport for [[gerrit:1056944|EventStreamConfig for mediawiki.cirrussearch.page_weighted_tags_change.rc0 (T366253)]] (duration: 07m 55s) [20:18:56] T366253: Create a generic stream to populate CirrusSearch weighted_tags - https://phabricator.wikimedia.org/T366253 [20:18:58] pfischer: should be live [20:19:14] !log cjming@deploy1003 helmfile 
[dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [20:19:32] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [20:19:46] zabe: great, thank you! meta shows the updated streamsconfig (w/o debug) [20:21:22] yw [20:22:06] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:22:32] FIRING: [4x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:24:25] !log update prefix of wrongly prefixed user password hashes from ':A:' to ':B:' in small batches -- T112359 [20:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:34] (03CR) 10Gergő Tisza: "Yes, thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1061088 (https://phabricator.wikimedia.org/T112359) (owner: 10Zabe) [20:26:44] (03CR) 10Santiago Faci: [C:03+2] Metrics Platform Instrument Configuration: Deploy to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062105 (https://phabricator.wikimedia.org/T368466) (owner: 10Clare Ming) [20:27:32] FIRING: [4x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:27:36] (03Merged) 10jenkins-bot: Metrics Platform Instrument Configuration: Deploy to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062105 (https://phabricator.wikimedia.org/T368466) (owner: 10Clare Ming) [20:29:31] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [20:29:43] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE
helmfile.d/dse-k8s-services/mpic: apply [20:33:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:39:25] (03PS31) 10CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org for testwiki only [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) [20:39:39] (03CR) 10CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org for testwiki only (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [20:45:24] !log upgrading postgresql on puppetdb2003 [20:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:33] !log upgrading postgresql on puppetdb1003 [20:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [20:59:25] FIRING: ProbeDown: Service restbase1040-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#restbase1040-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:00:05] Reedy, sbassett, Maryum, and manfredi: Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240812T2100). Please do the needful. 
[21:00:42] RESOLVED: ProbeDown: Service restbase1040-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#restbase1040-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:17:21] !log start wrapping type B password hashes with encrypted pbkdf2 in screen - T112359 [21:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:22] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: Apply openjdk upgrade - T371874 - eevans@cumin1002 [21:27:26] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:37:14] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#10060002 (10Dwisehaupt) a:05Dwisehaupt→03None Sorry for the delay, I was out last week. This should be fixed. The connections to the mg...
[21:38:56] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#10060015 (10Dwisehaupt) [22:09:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:16:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [22:23:48] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [22:26:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [22:52:35] (03CR) 10RLazarus: [C:03+2] mwscript_cleanup: Handle when job.status.conditions is None [puppet] - 10https://gerrit.wikimedia.org/r/1060946 (owner: 10RLazarus) [22:53:27] (03CR) 10RLazarus: [C:03+2] mediawiki: Build sidecars annotation dynamically [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060515 (owner: 10RLazarus) [22:55:34] (03Merged) 10jenkins-bot: mediawiki: Build sidecars annotation dynamically [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060515 (owner: 10RLazarus) [22:57:16] jouncebot: nowandnext [22:57:17] For the next 0 hour(s) and 2 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240812T2100) [22:57:17] In 3 hour(s) and 2 minute(s): Automatic branching 
of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240813T0200) [22:58:19] scapping a change to the MW job template, no effect on production deployments [22:58:59] !log rzl@deploy1003 Started scap sync-world: https://gerrit.wikimedia.org/r/1060515 [23:00:40] !log rzl@deploy1003 Finished scap: https://gerrit.wikimedia.org/r/1060515 (duration: 02m 14s) [23:01:26] all done [23:03:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:16:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:29:26] FIRING: SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:30:28] (03PS1) 10Dwisehaupt: Remove entries for payments2001 and payments2002 [dns] - 10https://gerrit.wikimedia.org/r/1062155 (https://phabricator.wikimedia.org/T371630) [23:38:40] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1062156 [23:38:41] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1062156 (owner: 10TrainBranchBot) [23:46:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - 
https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors