[00:07:08] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[00:13:39] PROBLEM - SSH on stat1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:15:33] RECOVERY - SSH on stat1010 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:17:15] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[00:19:39] PROBLEM - SSH on stat1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:20:29] RECOVERY - SSH on stat1010 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:24:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 21.8% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:34:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.62% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:40:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 20.52% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:45:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:12:50] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1304647
[01:12:50] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1304647 (owner: TrainBranchBot)
[01:21:30] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1304647 (owner: TrainBranchBot)
[01:22:39] PROBLEM - SSH on stat1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[01:28:29] RECOVERY - SSH on stat1010 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:00:30] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:07:08] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 38s)
[02:09:39] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:39] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:19:39] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:39] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:31] Puppet, Release-Engineering-Team: registry-homepage-builder.py doesn't sort images as expected - https://phabricator.wikimedia.org/T388287#12037972 (Krinkle) Whenever I update Fresh, I tend to wrongly update at least one of the various commands incorrectly. I then catch myself when I'm reviewing the diff...
[02:39:39] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:44:39] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (Transport: Arelion (IC-398708) {#20260601}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:07:08] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[04:17:15] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[07:00:35] (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1304652
[07:17:05] RECOVERY - jenkins_service_running on contint1003 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins
[07:20:05] PROBLEM - jenkins_service_running on contint1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins
[07:22:05] RECOVERY - jenkins_service_running on contint1003 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins
[07:25:05] PROBLEM - jenkins_service_running on contint1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins
[07:28:11] (CR) Volans: "Have you seen the existing chain of changes starting at" [software/pywmflib] - https://gerrit.wikimedia.org/r/1304617 (owner: JHathaway)
[07:30:32] (CR) Volans: "FYI Unfortunately I will not have time before my long break to take care of the merge and release of this chain of changes. If any of you " [software/pywmflib] - https://gerrit.wikimedia.org/r/1298541 (owner: Volans)
[07:44:39] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (Transport: Arelion (IC-398708) {#20260601}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:07:08] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[08:17:15] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[08:32:31] FIRING: Traffic bill over quota: Alert for device cr2-eqsin.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[08:52:31] RESOLVED: Traffic bill over quota: Alert for device cr2-eqsin.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[10:01:53] RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[11:44:39] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (Transport: Arelion (IC-398708) {#20260601}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[12:03:13] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[12:08:13] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[12:17:15] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[12:59:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and cr2-eqdfw (2a02:ec80:700:fe0b::1) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr2-eqdfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:04:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-magru and cr2-eqdfw (2a02:ec80:700:fe0b::1) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr2-eqdfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:31:02] !log arlolra@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply
[13:31:31] !log arlolra@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply
[13:31:32] !log arlolra@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply
[13:32:02] !log arlolra@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply
[15:44:39] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (Transport: Arelion (IC-398708) {#20260601}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[15:58:09] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal-main_443: Servers wdqs1025.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:58:19] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal-main_443: Servers wdqs1025.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[16:00:09] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:00:19] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:09:40] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:14:40] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:17:15] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[16:29:39] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (Transport: Arelion (IC-398708) {#20260601}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[16:55:23] SRE, SRE-swift-storage, Commons, media-backups, MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#12038337 (GPSLeo) >>! In T427949#11984649, @jcrespo wrote: > I don't think anyone is disputing that orthophotos can be educationally useful. The...
[17:24:37] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:24:37] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:17:15] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[20:29:40] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (Transport: Arelion (IC-398708) {#20260601}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[20:33:15] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 20%, RTA = 2134.74 ms
[20:35:03] FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:35:37] RECOVERY - Host titan1002 is UP: PING WARNING - Packet loss = 90%, RTA = 12.29 ms
[20:40:03] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:07:06] Puppet, Release-Engineering-Team: registry-homepage-builder.py doesn't sort images as expected - https://phabricator.wikimedia.org/T388287#12038469 (hashar) a:hashar The original issue (`-s#` suffixed tags coming before suffixless tags) has been fixed (see https://docker-registry.wikimedia.org/dev/bu...
[21:07:51] (PS1) Hashar: docker_registry: sort images by version [puppet] - https://gerrit.wikimedia.org/r/1304677 (https://phabricator.wikimedia.org/T388287)
[22:44:40] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (Transport: Arelion (IC-398708) {#20260601}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[23:19:10] FIRING: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:24:10] RESOLVED: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:42:09] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1304680
[23:42:09] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1304680 (owner: TrainBranchBot)
[23:50:17] (CR) CI reject: [V:-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1304680 (owner: TrainBranchBot)