[01:17:46] netops, Infrastructure-Foundations: codfw: upgrade routers (2026) - https://phabricator.wikimedia.org/T417871#11832284 (Papaul)
[01:59:09] Traffic, SRE: Investigate port 80 page in text@esams for Ipv6 - https://phabricator.wikimedia.org/T423667 (jasmine_) NEW
[03:24:04] Traffic, ProofreadPage: Page images disappearing on edit - https://phabricator.wikimedia.org/T423548#11832381 (Vladis13) After clicking on the image, the image container tag changes as follows. https://ru.wikisource.org/w/index.php?title=Страница:Вопросы_жизни._1905._№12.djvu/22&action=edit Was: `
Traffic, Maps, SRE: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T423672 (Abijithkumar2025) NEW Closing this task as invalid due to missing information.
[06:26:01] Traffic, Maps, SRE: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T423672#11832518 (Abijithkumar2025) Invalid→Open
[06:40:07] Traffic, Maps, SRE: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T423672#11832538 (Aklapper) Open→Invalid @Abijithkumar2025: Again: Please do not create empty tasks but fill out all fields.
[07:18:09] Traffic, ProofreadPage: Page images disappearing on edit - https://phabricator.wikimedia.org/T423548#11832603 (Xover) I can still reproduce the original problem in latest Safari, Chrome, and Firefox on macOS (latest). Both logged in and in private browsing (logged out). The JS console shows the error "E...
[08:18:40] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, and 2 others: gerrit: Adapt timeouts to avoid 502 errors in CI jobs - https://phabricator.wikimedia.org/T421827#11832698 (ABran-WMF) I have tried to limit `max_concurrent_streams` to 50, still inconclusive for the connection inter...
[08:41:43] FIRING: [2x] HaproxyKafkaExporterDown: HaproxyKafka on cp3070 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[08:41:43] FIRING: [5x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[08:42:40] FIRING: [6x] VarnishPrometheusExporterDown: Varnish Exporter on instance cp3066:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[08:42:40] FIRING: VarnishChildRestarted: varnish-text restarted on cp3073 - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000330/varnish-machine-stats?orgId=1&viewPanel=66&var-server=cp3073&datasource=esams%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DVarnishChildRestarted
[08:46:43] FIRING: [6x] HaproxyKafkaExporterDown: HaproxyKafka on cp3066 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[08:46:43] FIRING: [14x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[08:47:40] FIRING: [6x] VarnishPrometheusExporterDown: Varnish Exporter on instance cp3066:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[08:47:40] FIRING: [2x] VarnishChildRestarted: varnish-text restarted on cp3072 - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishChildRestarted
[08:50:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp3067:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=esams%20prometheus/ops&var-instance=cp3067&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[08:51:43] FIRING: [6x] HaproxyKafkaExporterDown: HaproxyKafka on cp3066 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[08:52:40] FIRING: [6x] VarnishPrometheusExporterDown: Varnish Exporter on instance cp3066:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[08:52:40] FIRING: [3x] VarnishChildRestarted: varnish-text restarted on cp3066 - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishChildRestarted
[08:55:29] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp3067:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=esams%20prometheus/ops&var-instance=cp3067&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[08:56:43] FIRING: [6x] HaproxyKafkaExporterDown: HaproxyKafka on cp3066 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[08:56:43] RESOLVED: [14x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[08:57:40] FIRING: [6x] VarnishPrometheusExporterDown: Varnish Exporter on instance cp3066:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[08:57:40] FIRING: [4x] VarnishChildRestarted: varnish-text restarted on cp3066 - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishChildRestarted
[09:02:40] FIRING: [6x] VarnishPrometheusExporterDown: Varnish Exporter on instance cp3066:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[09:07:40] FIRING: [5x] VarnishChildRestarted: varnish-text restarted on cp3066 - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishChildRestarted
[09:10:25] FIRING: SystemdUnitFailed: user@499.service on cp3069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:15:25] RESOLVED: SystemdUnitFailed: user@499.service on cp3069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:16:58] FIRING: HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=esams&var-instance=cp3069&viewPanel=panel-19 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[09:26:58] RESOLVED: HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=esams&var-instance=cp3069&viewPanel=panel-19 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[09:32:03] looking at active alerts, I found https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DCheck%20if%20Pybal%20has%20been%20restarted%20after%20pybal.conf%20was%20changed - is that a problem?
[09:36:43] FIRING: [2x] HaproxyKafkaExporterDown: HaproxyKafka on cp3068 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[09:37:40] FIRING: [2x] VarnishPrometheusExporterDown: Varnish Exporter on instance cp3068:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[09:39:25] FIRING: [2x] SystemdUnitFailed: haproxy.service on cp3068:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:41:43] FIRING: [3x] HaproxyKafkaExporterDown: HaproxyKafka on cp3068 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[09:44:57] RESOLVED: [2x] SystemdUnitFailed: haproxy.service on cp3068:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:47:40] RESOLVED: [2x] VarnishPrometheusExporterDown: Varnish Exporter on instance cp3068:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[09:51:43] RESOLVED: [4x] HaproxyKafkaExporterDown: HaproxyKafka on cp3068 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[10:36:13] netops, bacula, Data-Persistence-Backup, Infrastructure-Foundations, and 2 others: netbox2003 backups (maybe others?) are missconfigured or failing to find the configured directory - https://phabricator.wikimedia.org/T423689 (jcrespo) NEW
[10:36:53] netops, bacula, Data-Persistence-Backup, Infrastructure-Foundations, and 2 others: netbox2003 backups (maybe others?) are missconfigured or failing to find the configured directory - https://phabricator.wikimedia.org/T423689#11833112 (jcrespo) Let me know if this box or any other requires investi...
[10:38:56] netops, bacula, Data-Persistence-Backup, Infrastructure-Foundations, and 2 others: netbox2003 backups (maybe others?) are missconfigured or failing to find the backup directory - https://phabricator.wikimedia.org/T423689#11833117 (jcrespo)
[12:00:12] XioNoX: If you can see the scrollback on this channel for yesterday, that's where I merged a change but requested that someone from traffic hold my hand for the pybal restart. It's for this: https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/IPIP#Kubernetes_API_(control_plane)
[12:00:48] ok, as long as it's a known thing
[12:00:50] I didn't get any volunteers, though. I also mentioned it in -operations when the alert popped up.
[12:03:29] Yep, that alert timing definitely matches with my change. I hadn't expected to have to leave it until Monday. I mean, I could restart pybal myself, but in the past the kind folk from traffic have kindly assisted, so I thought it worth checking.
[12:22:19] Traffic, Liberica, Prod-Kubernetes, Data-Platform-SRE (2026-03-27 - 2026-04-17), Kubernetes: Migrate DSE k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420437#11833348 (BTullis) We've deployed a change to the service catalog, but not restarted pybal yet. I was hoping...
[13:07:55] FIRING: [5x] VarnishChildRestarted: varnish-text restarted on cp3066 - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishChildRestarted
[13:40:06] netops, bacula, Data-Persistence-Backup, Infrastructure-Foundations, and 2 others: netbox2003 backups (maybe others?) are missconfigured or failing to find the backup directory - https://phabricator.wikimedia.org/T423689#11833550 (ayounsi) We're not doing Netbox CSV dumps anymore. So you can remo...
[13:56:55] netops, bacula, Data-Persistence-Backup, Infrastructure-Foundations, and 3 others: netbox2003 backups (maybe others?) are missconfigured or failing to find the backup directory - https://phabricator.wikimedia.org/T423689#11833584 (jcrespo) That was the only thing being backed up. ` bacula::di...
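The HaproxyKafkaSocketDroppedMessages alerts between 08:41 and 09:26 fire on a "sustained high rate" of dropped messages. The rule itself is not shown in this log. The sketch below is a minimal illustration that assumes a Prometheus-style rule: a cumulative dropped-messages counter is turned into a per-second rate, and the alert only fires once that rate has stayed above a threshold for a hold period (the `for:` clause). The `Sample` type, the metric, and the `threshold` and `hold_seconds` values are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Sample:
    t: float        # scrape time, Unix seconds
    dropped: float  # cumulative dropped-messages counter (hypothetical metric)


def rates(samples: list[Sample]) -> list[tuple[float, float]]:
    """Per-second rate between consecutive counter samples, keyed by end time."""
    out = []
    for prev, cur in zip(samples, samples[1:]):
        dt = cur.t - prev.t
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        delta = cur.dropped - prev.dropped
        if delta < 0:
            # Counter went down: assume a reset (e.g. the process exposing it
            # restarted), so everything counted since the reset is new.
            delta = cur.dropped
        out.append((cur.t, delta / dt))
    return out


def is_firing(samples: list[Sample], threshold: float, hold_seconds: float) -> bool:
    """True if the rate has stayed above `threshold` for at least `hold_seconds`.

    Mirrors a `for:` hold period: the first breaching evaluation starts a
    pending period, and any non-breaching evaluation cancels it.
    """
    run_start = None
    for t, rate in rates(samples):
        if rate > threshold:
            if run_start is None:
                run_start = t
        else:
            run_start = None
    if run_start is None:
        return False
    return samples[-1].t - run_start >= hold_seconds
```

With samples every 60 s and a steady 10 messages/s, `is_firing(samples, threshold=5, hold_seconds=120)` only becomes true once the breach has lasted two minutes. Under this assumption, a short spike (such as the one around a restart) would not page.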
[13:57:40] FIRING: [5x] VarnishChildRestarted: varnish-text restarted on cp3066 - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishChildRestarted
[14:00:28] netops, bacula, Data-Persistence-Backup, Infrastructure-Foundations, and 3 others: netbox2003 backups (maybe others?) are missconfigured or failing to find the backup directory - https://phabricator.wikimedia.org/T423689#11833590 (ayounsi) Yeah, Postgres is where all the data are. So +1 to not ba...
[14:02:40] FIRING: [5x] VarnishChildRestarted: varnish-text restarted on cp3066 - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishChildRestarted
[14:02:43] netops, bacula, Data-Persistence-Backup, Infrastructure-Foundations, and 3 others: netbox2003 backups (maybe others?) are missconfigured or failing to find the backup directory - https://phabricator.wikimedia.org/T423689#11833597 (jcrespo) p:Triage→Medium a:jcrespo
[14:07:40] RESOLVED: [5x] VarnishChildRestarted: varnish-text restarted on cp3066 - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishChildRestarted
[14:08:30] Traffic, Liberica, Prod-Kubernetes, Data-Platform-SRE (2026-03-27 - 2026-04-17), Kubernetes: Migrate DSE k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420437#11833605 (JMeybohm) >>! In T420437#11833348, @BTullis wrote: > We've deployed a change to the service catalog...
[14:56:10] Traffic, Product Safety and Integrity, Documentation, WE4.2 Bot detection: hcaptcha proxy: update wikitech page - https://phabricator.wikimedia.org/T411131#11833812 (Raine) a:Raine→None
[15:41:16] fabfur: cp2042 has had puppet disabled for almost a month and is now spamming root about expired certificates, please fix
[16:01:19] taavi: ok, it can be dismissed but I still need it and cp2041 to do some tests. If I run the decommission cookbook it also wipes the partitions so I must find a way to remove it from puppet without destroying it
[16:39:02] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN:Switch refresh diagram and wiring - https://phabricator.wikimedia.org/T423724 (Papaul) NEW
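The active alert raised at 09:32, "Check if Pybal has been restarted after pybal.conf was changed", matches the situation described from 12:00 onward: the service catalog change was deployed, but PyBal was not restarted. The check's real implementation is not in this log. As a minimal sketch, assuming it compares the config file's modification time with the running process's start time on Linux, the logic could look like this. The function names are hypothetical, and no path for pybal.conf is assumed.

```python
import os


def pybal_needs_restart(config_mtime: float, process_start: float) -> bool:
    """True if the config file changed after the process started.

    Both arguments are Unix timestamps in seconds. A process that started
    after the last config change has already loaded that config.
    """
    return config_mtime > process_start


def linux_process_start(pid: int) -> float:
    """Start time of `pid` as a Unix timestamp, read from /proc (Linux only)."""
    # Boot time in Unix seconds, from the "btime" line of /proc/stat.
    with open("/proc/stat") as f:
        btime = next(int(line.split()[1]) for line in f if line.startswith("btime "))
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # Field 2 (comm) is parenthesised and may contain spaces, so split only
    # what follows the last ")"; that remainder starts at field 3 (state).
    fields = stat[stat.rindex(")") + 2:].split()
    # Field 22 (starttime) counts clock ticks since boot: index 22 - 3 = 19.
    start_ticks = int(fields[19])
    return btime + start_ticks / os.sysconf("SC_CLK_TCK")
```

A caller would combine the two as `pybal_needs_restart(os.stat(config_path).st_mtime, linux_process_start(pid))`. Here `config_path` and `pid` stand in for the real config location and PyBal's process ID, neither of which appears in the log. Keeping the comparison as a pure function makes it testable without a running daemon.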