[00:05:25] FIRING: SystemdUnitFailed: clean-stale-certs.service on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:05:40] FIRING: SystemdUnitFailed: clean-stale-certs.service on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:17:43] FIRING: [8x] HaproxyKafkaSocketDroppedMessages: Unexpected rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[05:22:43] FIRING: [51x] HaproxyKafkaSocketDroppedMessages: Unexpected rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[05:27:43] RESOLVED: [52x] HaproxyKafkaSocketDroppedMessages: Unexpected rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[07:14:31] ncredir3003, doh3003 and durum3003 will each briefly go down while being switch from DRBD to plain storage as part of the esams migration to routed Ganeti
[07:17:01] ack, tnzx
[07:17:05] *tnx
[07:24:00] FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum3003:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=esams&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[07:29:00] FIRING: [2x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh3003:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=esams&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[07:34:01] RESOLVED: [2x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh3003:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=esams&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[07:49:43] Traffic, Data-Engineering-Radar, HaproxyKafka, Observability-Logging, Patch-For-Review: Shutdown varnishkafka webrequest instances - https://phabricator.wikimedia.org/T393772#11135033 (Fabfur) In progress→Resolved With https://gerrit.wikimedia.org/r/c/operations/puppet/+/1183081 I...
[07:49:51] Traffic, Data-Engineering-Radar, HaproxyKafka, Observability-Logging, Patch-For-Review: Shutdown varnishkafka webrequest instances - https://phabricator.wikimedia.org/T393772#11135036 (Fabfur)
[08:05:40] FIRING: SystemdUnitFailed: clean-stale-certs.service on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:14:38] netops, Infrastructure-Foundations: FPC1 Failure on cr1-esams - take 2 - https://phabricator.wikimedia.org/T403360 (ayounsi) NEW p:Triage→High
[08:46:49] netops, Infrastructure-Foundations: FPC1 Failure on cr1-esams - take 2 - https://phabricator.wikimedia.org/T403360#11135283 (ayounsi) JTAC case 2025-0901-838306 opened.
[09:32:10] Traffic, SRE, Patch-For-Review, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11135537 (Vgutierrez) >>! In T400119#11134249, @Don-vip wrote: > I still have [[ https://gitlab.wikimedia.org/toolforge-repos/spacemedia/-/jobs/601049 |...
[10:50:48] Traffic, SRE, Wikidata, Wikidata-Query-Service: Find a solution for SPARQL federation that is blocked by stricter user agent policy enforcement - https://phabricator.wikimedia.org/T402959#11135769 (Lydia_Pintscher) >>! In T402959#11132802, @CDanis wrote: > Hi @Lydia_Pintscher , SRE can make some...
[10:53:39] Traffic, SRE, Patch-For-Review, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11135773 (Don-vip) >>! In T400119#11135537, @Vgutierrez wrote: > It looks like that test for some reason is using the default UA of the HttpClient librar...
[10:57:38] Traffic, Hiddenparma: Add known-client-ingestion-source objects and logic - https://phabricator.wikimedia.org/T402014#11135798 (JMeybohm) a:JMeybohm
[11:00:18] Traffic, Hiddenparma: Add ipblock-source objects and logic - https://phabricator.wikimedia.org/T402014#11135801 (JMeybohm)
[11:23:39] ncredir3004, doh3004 and durum3004 will each briefly go down while being switch from DRBD to plain storage as part of the esams migration to routed Ganeti
[11:48:00] FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum3004:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=esams&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[11:53:00] RESOLVED: [2x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh3004:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=esams&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[12:30:43] FIRING: [10x] HaproxyKafkaSocketDroppedMessages: Unexpected rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[12:35:43] FIRING: [35x] HaproxyKafkaSocketDroppedMessages: Unexpected rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[12:40:43] RESOLVED: [39x] HaproxyKafkaSocketDroppedMessages: Unexpected rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[14:40:45] netops, DC-Ops, Infrastructure-Foundations, SRE: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11136677 (ayounsi) p:Triage→Low
[14:40:48] FIRING: PuppetFailure: Puppet has failed on cp1114:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:46:36] ^^ this is resolved
[14:46:47] catalog is applied correctly on that host
[14:50:48] RESOLVED: PuppetFailure: Puppet has failed on cp1114:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:04:00] FIRING: PurgedHighEventLag: High event process lag with purged on cp5030:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin%20prometheus/ops&var-instance=cp5030 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[15:09:00] RESOLVED: [2x] PurgedHighEventLag: High event process lag with purged on cp5030:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin%20prometheus/ops&var-instance=cp5030 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[15:21:41] Traffic, SRE, Patch-For-Review, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11136788 (DavidBrooks) Re the comment: "Allow user-agents with contact information" - implies blocking UAs with no contact information. Is this referring...
[15:25:27] Traffic, SRE, Patch-For-Review, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11136791 (Vgutierrez) >>! In T400119#11136788, @DavidBrooks wrote: > Re the comment: "Allow user-agents with contact information" - implies blocking UAs...
[15:48:43] FIRING: [9x] HaproxyKafkaSocketDroppedMessages: Unexpected rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[15:53:43] FIRING: [32x] HaproxyKafkaSocketDroppedMessages: Unexpected rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[15:53:58] FIRING: [32x] HaproxyKafkaSocketDroppedMessages: Unexpected rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[15:58:43] RESOLVED: [35x] HaproxyKafkaSocketDroppedMessages: Unexpected rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[19:47:18] netops, Infrastructure-Foundations: FPC1 Failure on cr1-esams - take 2 - https://phabricator.wikimedia.org/T403360#11137375 (ayounsi) JTAC replied saying that a linecard reboot could fix the issue. I did it (see SAL above), so far still working fine, will check tomorrow morning.
[20:46:35] Traffic, DNS, SRE, WikiLearn: DNS records for WikiLearn - https://phabricator.wikimedia.org/T365435#11137439 (Ijon) Open→Declined Thanks for the ping. We are indeed resolving it by using an address in learn.wiki. This ticket can be closed.