[00:05:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:06:43] (PS1) Zabe: Disable QueryPage updates for Special:Unusedtemplates on testscommons [mediawiki-config] - https://gerrit.wikimedia.org/r/1267286 (https://phabricator.wikimedia.org/T422062)
[00:07:08] (CR) Dzahn: [C:+2] "still need to follow-up because:" [puppet] - https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109) (owner: Dzahn)
[00:09:26] (PS1) Eevans: linked_artifacts: remove temorary egress acl [deployment-charts] - https://gerrit.wikimedia.org/r/1267287 (https://phabricator.wikimedia.org/T421444)
[00:09:47] (PS2) Zabe: Disable QueryPage updates for Special:Unusedtemplates on testcommonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1267286 (https://phabricator.wikimedia.org/T422062)
[00:10:59] (PS2) Eevans: linked_artifacts: remove temporary egress acl [deployment-charts] - https://gerrit.wikimedia.org/r/1267287 (https://phabricator.wikimedia.org/T421444)
[00:12:41] (PS1) Eevans: cassandra-dev: remove temporary configuration [puppet] - https://gerrit.wikimedia.org/r/1267289 (https://phabricator.wikimedia.org/T421444)
[00:15:34] (CR) Eevans: [C:-1] "Copied votes on follow-up patch sets have been updated:" [deployment-charts] - https://gerrit.wikimedia.org/r/1267287 (https://phabricator.wikimedia.org/T421444) (owner: Eevans)
[00:15:43] (CR) Eevans: [C:-1] cassandra-dev: remove temporary configuration [puppet] - https://gerrit.wikimedia.org/r/1267289 (https://phabricator.wikimedia.org/T421444) (owner: Eevans)
[00:16:26] (PS1) Dzahn: ci: add missing .gpg extension to key file name [puppet] - https://gerrit.wikimedia.org/r/1267290 (https://phabricator.wikimedia.org/T418109)
[00:17:13] (CR) Dzahn: [C:+2] ci: add missing .gpg extension to key file name [puppet] - https://gerrit.wikimedia.org/r/1267290 (https://phabricator.wikimedia.org/T418109) (owner: Dzahn)
[00:17:21] (PS2) Dzahn: ci: add missing .gpg extension to key file name [puppet] - https://gerrit.wikimedia.org/r/1267290 (https://phabricator.wikimedia.org/T418109)
[00:19:49] (CR) Dzahn: [C:+2] ci: add missing .gpg extension to key file name [puppet] - https://gerrit.wikimedia.org/r/1267290 (https://phabricator.wikimedia.org/T418109) (owner: Dzahn)
[00:44:06] (CR) Zabe: [C:+2] Disable QueryPage updates for Special:Unusedtemplates on testcommonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1267286 (https://phabricator.wikimedia.org/T422062) (owner: Zabe)
[00:44:09] (PS1) Dzahn: Revert^4 "releases: upgrade Java version from 17 to 21" [puppet] - https://gerrit.wikimedia.org/r/1267301
[00:45:58] (Merged) jenkins-bot: Disable QueryPage updates for Special:Unusedtemplates on testcommonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1267286 (https://phabricator.wikimedia.org/T422062) (owner: Zabe)
[00:51:18] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1267286|Disable QueryPage updates for Special:Unusedtemplates on testcommonswiki (T422062)]]
[00:51:21] T422062: MediaWiki periodic job update-special-pages-s4 failed - https://phabricator.wikimedia.org/T422062
[00:53:28] !log zabe@deploy1003 zabe: Backport for [[gerrit:1267286|Disable QueryPage updates for Special:Unusedtemplates on testcommonswiki (T422062)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:53:55] !log zabe@deploy1003 zabe: Continuing with sync
[00:58:08] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1267286|Disable QueryPage updates for Special:Unusedtemplates on testcommonswiki (T422062)]] (duration: 06m 50s)
[00:58:11] T422062: MediaWiki periodic job update-special-pages-s4 failed - https://phabricator.wikimedia.org/T422062
[00:59:11] !log zabe@deploy1003:~$ mwscript updateSpecialPages.php testcommonswiki # T422062
[00:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:09:54] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1267312
[01:09:54] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1267312 (owner: TrainBranchBot)
[01:22:57] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1267312 (owner: TrainBranchBot)
[02:00:48] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:07:09] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 21s)
[02:09:14] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:14] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:35:04] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:37:04] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:40:50] (PS1) Andrew Bogott: magnum/codfw1dev: try using the same chart repo as eqiad1 [puppet] - https://gerrit.wikimedia.org/r/1267436
[03:41:31] (CR) Andrew Bogott: [C:+2] magnum/codfw1dev: try using the same chart repo as eqiad1 [puppet] - https://gerrit.wikimedia.org/r/1267436 (owner: Andrew Bogott)
[03:42:05] (PS2) Andrew Bogott: magnum/codfw1dev: try using the same chart repo as eqiad1 [puppet] - https://gerrit.wikimedia.org/r/1267436
[03:47:05] (CR) Andrew Bogott: [C:+2] magnum/codfw1dev: try using the same chart repo as eqiad1 [puppet] - https://gerrit.wikimedia.org/r/1267436 (owner: Andrew Bogott)
[03:50:11] (PS1) Krinkle: Enable wgTrackMediaRequestProvenance on most group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1267437 (https://phabricator.wikimedia.org/T414338)
[03:50:24] (PS2) Krinkle: Enable wgTrackMediaRequestProvenance on most group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1267437 (https://phabricator.wikimedia.org/T414338)
[03:50:37] (PS3) Krinkle: Enable wgTrackMediaRequestProvenance on most group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1267437 (https://phabricator.wikimedia.org/T414338)
[03:50:50] (PS1) Krinkle: WIP test [mediawiki-config] - https://gerrit.wikimedia.org/r/1267438
[03:51:47] (CR) CI reject: [V:-1] Enable wgTrackMediaRequestProvenance on most group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1267437 (https://phabricator.wikimedia.org/T414338) (owner: Krinkle)
[03:51:48] (CR) CI reject: [V:-1] WIP test [mediawiki-config] - https://gerrit.wikimedia.org/r/1267438 (owner: Krinkle)
[03:52:01] (Abandoned) Krinkle: WIP test [mediawiki-config] - https://gerrit.wikimedia.org/r/1267438 (owner: Krinkle)
[03:53:36] (PS4) Krinkle: Enable wgTrackMediaRequestProvenance on most group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1267437 (https://phabricator.wikimedia.org/T414338)
[03:56:00] (PS1) C. Scott Ananian: ParserMigration: transition to new configuration variables [mediawiki-config] - https://gerrit.wikimedia.org/r/1267439
[04:05:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:46:15] FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:51:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:56:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:56:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:57:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[05:00:20] !
[05:01:03] !incidents
[05:01:03] 7808 (UNACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqiad)
[05:01:03] 7807 (RESOLVED) ProbeDown sre (2620:0:861:ed1a::1 ip6 text:80 probes/service http_text_ip6 eqiad)
[05:01:03] 7804 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqiad)
[05:01:04] 7803 (RESOLVED) [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet)
[05:01:10] !ack 7808
[05:01:10] 7808 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqiad)
[05:01:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:04:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ...
[05:04:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[05:06:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:11:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:16:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:17:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[05:19:00] !incidents
[05:19:01] 7808 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqiad)
[05:19:01] 7807 (RESOLVED) ProbeDown sre (2620:0:861:ed1a::1 ip6 text:80 probes/service http_text_ip6 eqiad)
[05:19:01] 7804 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqiad)
[05:19:01] 7803 (RESOLVED) [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet)
[05:21:15] RESOLVED: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:22:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[05:24:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ...
[05:24:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[05:28:15] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:33:15] RESOLVED: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:36:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:39:16] ops-eqiad, SRE, DC-Ops: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11784985 (VRiley-WMF) So, it seems that Dell has changed their mind on me when it comes to sending out the new drive even after I've requested it. They have requested the fo...
[05:40:27] ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T421439#11784986 (VRiley-WMF) Open→Resolved a:VRiley-WMF Understood. I will close this ticket for now.
[05:46:15] RESOLVED: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:48:22] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:53:12] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260403T0600)
[06:42:06] SRE, SRE-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11785033 (MoritzMuehlenhoff) Sure, no problem. Just post the new key here and we'll get it updated.
[06:42:07] (CR) Ayounsi: "LGTM! It's good to deploy as it, but I left some comments for a larger cleanup. In this CR or another one." [homer/public] - https://gerrit.wikimedia.org/r/1267170 (https://phabricator.wikimedia.org/T421238) (owner: Cathal Mooney)
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260403T0700)
[07:38:02] (CR) Elukey: opensearch-semantic-search-test: Add to services proxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293) (owner: Bking)
[07:51:13] (PS1) Elukey: Add istio 1.24 config for k8s-aux [deployment-charts] - https://gerrit.wikimedia.org/r/1267568 (https://phabricator.wikimedia.org/T414486)
[08:03:26] (PS1) Elukey: role::kafka::test::broker: move to Trixie and jdk 21 [puppet] - https://gerrit.wikimedia.org/r/1267578
[08:04:55] (CR) Elukey: [C:+2] role::kafka::test::broker: move to Trixie and jdk 21 [puppet] - https://gerrit.wikimedia.org/r/1267578 (owner: Elukey)
[08:05:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:06:25] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-test1007.eqiad.wmnet with OS trixie
[08:18:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[08:22:43] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-test1007.eqiad.wmnet with reason: host reimage
[08:29:48] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-test1007.eqiad.wmnet with reason: host reimage
[08:43:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[08:48:44] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-test1007.eqiad.wmnet with OS trixie
[08:50:11] (PS1) Elukey: tox: rework venvs to speed up local and CI timings [software/spicerack] - https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475)
[08:51:41] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-test1008.eqiad.wmnet with OS trixie
[08:53:57] (CR) Elukey: "https://integration.wikimedia.org/ci/job/tox/9415/console : SUCCESS in 3m 04s" [software/spicerack] - https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475) (owner: Elukey)
[08:56:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:57:42] (CR) Elukey: "Overall:" [software/spicerack] - https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475) (owner: Elukey)
[09:04:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[09:07:41] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-test1008.eqiad.wmnet with reason: host reimage
[09:14:45] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-test1008.eqiad.wmnet with reason: host reimage
[09:29:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[09:33:34] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-test1008.eqiad.wmnet with OS trixie
[09:36:11] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-test1009.eqiad.wmnet with OS trixie
[09:48:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[09:48:49] (PS1) Brouberol: move opensearch-semantic-search.discovery.wmnet to dse-k8s-eqiad only [dns] - https://gerrit.wikimedia.org/r/1267729
[09:50:35] (CR) DCausse: [C:+1] move opensearch-semantic-search.discovery.wmnet to dse-k8s-eqiad only [dns] - https://gerrit.wikimedia.org/r/1267729 (owner: Brouberol)
[09:52:16] (CR) Brouberol: [C:+2] move opensearch-semantic-search.discovery.wmnet to dse-k8s-eqiad only [dns] - https://gerrit.wikimedia.org/r/1267729 (owner: Brouberol)
[09:52:44] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-test1009.eqiad.wmnet with reason: host reimage
[09:52:45] !log brouberol@dns1004 START - running authdns-update
[09:54:34] !log brouberol@dns1004 END - running authdns-update
[09:59:04] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-test1009.eqiad.wmnet with reason: host reimage
[10:13:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[10:17:51] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-test1009.eqiad.wmnet with OS trixie
[10:19:16] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-test1010.eqiad.wmnet with OS trixie
[10:29:31] !log urbanecm@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
[10:30:17] !log urbanecm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
[10:31:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[10:32:54] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-test1010.eqiad.wmnet with reason: host reimage
[10:39:06] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-test1010.eqiad.wmnet with reason: host reimage
[10:51:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[10:55:22] SRE, SRE-Access-Requests, Wikidata Platform Team, Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11785400 (Gehel) No objection from me. There is a good chance that Andrea will need access to all WDQS nodes a...
[10:56:22] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-test1010.eqiad.wmnet with OS trixie
[11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260403T0700)
[11:00:05] jelto, arnoldokoth, mutante, and arnaudb: #bothumor My software never has bugs. It just develops random features. Rise for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260403T1100).
[11:40:23] SRE, Wikimedia-Mailing-lists: daily-article-l broken - https://phabricator.wikimedia.org/T422144#11785430 (Peachey88) Open→Resolved a:MZMcBride Asking Mz was a easy step, it should now be fixed.
[12:05:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:16:06] !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[12:18:10] FIRING: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[12:21:09] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11785467 (VRiley-WMF)
[12:22:12] SRE, Pywikibot, Traffic, Wikidata, and 2 others: Pywikibot login to Wikidata generates maxlag retry error - https://phabricator.wikimedia.org/T421642#11785468 (Xqt)
[12:22:33] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update network and mgmt - jclark@cumin1003"
[12:22:39] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update network and mgmt - jclark@cumin1003"
[12:22:39] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:23:10] RESOLVED: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[12:37:22] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11785476 (Jclark-ctr) @Jgreen I’ve set this one up. Please ping me when you get a chance. This is a newer version of the BMC, and it has some interesting issues we’re still wo...
[12:37:44] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11785477 (Jclark-ctr) a:Jclark-ctr→Jgreen
[12:48:51] SRE, Pywikibot, Traffic, Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11785487 (Xqt)
[12:56:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:59:41] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11785498 (VRiley-WMF)
[13:03:50] !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[13:06:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:08:07] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [ganeti1055] - vriley@cumin1003"
[13:08:12] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [ganeti1055] - vriley@cumin1003"
[13:08:12] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:22:22] !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[13:24:47] ops-eqiad, SRE, DC-Ops: Q3:rack/setup/install payments1009 - https://phabricator.wikimedia.org/T416253#11785520 (Jgreen) Hey @VRiley-WMF I'm not able to log in. Can you check the user/password?
[13:25:08] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:26:07] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11785521 (VRiley-WMF)
[13:26:38] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1058
[13:26:56] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1058
[13:27:49] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1058.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:34:33] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1058.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:38:24] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1058.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:38:53] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1058.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:39:53] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1058.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:43:00] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1058.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:59:26] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users for annet - https://phabricator.wikimedia.org/T422251 (AnneT) NEW
[14:03:50] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1264856 (https://phabricator.wikimedia.org/T419619) (owner: KineticPelagic)
[14:31:40] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11785643 (VRiley-WMF) These servers aren't starting out all that great. 3 of them (ganeti1055,1056 and 1057) can't power on. There are lights on the power supplies...
[14:33:49] Hey all - would like to get a temporary security patch deployed for T422244 going into the weekend. Let me know if I shouldn’t.
[14:40:54] !log herron@cumin1003 START - Cookbook sre.hosts.reimage for host kafkamon2003.codfw.wmnet with OS trixie
[14:41:22] (PS1) Brouberol: Add analytics-fr-tech to the analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1267865 (https://phabricator.wikimedia.org/T416457)
[14:44:57] (PS1) Andrew Bogott: magnum: bump openstack helm chart version [puppet] - https://gerrit.wikimedia.org/r/1267868
[14:46:53] (CR) Andrew Bogott: [C:+2] magnum: bump openstack helm chart version [puppet] - https://gerrit.wikimedia.org/r/1267868 (owner: Andrew Bogott)
[14:49:14] FIRING: JobUnavailable: Reduced availability for job burrow in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:52:33] (CR) Bking: [C:+1] Add analytics-fr-tech to the analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1267865 (https://phabricator.wikimedia.org/T416457) (owner: Brouberol)
[14:52:54] !log Deployed security mitigation for T422244
[14:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:12] (CR) Brouberol: [C:+2] Add analytics-fr-tech to the analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1267865 (https://phabricator.wikimedia.org/T416457) (owner: Brouberol)
[14:58:12] !log herron@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafkamon2003.codfw.wmnet with reason: host reimage
[15:04:18] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafkamon2003.codfw.wmnet with reason: host reimage
[15:16:43] (CR) SBassett: [C:+1] Allow-list some additional domains to the currently enforcing CSP (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1263948 (https://phabricator.wikimedia.org/T421637) (owner: WikiBayer)
[15:23:14] (PS1) Hashar: profile::ci::package_builder: force link of pbuilder cache [puppet] - https://gerrit.wikimedia.org/r/1267883 (https://phabricator.wikimedia.org/T421114)
[15:25:41] (CR) Hashar: "I have cherry picked it on the CI Puppet server (integration-puppetserver-01.integration.eqiad1.wikimedia.cloud) and that fixed the run:" [puppet] - https://gerrit.wikimedia.org/r/1267883 (https://phabricator.wikimedia.org/T421114) (owner: Hashar)
[15:27:30] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[15:31:24] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding conf2008 to codfw - jhancock@cumin2002"
[15:31:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding conf2008 to codfw - jhancock@cumin2002"
[15:31:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:31:47] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host conf2007
[15:31:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host conf2007
[15:32:00] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host conf2008
[15:32:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host conf2008
[15:32:13] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host conf2009
[15:32:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host conf2009
[15:32:52] SRE, SRE-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11785798 (AWesterinen-WMF) Here is the new key: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAICpR6EHfAQ8h1++hJVZD2aHQsiU8EhKS/oZWr...
[15:37:55] (PS1) Hashar: profile::ci::package_builder: create aptcache dir [puppet] - https://gerrit.wikimedia.org/r/1267887 (https://phabricator.wikimedia.org/T421114)
[15:40:05] ops-codfw, SRE, DC-Ops: Alert for device lsw1-c7-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T422058#11785806 (Jhancock.wm) Open→Resolved a:Jhancock.wm provisioned ports on new server. cleared.
[15:40:13] (CR) Hashar: "I have cherry picked it on the CI Puppet server (integration-puppetserver-01.integration.eqiad1.wikimedia.cloud) and that fixed the run:" [puppet] - https://gerrit.wikimedia.org/r/1267887 (https://phabricator.wikimedia.org/T421114) (owner: Hashar)
[15:44:49] ops-codfw, SRE, DC-Ops: Alert for device lsw1-b4-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T422061#11785829 (Jhancock.wm) Open→Resolved a:Jhancock.wm had the wrong cable connected for a server that's still in provisioning st...
[15:56:20] (03PS7) 10Jasmine: service::catalog: add sophroid service catalog entry [puppet] - 10https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748) [15:59:00] (03CR) 10Jasmine: service::catalog: add sophroid service catalog entry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [16:02:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11785862 (10VRiley-WMF) @Jhancock.wm recommended to swap the PSU's with the known working one (1058) with one of the ones that isn't working (I chose 1055) after th... [16:05:17] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS trixie [16:05:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:07:42] (03PS2) 10Elukey: tox: rework venvs to speed up local and CI timings [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475) [16:09:14] FIRING: [3x] JobUnavailable: Reduced availability for job burrow in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:26] (03PS6) 10Elukey: Move linting to Ruff and apply code fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) [16:09:26] (03PS3) 10Elukey: tox: rework venvs to speed up local and CI timings [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475) [16:11:16] 
10ops-eqiad, 06SRE, 06DC-Ops: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11785892 (10BCornwall) No problem. I just re-ran it and can confirm that the issues are still present. [16:12:33] (03CR) 10CI reject: [V:04-1] tox: rework venvs to speed up local and CI timings [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [16:18:30] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS trixie [16:24:12] (03PS1) 10Bking: dse-k8s: reduce readahead script timer frequency [puppet] - 10https://gerrit.wikimedia.org/r/1267898 (https://phabricator.wikimedia.org/T422262) [16:25:59] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1267898 (https://phabricator.wikimedia.org/T422262) (owner: 10Bking) [16:28:25] (03CR) 10Bking: [C:03+2] "self-merging as most people are out today and this is a pretty harmless change." 
[puppet] - 10https://gerrit.wikimedia.org/r/1267898 (https://phabricator.wikimedia.org/T422262) (owner: 10Bking) [16:33:12] (03CR) 10Dzahn: [C:03+2] Revert^4 "releases: upgrade Java version from 17 to 21" [puppet] - 10https://gerrit.wikimedia.org/r/1267301 (owner: 10Dzahn) [16:34:14] FIRING: [3x] JobUnavailable: Reduced availability for job burrow in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:55:29] (03PS2) 10Dzahn: jenkins: add profile::ci::docker to role [puppet] - 10https://gerrit.wikimedia.org/r/1267173 (https://phabricator.wikimedia.org/T418109) [16:55:48] (03CR) 10CI reject: [V:04-1] jenkins: add profile::ci::docker to role [puppet] - 10https://gerrit.wikimedia.org/r/1267173 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [16:56:15] (03PS3) 10Dzahn: jenkins: add profile::ci::docker to role [puppet] - 10https://gerrit.wikimedia.org/r/1267173 (https://phabricator.wikimedia.org/T418109) [16:56:41] (03CR) 10Dzahn: [V:03+1 C:03+2] "we can now do https://gerrit.wikimedia.org/r/c/operations/puppet/+/1267173 as the next step" [puppet] - 10https://gerrit.wikimedia.org/r/1260659 (https://phabricator.wikimedia.org/T418109) (owner: 10Hashar) [16:58:08] (03CR) 10Dzahn: [C:03+2] profile::ci::package_builder: create aptcache dir [puppet] - 10https://gerrit.wikimedia.org/r/1267887 (https://phabricator.wikimedia.org/T421114) (owner: 10Hashar) [16:58:26] (03CR) 10Dzahn: [C:03+2] profile::ci::package_builder: force link of pbuilder cache [puppet] - 10https://gerrit.wikimedia.org/r/1267883 (https://phabricator.wikimedia.org/T421114) (owner: 10Hashar) [17:03:57] (03CR) 10Dzahn: "Thanks for this. 
It's like this because I started with an expectation that zookeeper and certs are only needed on main hosts and we have s" [puppet] - 10https://gerrit.wikimedia.org/r/1267177 (https://phabricator.wikimedia.org/T422207) (owner: 10Dduvall) [17:07:00] (03CR) 10Dzahn: [C:04-1] "this would fail on executor nodes because then "did not find a value for the name 'profile::zuul::base::zookeeper_tls_fullchain'". will tr" [puppet] - 10https://gerrit.wikimedia.org/r/1267177 (https://phabricator.wikimedia.org/T422207) (owner: 10Dduvall) [17:11:08] (03PS3) 10Dzahn: zuul: Move cross-profile references to hiera [puppet] - 10https://gerrit.wikimedia.org/r/1267177 (https://phabricator.wikimedia.org/T422207) (owner: 10Dduvall) [17:18:19] (03CR) 10Dzahn: [V:03+1 C:03+1] "this PS should be noop now https://puppet-compiler.wmflabs.org/output/1267177/8374/" [puppet] - 10https://gerrit.wikimedia.org/r/1267177 (https://phabricator.wikimedia.org/T422207) (owner: 10Dduvall) [17:18:22] (03CR) 10Dzahn: [C:03+2] zuul: Move cross-profile references to hiera [puppet] - 10https://gerrit.wikimedia.org/r/1267177 (https://phabricator.wikimedia.org/T422207) (owner: 10Dduvall) [17:20:59] (03PS3) 10Dduvall: zuul: Fix nodepool zookeeper configuration [puppet] - 10https://gerrit.wikimedia.org/r/1267178 (https://phabricator.wikimedia.org/T422207) [17:26:22] (03CR) 10Dzahn: "Alright, this is like what I did for zuul-web / executor before but not yet for nodepool. 
It will mean another pair of certs but makes sen" [puppet] - 10https://gerrit.wikimedia.org/r/1267178 (https://phabricator.wikimedia.org/T422207) (owner: 10Dduvall) [17:26:36] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1267178/8375/zuul1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1267178 (https://phabricator.wikimedia.org/T422207) (owner: 10Dduvall) [17:26:51] (03CR) 10Dzahn: [V:03+1 C:03+2] zuul: Fix nodepool zookeeper configuration [puppet] - 10https://gerrit.wikimedia.org/r/1267178 (https://phabricator.wikimedia.org/T422207) (owner: 10Dduvall) [17:29:15] 06SRE, 10Pywikibot, 06Traffic, 10Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11786141 (10Xqt) [17:34:33] (03CR) 10Dzahn: [V:03+1 C:03+2] "systemctl status nodepool looks promising :))" [puppet] - 10https://gerrit.wikimedia.org/r/1267178 (https://phabricator.wikimedia.org/T422207) (owner: 10Dduvall) [17:34:50] (03CR) 10Dzahn: [V:03+1 C:03+2] "zuul-nodepool" [puppet] - 10https://gerrit.wikimedia.org/r/1267178 (https://phabricator.wikimedia.org/T422207) (owner: 10Dduvall) [17:39:32] (03CR) 10Dzahn: [C:04-1] "did not find a value for the name 'profile::ci::docker::settings'" [puppet] - 10https://gerrit.wikimedia.org/r/1267173 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [17:44:38] (03PS4) 10Dzahn: jenkins: add profile::ci::docker to role [puppet] - 10https://gerrit.wikimedia.org/r/1267173 (https://phabricator.wikimedia.org/T418109) [17:58:51] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1267173/8377/contint1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1267173 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [18:06:19] (03CR) 10Dzahn: [C:03+2] jenkins: add profile::ci::docker to role [puppet] - 10https://gerrit.wikimedia.org/r/1267173 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [18:08:51] 
(03PS1) 10RLazarus: mw-wikifunctions: Set $MCROUTER_SERVER in values-${ENV}.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267915 (https://phabricator.wikimedia.org/T411807) [18:09:14] RESOLVED: JobUnavailable: Reduced availability for job burrow in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:09:15] (03PS2) 10RLazarus: mw-wikifunctions: Set $MCROUTER_SERVER in values-${ENV}.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267915 (https://phabricator.wikimedia.org/T411807) [18:13:31] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [18:16:34] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafkamon2003.codfw.wmnet with OS trixie [18:19:51] !log herron@cumin1003 START - Cookbook sre.hosts.reimage for host kafkamon1003.eqiad.wmnet with OS trixie [18:21:00] (03CR) 10Herron: [C:03+1] thanos/store: add a scrape target for the ruler instance [puppet] - 10https://gerrit.wikimedia.org/r/1266067 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [18:22:58] (03CR) 10Scott French: [C:03+1] "Thanks, Reuven!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1267915 (https://phabricator.wikimedia.org/T411807) (owner: 10RLazarus) [18:24:35] (03CR) 10Dzahn: [C:03+2] "[contint2003:~] $ dpkg -l | grep docker" [puppet] - 10https://gerrit.wikimedia.org/r/1267173 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [18:29:14] FIRING: JobUnavailable: Reduced availability for job burrow in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:32:36] !log herron@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafkamon1003.eqiad.wmnet with reason: host reimage [18:33:11] (03PS1) 10Ottomata: mw-page-html-content-change-enrich - temporarily always consume from latest offsets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267917 (https://phabricator.wikimedia.org/T421216) [18:35:25] (03CR) 10Ottomata: [C:03+2] mw-page-html-content-change-enrich - temporarily always consume from latest offsets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267917 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [18:35:33] (03CR) 10Ottomata: [V:03+2 C:03+2] mw-page-html-content-change-enrich - temporarily always consume from latest offsets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267917 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [18:38:44] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [18:38:49] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [18:39:35] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafkamon1003.eqiad.wmnet with reason: host reimage [18:49:14] RESOLVED: JobUnavailable: Reduced availability for job burrow in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:52:13] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:52:57] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:56:25] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafkamon1003.eqiad.wmnet with OS trixie [18:56:57] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:57:05] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2007 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:57:13] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:58:05] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:02:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:28:31] RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound 
discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [20:00:40] (03PS1) 10Aude: Set $wgReadingListsEnableBetaQuickSurvey to true for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267946 (https://phabricator.wikimedia.org/T422275) [20:05:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:13:14] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [20:13:19] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [20:39:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:49:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:50:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11786569 (10BCornwall) We're going to be discussing whether we want to pursue this still, sorry for the premature bug report. We'll probably discuss next tuesday in our sync-up. 
[20:55:13] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:55:57] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:59:05] ^^ checking [21:02:13] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:02:57] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:05:57] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2012.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:06:04] (03CR) 10Scott French: mw-wikifunctions: Set $MCROUTER_SERVER in values-${ENV}.yaml (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267915 (https://phabricator.wikimedia.org/T411807) (owner: 10RLazarus) [21:06:13] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:06:57] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:09:57] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, 
wdqs2008.codfw.wmnet, wdqs2022.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:11:57] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:12:13] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:27:45] I am hitting the api-limit when testing for the monuments_db, we suspect it is because of no user_agent, but we are not completely sure, if I give the information I got from the request, can somebody check? [21:32:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:37:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:25:53] 06SRE, 06Data-Platform-SRE, 06Data-Engineering (Q4 FS25/26 April 1st - June 30st): Move Druid realtime configuration out of Refinery into standalone repo on GitLab - https://phabricator.wikimedia.org/T407994#11786776 (10Ahoelzl) [22:29:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:38:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - 
cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:41:57] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:42:13] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:51:57] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:52:13] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:55:13] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:55:57] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:01:57] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:02:13] RECOVERY - PyBal 
backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:02:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:06:57] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:10:13] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:11:57] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:12:13] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:15:13] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:15:57] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:16:57] RECOVERY - PyBal backends health check 
on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:17:13] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:20:57] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:21:13] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:23:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:30:57] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:31:13] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:34:13] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:34:57] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, 
wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:36:13] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:36:57] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:39:13] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:39:57] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:40:03] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1267994 [23:40:03] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1267994 (owner: 10TrainBranchBot) [23:42:57] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:43:13] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:45:57] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:46:13] PROBLEM - PyBal backends health 
check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:46:57] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:48:50] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on zuul1002.eqiad.wmnet with reason: T421398 [23:48:53] T421398: SystemdUnitFailed - zuul-executor - https://phabricator.wikimedia.org/T421398 [23:49:04] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on zuul2002.codfw.wmnet with reason: T421398 [23:49:57] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:50:57] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:54:12] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1267994 (owner: 10TrainBranchBot) [23:56:13] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:58:57] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:59:07] (03PS5) 10Krinkle: Enable wgTrackMediaRequestProvenance on most group1 wikis 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267437 (https://phabricator.wikimedia.org/T414338) [23:59:13] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal