[00:08:13] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1193272 [00:08:13] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1193272 (owner: TrainBranchBot) [00:10:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:28:54] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1193272 (owner: TrainBranchBot) [00:32:05] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1012.eqiad.wmnet with OS bookworm [00:38:54] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host backup1012.eqiad.wmnet with OS bookworm [00:43:53] !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host backup1012.eqiad.wmnet with OS bookworm [00:44:14] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host backup1012.eqiad.wmnet with OS bookworm [00:49:13] !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host backup1012.eqiad.wmnet with OS bookworm [00:49:32] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host backup1012.eqiad.wmnet with OS bookworm [00:55:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:56:42] !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage 
(exit_code=97) for host backup1012.eqiad.wmnet with OS bookworm [00:57:02] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host backup1012.eqiad.wmnet with OS bookworm [01:00:58] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:02:58] !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host backup1012.eqiad.wmnet with OS bookworm [01:03:17] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host backup1012.eqiad.wmnet with OS bookworm [01:11:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [01:15:10] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 14m 12s) [01:18:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [01:23:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [01:26:19] (PS1) Tim Starling: Ensure linkUpdateComplete handler is only run for entities [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1193281 [01:26:48] (PS2) Tim Starling: Ensure linkUpdateComplete handler is only run for entities [extensions/CommunityRequests] 
(wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1193281 (https://phabricator.wikimedia.org/T406192) [01:30:50] PROBLEM - Docker registry HTTPS interface on registry2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [01:30:56] !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host backup1012.eqiad.wmnet with OS bookworm [01:31:42] RECOVERY - Docker registry HTTPS interface on registry2005 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.261 second response time https://wikitech.wikimedia.org/wiki/Docker [01:36:25] RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:37:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:44:52] FIRING: [27x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:50:08] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 859293392 and 46 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:52:49] (PS1) Krinkle: varnish: Refactor 08-mobile vtc to pair req/resp assertions [puppet] - https://gerrit.wikimedia.org/r/1193285 [01:55:08] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:56:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [01:59:05] (PS2) Krinkle: varnish: Refactor 08-mobile vtc to pair req/resp assertions [puppet] - https://gerrit.wikimedia.org/r/1193285 [02:13:11] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:13:54] (PS1) Krinkle: [WIP] varnish: misc VTC quality of life improvements [puppet] - https://gerrit.wikimedia.org/r/1193287 [02:14:15] (PS2) Krinkle: varnish: misc VTC quality of life improvements [puppet] - https://gerrit.wikimedia.org/r/1193287 [02:14:52] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [02:17:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [02:24:32] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration 
https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [02:24:52] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [02:36:06] PROBLEM - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1235 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 Failed : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [02:36:07] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1235 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 Failed : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T406293 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [02:36:16] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1235 - https://phabricator.wikimedia.org/T406293 (ops-monitoring-bot) NEW [02:42:41] (PS1) Tim Starling: Fallback to first result row if none in baselang is found [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1193291 (https://phabricator.wikimedia.org/T406196) [03:15:21] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1193291 (https://phabricator.wikimedia.org/T406196) (owner: Tim Starling) [03:15:22] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [extensions/CommunityRequests] 
(wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1193281 (https://phabricator.wikimedia.org/T406192) (owner: Tim Starling) [03:16:45] (Merged) jenkins-bot: Fallback to first result row if none in baselang is found [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1193291 (https://phabricator.wikimedia.org/T406196) (owner: Tim Starling) [03:24:15] fceratto@cumin1002 clone_es (PID 248426) is awaiting input [03:24:15] (Merged) jenkins-bot: Ensure linkUpdateComplete handler is only run for entities [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1193281 (https://phabricator.wikimedia.org/T406192) (owner: Tim Starling) [03:24:48] !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1193291|Fallback to first result row if none in baselang is found (T406196)]], [[gerrit:1193281|Ensure linkUpdateComplete handler is only run for entities (T406192)]] [03:24:53] T406196: PHP Warning: Undefined array key "en" - https://phabricator.wikimedia.org/T406196 [03:24:54] T406192: UnhandledMatchError: Unhandled match case '' - https://phabricator.wikimedia.org/T406192 [03:30:46] !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1193291|Fallback to first result row if none in baselang is found (T406196)]], [[gerrit:1193281|Ensure linkUpdateComplete handler is only run for entities (T406192)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[03:30:51] T406196: PHP Warning: Undefined array key "en" - https://phabricator.wikimedia.org/T406196 [03:30:51] T406192: UnhandledMatchError: Unhandled match case '' - https://phabricator.wikimedia.org/T406192 [03:31:08] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 450107704 and 39 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [03:31:39] !log tstarling@deploy2002 tstarling: Continuing with sync [03:33:08] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 72 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [03:36:03] !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193291|Fallback to first result row if none in baselang is found (T406196)]], [[gerrit:1193281|Ensure linkUpdateComplete handler is only run for entities (T406192)]] (duration: 11m 15s) [03:36:10] T406196: PHP Warning: Undefined array key "en" - https://phabricator.wikimedia.org/T406196 [03:36:10] T406192: UnhandledMatchError: Unhandled match case '' - https://phabricator.wikimedia.org/T406192 [03:39:52] FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:32:59] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host backup1012.eqiad.wmnet with OS bookworm [04:37:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [04:38:14] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [04:40:55] !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host backup1012.eqiad.wmnet with OS bookworm [04:41:31] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host backup1012.eqiad.wmnet with OS bookworm [04:43:14] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [04:47:37] !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host backup1012.eqiad.wmnet with OS bookworm [04:47:40] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Backup, DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#11239402 (jhathaway) @jcrespo it took me a bit of time to coerce the box back into bios mode. I then tried reimaging with bookworm, but the raid... 
[04:48:14] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [04:53:14] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:02:43] FIRING: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:02:48] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:02:58] FIRING: [21x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:09:11] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:13:14] RESOLVED: [2x] 
RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:34:52] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:39:11] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:41:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:42:55] (PS3) Krinkle: varnish: Refactor 08-mobile vtc to pair req/resp assertions [puppet] - https://gerrit.wikimedia.org/r/1193285 [05:44:52] FIRING: [27x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:53:39] (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1193089 (owner: L10n-bot) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251003T0600) [06:00:45] (CR) Tiziano Fogli: [C:+2] metamonitoring: replace Gunicorn with uWSGI [puppet] - https://gerrit.wikimedia.org/r/1193109 (https://phabricator.wikimedia.org/T397003) (owner: Tiziano Fogli) [06:02:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at codfw: 21.89% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:06:51] (CR) Slyngshede: [C:+1] Add Gabriele Modena (gmodena) to wdqs-roots, wdqs-admins groups [puppet] - https://gerrit.wikimedia.org/r/1193211 (https://phabricator.wikimedia.org/T404161) (owner: Bking) [06:07:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at codfw: 21.52% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:13:12] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:14:52] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - 
https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [06:20:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:24:52] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [06:25:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:31:35] (PS9) Slyngshede: P:cache::haproxy copy private repo data [puppet] - https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) [06:34:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at codfw: 7.646% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:39:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at codfw: 7.646% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:40:09] SRE, Bitu, Infrastructure-Foundations, LDAP-Access-Requests: Disable BarryTheBrowserTestBot LDAP account - https://phabricator.wikimedia.org/T388662#11239477 (hashar) As part of this task I found out that the logic to ban an account in Gerrit was not migrated from the wikitech-l MediaWiki hoo... [06:40:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:43:49] SRE, Bitu, Infrastructure-Foundations, LDAP-Access-Requests: Disable BarryTheBrowserTestBot LDAP account - https://phabricator.wikimedia.org/T388662#11239481 (hashar) I can confirm the {nav BarryTheBrowserTestBot} account is now inactive in Gerrit. From the {nav All-Users.git} database: ` $ g... [06:44:35] (CR) Jelto: [C:+1] "lgtm, one nit in-line!" 
[dns] - https://gerrit.wikimedia.org/r/1193082 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb) [06:45:27] (CR) Slyngshede: P:cache::haproxy copy private repo data (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede) [06:51:54] (CR) Filippo Giunchedi: [C:+1] P:toolforge::k8s::haproxy: Prefer IPv4 for backend nodes [puppet] - https://gerrit.wikimedia.org/r/1193164 (https://phabricator.wikimedia.org/T405078) (owner: Majavah) [06:52:34] (CR) Majavah: [V:+1 C:+2] P:toolforge::k8s::haproxy: Prefer IPv4 for backend nodes [puppet] - https://gerrit.wikimedia.org/r/1193164 (https://phabricator.wikimedia.org/T405078) (owner: Majavah) [06:58:18] (PS1) Majavah: P:toolforge::k8s::haproxy: Add resolver config to api-gateway-tcp [puppet] - https://gerrit.wikimedia.org/r/1193310 (https://phabricator.wikimedia.org/T405078) [07:00:06] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251003T0700) [07:01:26] (CR) Majavah: [C:+2] P:toolforge::k8s::haproxy: Add resolver config to api-gateway-tcp [puppet] - https://gerrit.wikimedia.org/r/1193310 (https://phabricator.wikimedia.org/T405078) (owner: Majavah) [07:08:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at codfw: 0.7282% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:09:05] (CR) Brouberol: [C:+1] Add Gabriele Modena (gmodena) to wdqs-roots, wdqs-admins groups [puppet] - https://gerrit.wikimedia.org/r/1193211 (https://phabricator.wikimedia.org/T404161) (owner: Bking) [07:09:11] FIRING: [27x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:09:39] (CR) Brouberol: [C:+2] Add Gabriele Modena (gmodena) to wdqs-roots, wdqs-admins groups [puppet] - https://gerrit.wikimedia.org/r/1193211 (https://phabricator.wikimedia.org/T404161) (owner: Bking) [07:12:12] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2049.codfw.wmnet'] [07:12:14] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2049.codfw.wmnet'] [07:12:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:12:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:14:11] FIRING: ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-int:4446 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:14:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [07:15:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [07:16:07] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2049.codfw.wmnet'] [07:16:07] !incidents [07:16:08] 6826 (UNACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [07:16:09] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2049.codfw.wmnet'] [07:16:16] !ack 6826 [07:16:17] 6826 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [07:16:31] let me check whats going on there [07:16:36] * Emperor here (if only just awake) [07:16:53] looks like esams [07:16:57] FIRING: ProbeDown: 
Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-int:4446 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:17:09] !incidents [07:17:10] 6826 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [07:17:10] 6827 (UNACKED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw) [07:17:14] !ack 6827 [07:17:15] 6827 (ACKED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw) [07:17:16] !ack 6827 [07:17:17] 6827 (ACKED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw) [07:17:43] FIRING: [21x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:17:43] FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [07:17:57] !incidents [07:17:57] 6826 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [07:17:57] 6827 (ACKED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw) [07:17:58] 6828 (UNACKED) VarnishUnavailable global sre (varnish-text thanos-rule) [07:18:02] !ack 6828 [07:18:03] 6828 (ACKED) VarnishUnavailable global sre (varnish-text thanos-rule) [07:19:11] FIRING: [27x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:19:44] hmm mw-api-int is in codfw, so not the "new" wikikube eqiad cluster [07:20:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [07:21:57] RESOLVED: ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-int:4446 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:22:36] FIRING: RESTGatewayBackendErrorsHigh: rest-gateway: high 5xx errors from mw-api-int_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=codfw%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DRESTGatewayBackendErrorsHigh [07:22:46] !incidents [07:22:47] 6826 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [07:22:47] 6828 (ACKED) VarnishUnavailable global sre (varnish-text thanos-rule) [07:22:47] 6829 (UNACKED) RESTGatewayBackendErrorsHigh sre (mw-api-int_cluster rest-gateway codfw) [07:22:47] 6827 (RESOLVED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw) [07:22:52] !ack 6829 [07:22:52] 6829 (ACKED) RESTGatewayBackendErrorsHigh sre (mw-api-int_cluster rest-gateway codfw) [07:22:57] FIRING: ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-int:4446 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:23:09] !incidents [07:23:10] 6826 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [07:23:10] 6828 (ACKED) VarnishUnavailable global sre (varnish-text thanos-rule) [07:23:10] 6829 (ACKED) RESTGatewayBackendErrorsHigh sre (mw-api-int_cluster rest-gateway codfw) [07:23:10] 6830 (UNACKED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw) [07:23:11] 6827 (RESOLVED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw) [07:23:16] !ack 6830 [07:23:17] 6830 (ACKED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw) [07:23:34] (03CR) 10Brouberol: [C:03+1] Add test namespace to ceph tenantNamepsaces dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193190 (https://phabricator.wikimedia.org/T396478) (owner: 10Stevemunene) [07:24:44] (03PS1) 10Stevemunene: Add dummy keytabs for analytics-research on stat servers [labs/private] - 10https://gerrit.wikimedia.org/r/1193314 (https://phabricator.wikimedia.org/T403207) [07:24:51] FIRING: [6x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [07:25:12] (03CR) 10Brouberol: [C:03+1] EventStreamConfig - Enable hive ingestion for eventgate-logging-external based streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192950 (https://phabricator.wikimedia.org/T304373) (owner: 10Ottomata) [07:27:43] RESOLVED: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:27:48] FIRING: [21x] 
CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:27:53] RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [07:29:11] FIRING: [25x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:29:51] FIRING: [6x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [07:32:36] FIRING: [2x] RESTGatewayBackendErrorsHigh: rest-gateway: high 5xx errors from mobileapps_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=codfw%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DRESTGatewayBackendErrorsHigh [07:32:43] FIRING: [21x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:33:43] FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - 
https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [07:33:59] !incidents [07:34:00] 6826 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [07:34:00] 6829 (ACKED) RESTGatewayBackendErrorsHigh sre (mw-api-int_cluster rest-gateway codfw) [07:34:00] 6830 (ACKED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw) [07:34:00] 6831 (UNACKED) VarnishUnavailable global sre (varnish-text thanos-rule) [07:34:00] 6828 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule) [07:34:01] 6827 (RESOLVED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw) [07:34:07] !ack 6831 [07:34:07] 6831 (ACKED) VarnishUnavailable global sre (varnish-text thanos-rule) [07:34:11] FIRING: [25x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:34:49] (03CR) 10Stevemunene: [C:03+2] Add test namespace to ceph tenantNamepsaces dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193190 (https://phabricator.wikimedia.org/T396478) (owner: 10Stevemunene) [07:34:51] FIRING: [6x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [07:37:43] FIRING: [20x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:37:48] FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:37:57] RESOLVED: ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-int:4446 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:38:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:38:43] RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [07:38:53] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2049.codfw.wmnet'] [07:38:55] !incidents [07:38:55] 6826 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [07:38:55] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2049.codfw.wmnet'] [07:38:55] 6829 (ACKED) RESTGatewayBackendErrorsHigh sre (mw-api-int_cluster rest-gateway codfw) [07:38:55] 6831 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule) [07:38:56] 6830 (RESOLVED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw) [07:38:56] 6828 (RESOLVED) VarnishUnavailable global sre 
(varnish-text thanos-rule) [07:38:56] 6827 (RESOLVED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw) [07:39:11] RESOLVED: ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-int:4446 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:39:11] FIRING: [21x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:39:46] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2049.codfw.wmnet'] [07:39:51] RESOLVED: [6x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [07:39:52] FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:40:05] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2049.codfw.wmnet'] [07:40:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [07:41:07] !incidents [07:41:07] 6829 (ACKED) RESTGatewayBackendErrorsHigh sre (mw-api-int_cluster rest-gateway codfw) [07:41:08] 6826 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [07:41:08] 6831 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule) [07:41:08] 6830 
(RESOLVED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw) [07:41:08] 6828 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule) [07:41:08] 6827 (RESOLVED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw) [07:42:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:42:36] RESOLVED: [2x] RESTGatewayBackendErrorsHigh: rest-gateway: high 5xx errors from mobileapps_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=codfw%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DRESTGatewayBackendErrorsHigh [07:42:43] FIRING: [20x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:42:58] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2049.codfw.wmnet'] [07:43:12] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2049.codfw.wmnet'] [07:43:24] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2050.codfw.wmnet'] [07:43:39] !incidents [07:43:39] 6829 (RESOLVED) RESTGatewayBackendErrorsHigh sre 
(mw-api-int_cluster rest-gateway codfw) [07:43:40] 6826 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [07:43:40] 6831 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule) [07:43:40] 6830 (RESOLVED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw) [07:43:40] 6828 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule) [07:43:40] 6827 (RESOLVED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw) [07:44:11] FIRING: [19x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:44:15] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2050.codfw.wmnet'] [07:44:19] (03CR) 10Cappybaraa: "Everything is done, please review the patch if it's ready for merge." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [07:44:40] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2051.codfw.wmnet'] [07:45:02] (03Merged) 10jenkins-bot: Add test namespace to ceph tenantNamepsaces dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193190 (https://phabricator.wikimedia.org/T396478) (owner: 10Stevemunene) [07:47:43] FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:49:11] FIRING: [13x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:50:05] (03PS1) 10Majavah: P:toolforge::proxy: Disable connection failure tracking [puppet] - 10https://gerrit.wikimedia.org/r/1193317 (https://phabricator.wikimedia.org/T405078) [07:51:21] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2051.codfw.wmnet'] [07:51:31] (03CR) 10Filippo Giunchedi: [C:03+1] P:toolforge::proxy: Disable connection failure tracking [puppet] - 10https://gerrit.wikimedia.org/r/1193317 (https://phabricator.wikimedia.org/T405078) (owner: 10Majavah) [07:51:34] (03PS1) 10Elukey: cpufrequtils: use restart for cpupower [puppet] - 10https://gerrit.wikimedia.org/r/1193318 (https://phabricator.wikimedia.org/T405891) [07:52:43] FIRING: [16x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - 
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:52:46] (03CR) 10Klausman: "Two small things, other wise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1193133 (https://phabricator.wikimedia.org/T403697) (owner: 10Elukey) [07:52:48] FIRING: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:54:11] RESOLVED: [12x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:56:34] (03CR) 10Elukey: prometheus: update the amd-rocm exporter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1193133 (https://phabricator.wikimedia.org/T403697) (owner: 10Elukey) [07:57:43] FIRING: [14x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:57:43] RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [08:00:39] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool es2028 gradually with 4 steps - Pool es2028.codfw.wmnet in after cloning [08:01:40] (03CR) 10Jon Harald Søby: "Please change the 
commit title like I suggested, and then I think it's ready to go. Good job! 👍" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [08:01:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [08:02:02] !incidents [08:02:02] 6832 (UNACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [08:02:03] 6829 (RESOLVED) RESTGatewayBackendErrorsHigh sre (mw-api-int_cluster rest-gateway codfw) [08:02:03] 6826 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [08:02:03] 6831 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule) [08:02:03] 6830 (RESOLVED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw) [08:02:04] 6828 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule) [08:02:04] 6827 (RESOLVED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw) [08:02:11] !ack 6832 [08:02:11] 6832 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [08:02:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:02:43] RESOLVED: [10x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [08:02:47] (03CR) 10Elukey: [C:03+2] cpufrequtils: use restart for cpupower [puppet] - 10https://gerrit.wikimedia.org/r/1193318 (https://phabricator.wikimedia.org/T405891) (owner: 10Elukey) [08:02:57] FIRING: ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-int:4446 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:03:13] !incidents [08:03:13] 6832 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [08:03:13] 6833 (UNACKED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw) [08:03:14] 6829 (RESOLVED) RESTGatewayBackendErrorsHigh sre (mw-api-int_cluster rest-gateway codfw) [08:03:14] 6826 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [08:03:14] 6831 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule) [08:03:14] 6830 (RESOLVED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw) [08:03:14] 6828 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule) [08:03:15] 6827 (RESOLVED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw) [08:03:23] !ack 6833 [08:03:24] 6833 (ACKED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw) [08:04:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:04:51] FIRING: 
SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:05:22] (03PS2) 10Majavah: P:toolforge::proxy: Disable connection failure tracking [puppet] - 10https://gerrit.wikimedia.org/r/1193317 (https://phabricator.wikimedia.org/T405078) [08:05:56] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2051.codfw.wmnet'] [08:06:01] (03PS1) 10Brouberol: wdqs: add Artsdata to allow-list [puppet] - 10https://gerrit.wikimedia.org/r/1193357 (https://phabricator.wikimedia.org/T402905) [08:06:03] (03PS1) 10Brouberol: wdqs: add DBPedia to allow-list [puppet] - 10https://gerrit.wikimedia.org/r/1193358 (https://phabricator.wikimedia.org/T402898) [08:06:05] (03PS1) 10Brouberol: wdqs: add SNARC to allow-list [puppet] - 10https://gerrit.wikimedia.org/r/1193359 (https://phabricator.wikimedia.org/T403018) [08:06:08] (03PS1) 10Brouberol: wdqs: add YAGO to allow-list [puppet] - 10https://gerrit.wikimedia.org/r/1193360 (https://phabricator.wikimedia.org/T402907) [08:06:10] (03PS1) 10Brouberol: wdqs: add DDB-KB to allow-list [puppet] - 10https://gerrit.wikimedia.org/r/1193361 (https://phabricator.wikimedia.org/T402909) [08:06:12] (03PS1) 10Brouberol: wdqs: add RKD schema.org Knowledge Graph to allow-list [puppet] - 10https://gerrit.wikimedia.org/r/1193362 (https://phabricator.wikimedia.org/T401919) [08:06:14] (03PS1) 10Brouberol: wdqs: add beta.sparql.swisslipids.org to allow-list [puppet] - 10https://gerrit.wikimedia.org/r/1193363 (https://phabricator.wikimedia.org/T403384) [08:06:16] (03PS1) 10Brouberol: wdqs: add the HTTPS endpoint of MeSH to allow-list [puppet] - 10https://gerrit.wikimedia.org/r/1193364 (https://phabricator.wikimedia.org/T402899) [08:06:18] (03PS1) 
10Brouberol: wdqs: add NFDI Open Math Research Data to allow-list [puppet] - 10https://gerrit.wikimedia.org/r/1193365 (https://phabricator.wikimedia.org/T403036) [08:06:20] (03PS1) 10Brouberol: wdqs: add Food Standards Agency Codes to allow-list [puppet] - 10https://gerrit.wikimedia.org/r/1193366 (https://phabricator.wikimedia.org/T402908) [08:06:24] (03PS1) 10Brouberol: wdqs: add Rhea to allow-list [puppet] - 10https://gerrit.wikimedia.org/r/1193367 (https://phabricator.wikimedia.org/T402901) [08:06:28] (03PS1) 10Brouberol: wdqs: add HAL to allow-list [puppet] - 10https://gerrit.wikimedia.org/r/1193368 (https://phabricator.wikimedia.org/T339832) [08:06:34] (03CR) 10Brouberol: [C:03+1] Add dummy keytabs for analytics-research on stat servers [labs/private] - 10https://gerrit.wikimedia.org/r/1193314 (https://phabricator.wikimedia.org/T403207) (owner: 10Stevemunene) [08:06:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [08:07:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at codfw: 7.362% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:07:15] (03CR) 10Stevemunene: [V:03+2 C:03+2] Add dummy keytabs for analytics-research on stat servers [labs/private] - 10https://gerrit.wikimedia.org/r/1193314 (https://phabricator.wikimedia.org/T403207) (owner: 10Stevemunene) [08:07:57] RESOLVED: ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-int:4446 
- https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:09:02] (03CR) 10Stevemunene: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1193117 (https://phabricator.wikimedia.org/T403207) (owner: 10Stevemunene) [08:09:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:09:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:16:53] (03PS1) 10Stevemunene: Fix typo in cephfs values namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193369 (https://phabricator.wikimedia.org/T396478) [08:18:50] (03CR) 10Filippo Giunchedi: [C:03+1] P:toolforge::proxy: Disable connection failure tracking [puppet] - 10https://gerrit.wikimedia.org/r/1193317 (https://phabricator.wikimedia.org/T405078) (owner: 10Majavah) [08:19:09] (03CR) 10Majavah: [C:03+2] P:toolforge::proxy: Disable connection failure tracking [puppet] - 10https://gerrit.wikimedia.org/r/1193317 (https://phabricator.wikimedia.org/T405078) (owner: 10Majavah) [08:21:17] (03CR) 10Brouberol: [C:03+2] wdqs: add Artsdata to allow-list [puppet] - 10https://gerrit.wikimedia.org/r/1193357 (https://phabricator.wikimedia.org/T402905) (owner: 10Brouberol) [08:21:19] (03CR) 10Brouberol: [C:03+2] wdqs: add DBPedia to allow-list 
[puppet] - 10https://gerrit.wikimedia.org/r/1193358 (https://phabricator.wikimedia.org/T402898) (owner: 10Brouberol) [08:21:22] (03CR) 10Brouberol: [C:03+2] wdqs: add SNARC to allow-list [puppet] - 10https://gerrit.wikimedia.org/r/1193359 (https://phabricator.wikimedia.org/T403018) (owner: 10Brouberol) [08:21:25] (03CR) 10Brouberol: [C:03+2] wdqs: add YAGO to allow-list [puppet] - 10https://gerrit.wikimedia.org/r/1193360 (https://phabricator.wikimedia.org/T402907) (owner: 10Brouberol) [08:21:27] (03CR) 10Brouberol: [C:03+2] wdqs: add DDB-KB to allow-list [puppet] - 10https://gerrit.wikimedia.org/r/1193361 (https://phabricator.wikimedia.org/T402909) (owner: 10Brouberol) [08:21:30] (03CR) 10Brouberol: [C:03+2] wdqs: add RKD schema.org Knowledge Graph to allow-list [puppet] - 10https://gerrit.wikimedia.org/r/1193362 (https://phabricator.wikimedia.org/T401919) (owner: 10Brouberol) [08:21:32] (03CR) 10Brouberol: [C:03+2] wdqs: add beta.sparql.swisslipids.org to allow-list [puppet] - 10https://gerrit.wikimedia.org/r/1193363 (https://phabricator.wikimedia.org/T403384) (owner: 10Brouberol) [08:21:35] (03CR) 10Brouberol: [C:03+2] wdqs: add the HTTPS endpoint of MeSH to allow-list [puppet] - 10https://gerrit.wikimedia.org/r/1193364 (https://phabricator.wikimedia.org/T402899) (owner: 10Brouberol) [08:21:37] (03CR) 10Brouberol: [C:03+2] wdqs: add NFDI Open Math Research Data to allow-list [puppet] - 10https://gerrit.wikimedia.org/r/1193365 (https://phabricator.wikimedia.org/T403036) (owner: 10Brouberol) [08:21:40] (03CR) 10Brouberol: [C:03+2] wdqs: add Food Standards Agency Codes to allow-list [puppet] - 10https://gerrit.wikimedia.org/r/1193366 (https://phabricator.wikimedia.org/T402908) (owner: 10Brouberol) [08:21:44] (03CR) 10Brouberol: [C:03+2] wdqs: add Rhea to allow-list [puppet] - 10https://gerrit.wikimedia.org/r/1193367 (https://phabricator.wikimedia.org/T402901) (owner: 10Brouberol) [08:21:46] (03CR) 10Brouberol: [C:03+2] wdqs: add HAL to allow-list [puppet] - 
10https://gerrit.wikimedia.org/r/1193368 (https://phabricator.wikimedia.org/T339832) (owner: 10Brouberol) [08:21:55] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2051.codfw.wmnet'] [08:23:03] (03CR) 10Brouberol: [C:03+1] Fix typo in cephfs values namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193369 (https://phabricator.wikimedia.org/T396478) (owner: 10Stevemunene) [08:24:29] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2051.codfw.wmnet'] [08:25:00] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2051.codfw.wmnet'] [08:25:14] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2053.codfw.wmnet'] [08:26:15] (03CR) 10Stevemunene: [C:03+2] Fix typo in cephfs values namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193369 (https://phabricator.wikimedia.org/T396478) (owner: 10Stevemunene) [08:27:11] (03PS8) 10Cappybaraa: Change Portal talk namespace name for diqwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) [08:27:14] (03PS2) 10Stevemunene: Add the analytics-research keytab to the stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/1193117 (https://phabricator.wikimedia.org/T403207) [08:28:04] (03CR) 10Stevemunene: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1193117 (https://phabricator.wikimedia.org/T403207) (owner: 10Stevemunene) [08:28:19] (03CR) 10Klausman: prometheus: update the amd-rocm exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193133 (https://phabricator.wikimedia.org/T403697) (owner: 10Elukey) [08:29:25] !log brouberol@cumin1003 START - Cookbook sre.wdqs.restart [08:30:01] (03PS3) 10Elukey: prometheus: update the amd-rocm exporter [puppet] - 
10https://gerrit.wikimedia.org/r/1193133 (https://phabricator.wikimedia.org/T403697) [08:30:16] (03CR) 10Stevemunene: [C:03+1] Bump the image for spark-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193143 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [08:30:59] (03PS1) 10Jcrespo: installserver: Revert backup1012 to manual setup [puppet] - 10https://gerrit.wikimedia.org/r/1193370 (https://phabricator.wikimedia.org/T371416) [08:31:55] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2053.codfw.wmnet'] [08:33:18] (03CR) 10Btullis: [C:03+2] Bump the image for spark-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193143 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [08:33:28] (03Merged) 10jenkins-bot: Fix typo in cephfs values namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193369 (https://phabricator.wikimedia.org/T396478) (owner: 10Stevemunene) [08:33:42] !log brouberol@cumin1003 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [08:33:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:33:48] (03CR) 10Jon Harald Søby: [C:03+1] "Excellent, thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [08:35:34] (03PS4) 10Krinkle: varnish: Refactor 08-mobile vtc to pair req/resp assertions [puppet] - 10https://gerrit.wikimedia.org/r/1193285 [08:35:34] (03PS1) 10Krinkle: varnish: misc VTC quality of life improvements [puppet] - 10https://gerrit.wikimedia.org/r/1193371 [08:36:16] (03CR) 10Jcrespo: [C:03+2] installserver: Revert backup1012 to manual setup [puppet] - 10https://gerrit.wikimedia.org/r/1193370 (https://phabricator.wikimedia.org/T371416) (owner: 10Jcrespo) [08:36:26] (03CR) 10CI reject: [V:04-1] varnish: misc VTC quality of life improvements [puppet] - 10https://gerrit.wikimedia.org/r/1193371 (owner: 10Krinkle) [08:38:40] (03PS3) 10Krinkle: varnish: misc VTC quality of life improvements [puppet] - 10https://gerrit.wikimedia.org/r/1193287 [08:38:41] (03PS5) 10Krinkle: varnish: Refactor 08-mobile vtc to pair req/resp assertions [puppet] - 10https://gerrit.wikimedia.org/r/1193285 [08:38:56] (03Abandoned) 10Krinkle: varnish: misc VTC quality of life improvements [puppet] - 10https://gerrit.wikimedia.org/r/1193371 (owner: 10Krinkle) [08:40:35] (03Merged) 10jenkins-bot: Bump the image for spark-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193143 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [08:43:04] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2053.codfw.wmnet'] [08:43:53] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:44:47] !log stevemunene@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [08:44:49] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[08:46:05] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2028 gradually with 4 steps - Pool es2028.codfw.wmnet in after cloning [08:46:06] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.clone_es (exit_code=0) of es2028.codfw.wmnet onto es2051.codfw.wmnet [08:46:17] !log stevemunene@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [08:51:45] (03PS2) 10Federico Ceratto: es2051.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1193058 [08:51:50] (03PS2) 10Federico Ceratto: instances.yaml: add es2051 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1193059 (https://phabricator.wikimedia.org/T402859) [08:51:59] (03CR) 10Cathal Mooney: [C:03+2] cr1-eqiad: add BGP to ssw1-d1-eqiad spine [homer/public] - 10https://gerrit.wikimedia.org/r/1193146 (https://phabricator.wikimedia.org/T402588) (owner: 10Cathal Mooney) [08:53:20] (03PS6) 10Krinkle: varnish: Refactor 08-mobile vtc to pair req/resp assertions [puppet] - 10https://gerrit.wikimedia.org/r/1193285 [08:53:25] (03PS9) 10Krinkle: varnish: Enable unified mobile routing on all except en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192271 (https://phabricator.wikimedia.org/T403510) [08:53:40] (03Merged) 10jenkins-bot: cr1-eqiad: add BGP to ssw1-d1-eqiad spine [homer/public] - 10https://gerrit.wikimedia.org/r/1193146 (https://phabricator.wikimedia.org/T402588) (owner: 10Cathal Mooney) [08:57:26] (03PS4) 10Krinkle: varnish: misc VTC quality of life improvements [puppet] - 10https://gerrit.wikimedia.org/r/1193287 [08:57:27] (03PS7) 10Krinkle: varnish: Refactor 08-mobile vtc to pair req/resp assertions [puppet] - 10https://gerrit.wikimedia.org/r/1193285 [08:57:27] (03PS10) 10Krinkle: varnish: Enable unified mobile routing on all except en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192271 (https://phabricator.wikimedia.org/T403510) [08:57:50] (03CR) 10Btullis: [C:03+1] Add the analytics-research keytab to the stat 
boxes [puppet] - 10https://gerrit.wikimedia.org/r/1193117 (https://phabricator.wikimedia.org/T403207) (owner: 10Stevemunene) [08:59:00] (03CR) 10Stevemunene: [C:03+2] Add the analytics-research keytab to the stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/1193117 (https://phabricator.wikimedia.org/T403207) (owner: 10Stevemunene) [08:59:51] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2053.codfw.wmnet'] [09:04:44] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2054.codfw.wmnet'] [09:06:22] (03CR) 10Klausman: [C:03+1] prometheus: update the amd-rocm exporter [puppet] - 10https://gerrit.wikimedia.org/r/1193133 (https://phabricator.wikimedia.org/T403697) (owner: 10Elukey) [09:07:01] !log jynus@cumin1003 START - Cookbook sre.hosts.reimage for host backup1012.eqiad.wmnet with OS bookworm [09:07:18] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#11239841 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1003 for host backup1012.eqiad.wmnet with OS bookworm [09:11:55] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2054.codfw.wmnet'] [09:18:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:21:03] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2054.codfw.wmnet'] [09:22:46] (03PS2) 10Elukey: redfish: allow HTTP 204 responses in poll_task [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1193046 (https://phabricator.wikimedia.org/T392851) [09:24:16] (03CR) 10Elukey: redfish: allow HTTP 204 responses in poll_task (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1193046 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [09:27:29] (03CR) 10Vgutierrez: [C:03+1] "looking good" [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [09:27:51] !log jynus@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1012.eqiad.wmnet with reason: host reimage [09:31:53] (03CR) 10Elukey: [C:03+2] prometheus: update the amd-rocm exporter [puppet] - 10https://gerrit.wikimedia.org/r/1193133 (https://phabricator.wikimedia.org/T403697) (owner: 10Elukey) [09:33:33] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1012.eqiad.wmnet with reason: host reimage [09:38:36] (03PS1) 10Brouberol: wdqs: fix federation URLs and add sparql.swisslipids.org to allow-list [puppet] - 10https://gerrit.wikimedia.org/r/1193379 (https://phabricator.wikimedia.org/T403384) [09:39:48] (03PS2) 10Brouberol: wdqs: fix federation URLs and add sparql.swisslipids.org to allow-list [puppet] - 10https://gerrit.wikimedia.org/r/1193379 (https://phabricator.wikimedia.org/T403384) [09:40:07] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2054.codfw.wmnet'] [09:40:40] (03CR) 10Brouberol: [C:03+2] wdqs: fix federation URLs and add sparql.swisslipids.org to allow-list [puppet] - 10https://gerrit.wikimedia.org/r/1193379 (https://phabricator.wikimedia.org/T403384) (owner: 10Brouberol) [09:41:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:41:56] (03PS1) 10Elukey: profile::amd_gpu: use a more generic path for amd-smi [puppet] - 10https://gerrit.wikimedia.org/r/1193380 [09:44:00] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2056.codfw.wmnet'] [09:44:05] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2056.codfw.wmnet'] [09:44:43] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2057.codfw.wmnet'] [09:48:49] !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts druid1007.eqiad.wmnet [09:52:01] (03PS2) 10Tiziano Fogli: metamonitoring: cleanup gunicorn related resources [puppet] - 10https://gerrit.wikimedia.org/r/1193373 (https://phabricator.wikimedia.org/T397003) [09:52:02] (03CR) 10Tiziano Fogli: [C:03+2] "Tested on Pontoon." [puppet] - 10https://gerrit.wikimedia.org/r/1193373 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [09:52:28] (03PS7) 10Elukey: WIP: sre.hardware.upgrade-firmware: add support for IDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1192898 [09:53:15] (03CR) 10Elukey: [C:03+2] profile::amd_gpu: use a more generic path for amd-smi [puppet] - 10https://gerrit.wikimedia.org/r/1193380 (owner: 10Elukey) [09:55:04] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1012.eqiad.wmnet with OS bookworm [09:55:15] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#11240370 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1003 for host backup1012.eqiad.wmnet with OS bookworm co... 
[09:56:06] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [09:57:59] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2057.codfw.wmnet'] [09:58:12] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:59:42] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: druid1007.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [09:59:50] (03PS3) 10Tiziano Fogli: metamonitoring: rename uwsgi resource [puppet] - 10https://gerrit.wikimedia.org/r/1193374 (https://phabricator.wikimedia.org/T397003) [09:59:50] (03CR) 10Tiziano Fogli: [C:03+2] "Tested on Pontoon." [puppet] - 10https://gerrit.wikimedia.org/r/1193374 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [10:00:55] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: druid1007.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [10:00:55] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:00:56] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts druid1007.eqiad.wmnet [10:01:15] !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts druid1008.eqiad.wmnet [10:02:52] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [10:03:01] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#11240391 (10jcrespo) It is normal that the recipe was not working, not 
only the logical configuration was destroyed, the RAID was not setup, too, s... [10:05:41] (03PS3) 10Tiziano Fogli: metamonitoring: prepare deadmanswitchamhook's Gunicorn replacement with uWSGI [puppet] - 10https://gerrit.wikimedia.org/r/1193375 (https://phabricator.wikimedia.org/T397003) [10:05:41] (03CR) 10Tiziano Fogli: [C:03+2] "Tested on Pontoon." [puppet] - 10https://gerrit.wikimedia.org/r/1193375 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [10:08:12] RESOLVED: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:08:22] cmooney@cumin1003 netbox (PID 1080258) is awaiting input [10:09:26] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [10:12:14] (03PS5) 10Jcrespo: backup: Setup backup1012 & backup2012 as new repo dedicated storages [puppet] - 10https://gerrit.wikimedia.org/r/1193083 (https://phabricator.wikimedia.org/T403946) [10:12:21] (03CR) 10Jcrespo: backup: Setup backup1012 & backup2012 as new repo dedicated storages [puppet] - 10https://gerrit.wikimedia.org/r/1193083 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [10:12:23] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1193083 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [10:12:59] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new dns names for cr2-eqiad et-1/0/5.100 interface IPs - cmooney@cumin1003" [10:13:16] (03PS3) 10Tiziano Fogli: metamonitoring: replace deadmanswitchamhook's Gunicorn with uWSGI [puppet] - 10https://gerrit.wikimedia.org/r/1193376 (https://phabricator.wikimedia.org/T397003) [10:13:16] (03CR) 10Tiziano Fogli: [C:03+2] "Tested on Pontoon." 
[puppet] - 10https://gerrit.wikimedia.org/r/1193376 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [10:14:27] !log btullis@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:14:28] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts druid1008.eqiad.wmnet [10:14:41] !log drain transport circuits on PIC 1/0 of cr2-eqiad to allow for card reboot T402588 [10:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:44] T402588: Eqiad: row C/D switch refresh configuration task - https://phabricator.wikimedia.org/T402588 [10:14:52] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [10:14:56] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [10:15:08] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new dns names for cr2-eqiad et-1/0/5.100 interface IPs - cmooney@cumin1003" [10:15:08] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:17:27] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:18:31] (03PS2) 10Tiziano Fogli: metamonitoring: add env vars to uwsgi process [puppet] - 10https://gerrit.wikimedia.org/r/1193381 (https://phabricator.wikimedia.org/T397003) [10:18:31] (03CR) 10Tiziano Fogli: [C:03+2] "Tested on Pontoon." 
[puppet] - 10https://gerrit.wikimedia.org/r/1193381 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [10:19:11] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [10:21:06] !log drain traffic from cr2-codfw <-> ssw1-f1-codfw link to allow for cr2-codfw card reset T402588 [10:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:09] T402588: Eqiad: row C/D switch refresh configuration task - https://phabricator.wikimedia.org/T402588 [10:22:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqord and cr1-eqiad (208.80.154.196) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqord:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr1-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:24:01] (03PS2) 10Tiziano Fogli: metamonitoring: cleanup unneeded env files [puppet] - 10https://gerrit.wikimedia.org/r/1193382 (https://phabricator.wikimedia.org/T397003) [10:24:01] (03CR) 10Tiziano Fogli: [C:03+2] "Tested on Pontoon." 
[puppet] - 10https://gerrit.wikimedia.org/r/1193382 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [10:24:53] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:27:00] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on cr[1-2]-eqiad,cr2-eqord,cr1-magru,ssw1-f1-eqiad with reason: reset PIC 0/1 in cr2 to set port 5 speed [10:27:10] 06SRE, 06Infrastructure-Foundations, 10netops: Eqiad: row C/D switch refresh configuration task - https://phabricator.wikimedia.org/T402588#11240440 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=da4ec6fc-e51f-4967-b0f1-8ef51813239b) set by cmooney@cumin1003 for 0:10:00 on 5 host(s) and... [10:27:42] !log reset PIC 1/0 on cr2-eqiad to configure port 5 speed T402588 [10:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:45] T402588: Eqiad: row C/D switch refresh configuration task - https://phabricator.wikimedia.org/T402588 [10:28:35] (03PS1) 10Hnowlan: rest-gateway: use mw-api-ext rather than mw-api-int for all APIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193389 (https://phabricator.wikimedia.org/T401396) [10:29:37] (03PS2) 10Hnowlan: rest-gateway: use mw-api-ext rather than mw-api-int for all APIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193389 (https://phabricator.wikimedia.org/T401396) [10:32:31] (03PS1) 10Btullis: spark-operator: Use a self-signed certificate for the webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193390 (https://phabricator.wikimedia.org/T405490) [10:35:28] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11240462 (10cmooney) >>! 
In T396065#11238854, @VRiley-WMF wrote: > Finished up ssw1-d8-eqiad and has been connected aside from ssw1-e1-eqiad ,ssw1-f1-eqiad Thanks @VRiley-WMF, I can confirm a... [10:35:37] (03PS1) 10Gkyziridis: ml-services: Deploy enwiki-goodfaith on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193391 (https://phabricator.wikimedia.org/T403236) [10:37:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:38:39] 06SRE, 10SRE-Access-Requests: Requesting access to restricted for AramilFeraxa - https://phabricator.wikimedia.org/T405796#11240477 (10FCeratto-WMF) @MKopec confirmed both public SSH key and username over Slack. [10:39:01] 06SRE, 10SRE-Access-Requests: Requesting access to restricted for AramilFeraxa - https://phabricator.wikimedia.org/T405796#11240478 (10FCeratto-WMF) [10:39:11] RESOLVED: NetworkDeviceAlarmActive: Alarm active on cr2-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [10:41:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:et-1/0/5 (Core: ssw1-d8-eqiad:ethernet-1/32 {#B00392}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:42:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:45:42] (03CR) 10Kevin Bazira: [C:03+1] "Thank you for working on this, George. LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193391 (https://phabricator.wikimedia.org/T403236) (owner: 10Gkyziridis) [10:46:55] (03CR) 10Btullis: [C:03+2] spark-operator: Use a self-signed certificate for the webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193390 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [10:51:31] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1193083 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [10:53:56] (03Merged) 10jenkins-bot: spark-operator: Use a self-signed certificate for the webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193390 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251003T0700) [11:00:05] jelto, arnoldokoth, and mutante: Time to snap out of that daydream and deploy GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251003T1100). [11:02:59] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy enwiki-goodfaith on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193391 (https://phabricator.wikimedia.org/T403236) (owner: 10Gkyziridis) [11:04:10] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task ssw1-d1-eqiad - https://phabricator.wikimedia.org/T401238#11240548 (10cmooney) @Jclark-ctr there are also two cables mixed up on ssw1-d1-eqiad. Port 12 is connected to lsw1-d6-eqiad, and port 14 is connected to lsw1-d4-eqiad. Should... [11:04:33] (03Merged) 10jenkins-bot: ml-services: Deploy enwiki-goodfaith on staging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1193391 (https://phabricator.wikimedia.org/T403236) (owner: 10Gkyziridis) [11:05:36] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:06:21] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:15:56] !log gkyziridis@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [11:18:09] (03CR) 10Stevemunene: [C:03+2] remove mention of druid10[07-08] in puppet [puppet] - 10https://gerrit.wikimedia.org/r/1192147 (https://phabricator.wikimedia.org/T403801) (owner: 10Stevemunene) [11:29:18] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task ssw1-d1-eqiad - https://phabricator.wikimedia.org/T401238#11240608 (10Jclark-ctr) Sorry about that i had it connected to port 8 >>! In T401238#11237306, @cmooney wrote: > Hey @Jclark-ctr > > Thanks for connecting the optics. I s... [11:35:16] (03CR) 10Brouberol: Define airflow-wikidata airflow instance (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190975 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [11:37:20] (03PS1) 10Gkyziridis: ml-services: Remove zhwiki revscoring-editquality-goodfaith from staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193398 (https://phabricator.wikimedia.org/T403236) [11:39:52] FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:40:11] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task ssw1-d1-eqiad - https://phabricator.wikimedia.org/T401238#11240667 (10Jclark-ctr) 05Open→03Resolved Resolved all other cabling issues [11:42:22] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task ssw1-d1-eqiad - https://phabricator.wikimedia.org/T401238#11240672 (10cmooney) Thanks! 
All looks good yep. ` | ethernet-1/2 | enable | up | 100G | 100G CWDM4 MSA with FEC | Core: lsw1-c2-eqiad:ethernet-1/56... [11:45:21] (03CR) 10Kevin Bazira: [C:03+1] ml-services: Remove zhwiki revscoring-editquality-goodfaith from staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193398 (https://phabricator.wikimedia.org/T403236) (owner: 10Gkyziridis) [11:50:30] 10SRE-swift-storage, 06Commons: [[commons:File:Things near the Nautical Museum of Litochoro 10.jpg]] only present in codfw - https://phabricator.wikimedia.org/T406246#11240699 (10MatthewVernon) As expected from the report, the object is in codfw, but not eqiad: ` root@ms-fe1009:~# swift stat wikipedia-commons-... [11:52:55] (03CR) 10Gkyziridis: [C:03+2] ml-services: Remove zhwiki revscoring-editquality-goodfaith from staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193398 (https://phabricator.wikimedia.org/T403236) (owner: 10Gkyziridis) [11:54:41] (03Merged) 10jenkins-bot: ml-services: Remove zhwiki revscoring-editquality-goodfaith from staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193398 (https://phabricator.wikimedia.org/T403236) (owner: 10Gkyziridis) [11:57:06] !log gkyziridis@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [12:07:21] (03PS1) 10Btullis: Bump the version of the spark-operator image used. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193401 (https://phabricator.wikimedia.org/T405490) [12:11:07] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [12:11:24] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [12:11:37] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [12:11:44] (03CR) 10Brouberol: [C:03+1] Bump the version of the spark-operator image used. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1193401 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [12:12:18] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [12:15:19] (03CR) 10Btullis: [C:03+2] Bump the version of the spark-operator image used. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193401 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [12:16:24] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [12:16:42] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [12:18:27] (03PS1) 10Brouberol: Redirect helmfile_admin_ng_pending_changes alerts for both dse-k8s-eqiad/codfw to dpe sre [alerts] - 10https://gerrit.wikimedia.org/r/1193406 [12:22:07] (03PS2) 10Brouberol: Redirect helmfile_admin_ng_pending_changes alerts for both dse-k8s-eqiad/codfw to dpe sre [alerts] - 10https://gerrit.wikimedia.org/r/1193406 (https://phabricator.wikimedia.org/T396478) [12:22:36] (03Merged) 10jenkins-bot: Bump the version of the spark-operator image used. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193401 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [12:23:24] (03CR) 10Btullis: [C:03+1] Redirect helmfile_admin_ng_pending_changes alerts for both dse-k8s-eqiad/codfw to dpe sre [alerts] - 10https://gerrit.wikimedia.org/r/1193406 (https://phabricator.wikimedia.org/T396478) (owner: 10Brouberol) [12:23:53] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[12:26:27] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192298 (owner: 10PipelineBot) [12:26:34] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192297 (owner: 10PipelineBot) [12:27:30] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1193083 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [12:30:07] (03CR) 10Brouberol: [C:03+2] Redirect helmfile_admin_ng_pending_changes alerts for both dse-k8s-eqiad/codfw to dpe sre [alerts] - 10https://gerrit.wikimedia.org/r/1193406 (https://phabricator.wikimedia.org/T396478) (owner: 10Brouberol) [12:31:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11240827 (10Jclark-ctr) I have pulled 2 Servers from decom i have wiped all configs from them and racked and cabled sretest1005 Rack... [12:32:15] (03CR) 10Jcrespo: [C:03+2] backup: Setup backup1012 & backup2012 as new repo dedicated storages [puppet] - 10https://gerrit.wikimedia.org/r/1193083 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [12:34:16] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:43:06] (03CR) 10Tiziano Fogli: "Here's the output from the Puppet run on phi-titan-01: https://pastebin.com/euC63w5L." 
[puppet] - 10https://gerrit.wikimedia.org/r/1188441 (https://phabricator.wikimedia.org/T406054) (owner: 10Herron) [12:51:05] (03PS1) 10Cathal Mooney: site.pp: modify regex to include sretest1005 and sretest1006 [puppet] - 10https://gerrit.wikimedia.org/r/1193410 (https://phabricator.wikimedia.org/T405560) [12:58:01] (03PS1) 10Jcrespo: backup: Fix typo on new storage config on bacula director [puppet] - 10https://gerrit.wikimedia.org/r/1193411 (https://phabricator.wikimedia.org/T403946) [13:00:46] (03CR) 10Jcrespo: [C:03+2] backup: Fix typo on new storage config on bacula director [puppet] - 10https://gerrit.wikimedia.org/r/1193411 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [13:02:44] !log reedy Deployed security patch for T406322 [13:04:15] (03PS7) 10Jcrespo: bacula: Setup new job config to be able to use repo storages [puppet] - 10https://gerrit.wikimedia.org/r/1193084 (https://phabricator.wikimedia.org/T403946) [13:04:36] (03PS4) 10Jcrespo: backups: Migrate Gerrit and GitLab backups to new storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/1193081 (https://phabricator.wikimedia.org/T403946) [13:05:13] (03PS8) 10Jcrespo: bacula: Setup new job config to be able to use repo storages [puppet] - 10https://gerrit.wikimedia.org/r/1193084 (https://phabricator.wikimedia.org/T403946) [13:05:17] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1193084 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [13:06:53] 06SRE, 10SRE-Access-Requests: Requesting access to Data Platform for a-pizzata - https://phabricator.wikimedia.org/T406328 (10APizzata-WMF) 03NEW [13:06:55] (03PS1) 10Stevemunene: Add the analytics-research-admin group to stat admins [puppet] - 10https://gerrit.wikimedia.org/r/1193412 (https://phabricator.wikimedia.org/T403207) [13:07:33] !log stevemunene@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. 
[13:07:39] (03PS1) 10Majavah: P:toolforge::proxy: Remove unused variable [puppet] - 10https://gerrit.wikimedia.org/r/1193413 [13:08:17] !log stevemunene@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [13:08:52] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:08:59] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7188/console" [puppet] - 10https://gerrit.wikimedia.org/r/1193413 (owner: 10Majavah) [13:09:05] (03CR) 10Jcrespo: [C:03+2] "This is quite terrible, and we need to reorganize director.pp and how it handles bacula abstractions, but I won't refactor anything here, " [puppet] - 10https://gerrit.wikimedia.org/r/1193084 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [13:10:13] (03CR) 10Stevemunene: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1193412 (https://phabricator.wikimedia.org/T403207) (owner: 10Stevemunene) [13:10:57] (03PS2) 10Cathal Mooney: site.pp: modify regex to include sretest1005 and sretest1006 [puppet] - 10https://gerrit.wikimedia.org/r/1193410 (https://phabricator.wikimedia.org/T405560) [13:10:59] 06SRE, 10SRE-Access-Requests: Requesting access to Data Platform for a-pizzata - https://phabricator.wikimedia.org/T406328#11240966 (10APizzata-WMF) [13:11:12] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:15:22] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1235 - https://phabricator.wikimedia.org/T406293#11240982 (10Jclark-ctr) a:03Jclark-ctr Confirmed: Service Request 216687198 was successfully submitted. 
[13:16:07] !log stevemunene@cumin1003 START - Cookbook sre.dns.netbox [13:18:18] !log stevemunene@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:19:13] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), and 2 others: decommission druid100[7-8].eqiad.wmnet - https://phabricator.wikimedia.org/T403801#11241003 (10Stevemunene) a:05Stevemunene→03None [13:19:38] (03PS1) 10Btullis: Bump the spark-operator image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193416 (https://phabricator.wikimedia.org/T405490) [13:20:41] (03CR) 10Filippo Giunchedi: [C:03+1] P:toolforge::proxy: Remove unused variable [puppet] - 10https://gerrit.wikimedia.org/r/1193413 (owner: 10Majavah) [13:20:52] (03CR) 10Herron: [V:03+1] "interesting thanks! yeah weird, the thanos-rule@ file resource isn't even in the catalog. I'll work on sorting that out" [puppet] - 10https://gerrit.wikimedia.org/r/1188441 (https://phabricator.wikimedia.org/T406054) (owner: 10Herron) [13:24:42] (03CR) 10Jcrespo: backups: Migrate Gerrit and GitLab backups to new storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/1193081 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [13:25:54] 06SRE, 10SRE-Access-Requests: Requesting access to Data Platform for JavierMonton - https://phabricator.wikimedia.org/T406331 (10JMonton-WMF) 03NEW [13:26:40] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to Superset for marialechnerwmde - https://phabricator.wikimedia.org/T405917#11241067 (10FCeratto-WMF) [13:26:57] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1193410 (https://phabricator.wikimedia.org/T405560) (owner: 10Cathal Mooney) [13:27:09] (03CR) 10Btullis: [C:03+2] Bump the spark-operator image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193416 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [13:27:25] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: 
Requesting access to Superset for marialechnerwmde - https://phabricator.wikimedia.org/T405917#11241079 (10FCeratto-WMF) [13:27:57] (03CR) 10Cathal Mooney: [C:03+2] site.pp: modify regex to include sretest1005 and sretest1006 [puppet] - 10https://gerrit.wikimedia.org/r/1193410 (https://phabricator.wikimedia.org/T405560) (owner: 10Cathal Mooney) [13:30:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11241087 (10cmooney) >>! In T405560#11240827, @Jclark-ctr wrote: > I have pulled 2 Servers from decom i have wiped all configs from t... [13:33:35] (03CR) 10Cappybaraa: "@ssethi@wikimedia.org Everything is done in this patch can you please give +2 ready for merge." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [13:34:19] (03Merged) 10jenkins-bot: Bump the spark-operator image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193416 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [13:36:53] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:37:37] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:41:41] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:43:15] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge::proxy: Remove unused variable [puppet] - 10https://gerrit.wikimedia.org/r/1193413 (owner: 10Majavah) [13:44:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:46:49] FIRING: [3x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:48:40] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#11241198 (10jhathaway) >>! In T371416#11240391, @jcrespo wrote: > It is normal that the recipe was not working, not only the logical configuration... 
[13:49:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:50:03] ^^ got paged about this one, seems ok though lots of traffic on tcp 443 to/from the box [13:50:16] (03CR) 10Brouberol: [C:03+1] Add the analytics-research-admin group to stat admins [puppet] - 10https://gerrit.wikimedia.org/r/1193412 (https://phabricator.wikimedia.org/T403207) (owner: 10Stevemunene) [13:50:34] (03CR) 10JHathaway: [C:03+1] redfish: allow HTTP 204 responses in poll_task [software/spicerack] - 10https://gerrit.wikimedia.org/r/1193046 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [13:50:46] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#11241205 (10jcrespo) > what steps did you take to re-image it I had to redo the HW RAID, which was missing from the configuration of the host thro... 
[13:51:49] RESOLVED: [3x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:52:21] (03PS1) 10Ssingh: site.pp: add hcaptcha VMs [puppet] - 10https://gerrit.wikimedia.org/r/1193423 (https://phabricator.wikimedia.org/T405631) [13:52:24] ¯\_(ツ)_/¯ [13:53:20] !incidents [13:53:20] 6834 (RESOLVED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) [13:53:21] 6833 (RESOLVED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw) [13:53:21] 6832 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [13:53:21] 6829 (RESOLVED) RESTGatewayBackendErrorsHigh sre (mw-api-int_cluster rest-gateway codfw) [13:53:21] 6826 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [13:53:21] 6831 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule) [13:53:22] 6830 (RESOLVED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw) [13:53:22] 6828 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule) [13:53:23] 6827 (RESOLVED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw) [13:53:31] it resolved itself [13:54:28] yeah, judging from its network graphs there was a big surge in requests [13:54:39] there was related spike in CPU and load went up, but back to normal now [13:55:22] I guess if it happens again we can try to work out where the requests are from [13:57:37] (03CR) 10Stevemunene: [C:03+2] Add the analytics-research-admin group to stat admins [puppet] - 10https://gerrit.wikimedia.org/r/1193412 (https://phabricator.wikimedia.org/T403207) (owner: 10Stevemunene) [13:58:12] (03CR) 10Elukey: [C:03+2] redfish: allow HTTP 204 responses in poll_task 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1193046 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [13:59:45] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#11241233 (10jhathaway) >>! In T371416#11241205, @jcrespo wrote: >> what steps did you take to re-image it > > I had to redo the HW RAID setup, whi... [14:01:02] (03PS24) 10Herron: thanos-rule: add support for multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/1188441 (https://phabricator.wikimedia.org/T406054) [14:01:02] (03CR) 10Herron: "ok this should be sorted out now! looks like override => true caused the unit file to be omitted from the catalog and after removing that " [puppet] - 10https://gerrit.wikimedia.org/r/1188441 (https://phabricator.wikimedia.org/T406054) (owner: 10Herron) [14:02:09] (03PS2) 10Ssingh: site.pp and preseed.yaml: add hcaptcha VMs [puppet] - 10https://gerrit.wikimedia.org/r/1193423 (https://phabricator.wikimedia.org/T405631) [14:03:15] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#11241235 (10jcrespo) >>! In T371416#11241233, @jhathaway wrote: > How did you configure the raid through the mgmt interface? I couldn't figure out... [14:08:05] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#11241245 (10jhathaway) >>! In T371416#11241235, @jcrespo wrote: > The https interface has a terrible GUI a bit hidden between submenus. ugh, I see... 
[14:11:23] (03PS1) 10Jcrespo: gerrit: Test hourly backup migration to the new storage host [puppet] - 10https://gerrit.wikimedia.org/r/1193424 (https://phabricator.wikimedia.org/T403946) [14:14:53] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [14:19:29] 06SRE, 10observability, 06Traffic: HAProxy metrics go down on config reload - https://phabricator.wikimedia.org/T343000#11241282 (10ssingh) Can someone help me double-check if this is still a problem? I don't see it in the dashboards above, selecting a more recent time interval. [14:20:51] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on Wikidata for Firefox (Browser extension) - https://phabricator.wikimedia.org/T398588#11241295 (10ssingh) @Shisma: Hi, this still needs a Wikimedia affiliate approval as per https://lists.wikimedia.org/pipermail/maps-l/2020-August/001729.html. Can you pr... [14:24:52] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:28:18] 06SRE, 10observability, 06Traffic: HAProxy metrics go down on config reload - https://phabricator.wikimedia.org/T343000#11241334 (10Vgutierrez) yes, it's still hapenning https://grafana.wikimedia.org/goto/SHdP6s3HR?orgId=1: {F66723380} I believe this will be fixed when we upgrade to HAProxy 3.0 given it pro... [14:29:31] 06SRE, 10observability, 06Traffic: HAProxy metrics go down on config reload - https://phabricator.wikimedia.org/T343000#11241339 (10ssingh) >>! In T343000#11241334, @Vgutierrez wrote: > yes, it's still hapenning https://grafana.wikimedia.org/goto/SHdP6s3HR?orgId=1: {F66723380} > > I believe this will be fix... 
[14:32:37] (03CR) 10JHathaway: [C:03+2] wikimedia.support: initial mx support [puppet] - 10https://gerrit.wikimedia.org/r/1193183 (https://phabricator.wikimedia.org/T400952) (owner: 10JHathaway) [14:37:14] (03CR) 10JHathaway: wikimedia.support: Rm ncredir, add zendesk records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1192236 (https://phabricator.wikimedia.org/T400952) (owner: 10BCornwall) [14:38:39] (03PS2) 10Jcrespo: gerrit: Test hourly backup migration to the new storage host [puppet] - 10https://gerrit.wikimedia.org/r/1193424 (https://phabricator.wikimedia.org/T403946) [14:38:41] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1193424 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [14:39:27] (03CR) 10Jon Harald Søby: [C:03+1] "+2 will be given as part of the deployment process, see my previous comment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [14:42:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:et-1/0/5 (Core: ssw1-d8-eqiad:ethernet-1/32 {#B00392}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:43:30] (03CR) 10Jcrespo: [C:03+2] gerrit: Test hourly backup migration to the new storage host [puppet] - 10https://gerrit.wikimedia.org/r/1193424 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [14:50:37] (03PS1) 10Jcrespo: Revert "gerrit: Test hourly backup migration to the new storage host" [puppet] - 10https://gerrit.wikimedia.org/r/1193436 [14:54:06] (03CR) 10Jcrespo: [C:03+2] Revert "gerrit: Test hourly backup migration to the new storage host" [puppet] - 10https://gerrit.wikimedia.org/r/1193436 (owner: 10Jcrespo) [14:57:08] (03PS1) 10Elukey: 
profile::thanos: fix xlab SLI's recording rules [puppet] - 10https://gerrit.wikimedia.org/r/1193437 (https://phabricator.wikimedia.org/T398869) [14:59:02] (03PS5) 10Jcrespo: backups: Migrate Gerrit and GitLab backups to new storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/1193081 (https://phabricator.wikimedia.org/T403946) [15:00:40] (03CR) 10Jcrespo: "Consider if you want to remove some yaml file, or move the config there, that's up to you. I have pending doing some cleanup on the backup" [puppet] - 10https://gerrit.wikimedia.org/r/1193081 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [15:01:09] (03CR) 10Jcrespo: "*yaml config key" [puppet] - 10https://gerrit.wikimedia.org/r/1193081 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [15:02:16] (03CR) 10Jcrespo: [C:03+1] backups: Migrate Gerrit and GitLab backups to new storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/1193081 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [15:06:13] (03PS8) 10Elukey: WIP: sre.hardware.upgrade-firmware: add support for IDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1192898 [15:09:11] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:18] 06SRE, 10SRE-Access-Requests: Requesting access to Data Platform for a-pizzata - https://phabricator.wikimedia.org/T406328#11241546 (10Ahoelzl) Approved. [15:10:28] 06SRE, 10SRE-Access-Requests: Requesting access to Data Platform for JavierMonton - https://phabricator.wikimedia.org/T406331#11241548 (10Ahoelzl) Approved. 
[15:27:35] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2057.codfw.wmnet'] [15:28:04] (03PS1) 10Scott French: Fix pending form field preservation on validation failure [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1193441 [15:29:23] (03CR) 10Scott French: [V:03+2] "Tested locally." [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1193441 (owner: 10Scott French) [15:37:09] (03CR) 10Scott French: [V:03+2 C:03+2] Fix pending form field preservation on validation failure [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1193441 (owner: 10Scott French) [15:37:13] (03PS1) 10Jcrespo: backup: Reenable backup1012 notifications after setup [puppet] - 10https://gerrit.wikimedia.org/r/1193444 (https://phabricator.wikimedia.org/T403946) [15:37:49] !log swfrench@cumin2002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Deploy: Fix pending form field preservation on validation failure - swfrench@cumin2002" [15:37:51] !log swfrench@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy: Fix pending form field preservation on validation failure - swfrench@cumin2002 [15:38:39] !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy: Fix pending form field preservation on validation failure - swfrench@cumin2002 [15:38:40] !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Deploy: Fix pending form field preservation on validation failure - swfrench@cumin2002" [15:39:11] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:39:52] FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:40:15] (03CR) 10Jcrespo: [C:03+2] backup: Reenable backup1012 notifications after setup [puppet] - 10https://gerrit.wikimedia.org/r/1193444 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [15:44:18] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2057.codfw.wmnet'] [15:44:41] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2058.codfw.wmnet'] [15:47:56] (03CR) 10Cathal Mooney: [C:03+2] Nokia: Add support for Python config generation and JSON-RPC API (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [15:49:41] (03PS1) 10LorenMora: Add ReadingList Stream to EventStreamConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999) [15:50:05] (03CR) 10CI reject: [V:04-1] Add ReadingList Stream to EventStreamConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999) (owner: 10LorenMora) [15:52:04] (03PS2) 10LorenMora: Add ReadingList Stream to EventStreamConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999) [16:00:57] (03Merged) 10jenkins-bot: Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [16:04:09] (03PS1) 10LorenMora: Remove old, unused ArticleSummaries Stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193447 (https://phabricator.wikimedia.org/T406361) [16:05:25] 
06SRE-OnFire, 14SRE-Sprint-Week-Sustainability-March2023, 06serviceops, 07Sustainability (Incident Followup): Update Etcd/Main cluster#Replication documentation with safe restart conditions and information - https://phabricator.wikimedia.org/T317537#11242099 (10Aklapper) [16:05:31] 06SRE-OnFire, 14SRE-Sprint-Week-Sustainability-March2023, 06serviceops, 07Sustainability (Incident Followup): Add etcdmirror connection retry on etcd-tls-proxy unavailability - https://phabricator.wikimedia.org/T317535#11242100 (10Aklapper) [16:07:37] (03PS1) 10Majavah: P:toolforge: Move ru_monuments backwards compat redirect to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1193448 (https://phabricator.wikimedia.org/T283948) [16:07:39] (03PS1) 10Majavah: P:toolforge: Move U-A/Referer blocks to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1193449 (https://phabricator.wikimedia.org/T283948) [16:07:42] (03PS1) 10Majavah: P:toolforge: Move http redirect rewrite to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1193450 (https://phabricator.wikimedia.org/T283948) [16:08:48] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [16:09:29] (03PS9) 10Elukey: sre.hardware.upgrade-firmware: add support for IDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1192898 (https://phabricator.wikimedia.org/T392851) [16:10:50] (03CR) 10Elukey: sre.hardware.upgrade-firmware: add support for IDRAC 10 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1192898 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [16:13:07] (03PS2) 10Majavah: P:toolforge: Move ru_monuments backwards compat redirect to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1193448 (https://phabricator.wikimedia.org/T283948) [16:13:07] (03PS2) 10Majavah: P:toolforge: Move U-A/Referer blocks to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1193449 (https://phabricator.wikimedia.org/T283948) [16:13:07] (03PS2) 10Majavah: 
P:toolforge: Move http redirect rewrite to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1193450 (https://phabricator.wikimedia.org/T283948) [16:13:07] (03PS1) 10Majavah: P:toolforge::k8s::haproxy: Use http-after-response for headers [puppet] - 10https://gerrit.wikimedia.org/r/1193451 (https://phabricator.wikimedia.org/T283948) [16:14:08] (03PS1) 10Jasmine: switchdc/databases: update docs to include all current (x1 & x3) and future x* hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1193452 (https://phabricator.wikimedia.org/T404464) [16:16:32] (03CR) 10Jcrespo: [C:03+1] switchdc/databases: update docs to include all current (x1 & x3) and future x* hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1193452 (https://phabricator.wikimedia.org/T404464) (owner: 10Jasmine) [16:16:52] (03PS3) 10Majavah: P:toolforge: Move ru_monuments backwards compat redirect to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1193448 (https://phabricator.wikimedia.org/T283948) [16:16:56] (03PS3) 10Majavah: P:toolforge: Move U-A/Referer blocks to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1193449 (https://phabricator.wikimedia.org/T283948) [16:17:00] (03PS3) 10Majavah: P:toolforge: Move http redirect rewrite to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1193450 (https://phabricator.wikimedia.org/T283948) [16:17:32] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2058.codfw.wmnet'] [16:20:15] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11242196 (10elukey) Status update: I was able to upgrade idrac+bios of most of the cp hosts, I'll review the remaining ones on Monday and I'll give a precise li... 
[16:20:43] (03CR) 10Jasmine: [C:03+2] switchdc/databases: update docs to include all current (x1 & x3) and future x* hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1193452 (https://phabricator.wikimedia.org/T404464) (owner: 10Jasmine) [16:21:28] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11242202 (10ssingh) >>! In T392851#11242196, @elukey wrote: > Status update: I was able to upgrade idrac+bios of most of the cp hosts, I'll review the remaining... [16:25:55] (03PS4) 10Majavah: P:toolforge: Move http redirect rewrite to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1193450 (https://phabricator.wikimedia.org/T283948) [16:28:18] (03Merged) 10jenkins-bot: switchdc/databases: update docs to include all current (x1 & x3) and future x* hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1193452 (https://phabricator.wikimedia.org/T404464) (owner: 10Jasmine) [16:28:19] (03PS1) 10Majavah: P:toolforge::proxy: Remove config moved to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1193454 (https://phabricator.wikimedia.org/T283948) [16:37:56] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:41:41] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:46:45] (03CR) 10Dzahn: [C:03+1] deployment_server: Add optional scap-clean-images systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1192573 (https://phabricator.wikimedia.org/T401647) (owner: 10Ahmon Dancy) [16:47:09] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/airflow-analytics-test: apply [16:50:39] 10SRE-SLO, 10Charts, 06Reader Growth Team: Finalize Charts SLO - https://phabricator.wikimedia.org/T399613#11242268 (10CDanis) 05Open→03Resolved Dashboards look good to me, let's call this finalized! [16:53:57] (03CR) 10Dzahn: [C:03+2] zuul: adjust zookeeper hosts/port in new zuul config [puppet] - 10https://gerrit.wikimedia.org/r/1193141 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [16:57:31] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [16:59:05] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [16:59:53] (03CR) 10Jasmine: [C:03+2] wmnet: remove mwmaint discovery aliases since turning down production servers [0] [dns] - 10https://gerrit.wikimedia.org/r/1190309 (https://phabricator.wikimedia.org/T397017) (owner: 10Jasmine) [17:02:40] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul1001.eqiad.wmnet with reason: WIP [17:03:20] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul2001.codfw.wmnet with reason: WIP [17:04:52] (03PS1) 10Cathal Mooney: Nokia: adjust how we load static YAML configs [homer/public] - 10https://gerrit.wikimedia.org/r/1193467 (https://phabricator.wikimedia.org/T402577) [17:06:51] (03CR) 10CI reject: [V:04-1] Nokia: adjust how we load static YAML configs [homer/public] - 10https://gerrit.wikimedia.org/r/1193467 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [17:08:08] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul1002.eqiad.wmnet with reason: WIP [17:09:26] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [17:10:48] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 
141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [17:11:02] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 5d739aee5f9c5e9ff558b85b97d9aa77dd9a0511, dns.git is 34cf45e04361344edd34b7af7be3db0fc20ec2a7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:11:04] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 5d739aee5f9c5e9ff558b85b97d9aa77dd9a0511, dns.git is 34cf45e04361344edd34b7af7be3db0fc20ec2a7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:11:06] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 5d739aee5f9c5e9ff558b85b97d9aa77dd9a0511, dns.git is 34cf45e04361344edd34b7af7be3db0fc20ec2a7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:11:06] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 5d739aee5f9c5e9ff558b85b97d9aa77dd9a0511, dns.git is 34cf45e04361344edd34b7af7be3db0fc20ec2a7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:11:22] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 5d739aee5f9c5e9ff558b85b97d9aa77dd9a0511, dns.git is 34cf45e04361344edd34b7af7be3db0fc20ec2a7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:11:25] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul2002.codfw.wmnet with reason: WIP [17:11:30] PROBLEM - check 
if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 5d739aee5f9c5e9ff558b85b97d9aa77dd9a0511, dns.git is 34cf45e04361344edd34b7af7be3db0fc20ec2a7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:11:30] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 5d739aee5f9c5e9ff558b85b97d9aa77dd9a0511, dns.git is 34cf45e04361344edd34b7af7be3db0fc20ec2a7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:11:40] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 5d739aee5f9c5e9ff558b85b97d9aa77dd9a0511, dns.git is 34cf45e04361344edd34b7af7be3db0fc20ec2a7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:11:54] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 5d739aee5f9c5e9ff558b85b97d9aa77dd9a0511, dns.git is 34cf45e04361344edd34b7af7be3db0fc20ec2a7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:12:38] ah man, sorry folks, that was me [17:12:42] jasmine_ that you— [17:12:44] frick [17:12:46] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 5d739aee5f9c5e9ff558b85b97d9aa77dd9a0511, dns.git is 34cf45e04361344edd34b7af7be3db0fc20ec2a7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:13:18] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is CRITICAL: Local zone files are NOT in sync with 
operations/dns.git (SHA: local is 5d739aee5f9c5e9ff558b85b97d9aa77dd9a0511, dns.git is 34cf45e04361344edd34b7af7be3db0fc20ec2a7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:13:18] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 5d739aee5f9c5e9ff558b85b97d9aa77dd9a0511, dns.git is 34cf45e04361344edd34b7af7be3db0fc20ec2a7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:13:22] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 5d739aee5f9c5e9ff558b85b97d9aa77dd9a0511, dns.git is 34cf45e04361344edd34b7af7be3db0fc20ec2a7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:14:06] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 5d739aee5f9c5e9ff558b85b97d9aa77dd9a0511, dns.git is 34cf45e04361344edd34b7af7be3db0fc20ec2a7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:14:20] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 5d739aee5f9c5e9ff558b85b97d9aa77dd9a0511, dns.git is 34cf45e04361344edd34b7af7be3db0fc20ec2a7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:14:29] jasmine_: no worries [17:14:56] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 5d739aee5f9c5e9ff558b85b97d9aa77dd9a0511, dns.git is 34cf45e04361344edd34b7af7be3db0fc20ec2a7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:15:12] 
are you ok to fix? just a matter of running authdns_update ? [17:19:20] thanks topranks! yeah let me confirm it's okay to proceed and then go ahead and run it [17:19:43] no probs, just butting my head in as I am on call :) [17:19:45] i mistakenly submitted that +2 forgetting it was Friday 🤦 [17:20:03] ah yes ty, much appreciate! [17:20:49] haha no probs [17:21:01] I had a quick look at the patch, seems quite straightforward [17:21:13] I think "sudo authdns-update" on any of our authdns boxes will resolve the issue [17:21:15] (03PS1) 10Dzahn: zuul: fix typo in template, add zookeeper_server param to executor class [puppet] - 10https://gerrit.wikimedia.org/r/1193469 (https://phabricator.wikimedia.org/T395938) [17:21:31] (03CR) 10CI reject: [V:04-1] zuul: fix typo in template, add zookeeper_server param to executor class [puppet] - 10https://gerrit.wikimedia.org/r/1193469 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [17:23:03] I would recommend to first remove the DNS name from all other places in the repo before deleting it. [17:23:13] there could be more follow-ups lurking [17:24:03] yes +1 [17:24:27] see "grep -r mwmaint *" in the puppet repo [17:24:32] (03PS2) 10Cathal Mooney: Nokia: adjust how we load static YAML configs [homer/public] - 10https://gerrit.wikimedia.org/r/1193467 (https://phabricator.wikimedia.org/T402577) [17:24:41] ah yes that makes sense, ty mutante [17:24:50] thanks folks! 
[17:25:11] 10ops-eqiad, 06DC-Ops: Alert for device ps1-a2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406367 (10phaultfinder) 03NEW [17:26:03] I guess the easiest way to fix the alert while also not applying it is to submit a revert, merge both and then run an "empty" authdns-update [17:26:32] yeah that would work [17:26:41] essentially as long as the HEAD on the remote repo and the local directory are in sync [17:26:45] that's all the check is about [17:27:06] if git_authdns_head=$(git ls-remote $ORIGIN HEAD 2>/dev/null); then [17:27:38] (03PS1) 10Jasmine: Revert "wmnet: remove mwmaint discovery aliases since turning down production servers [0]" [dns] - 10https://gerrit.wikimedia.org/r/1193471 [17:27:58] !log jasmine@dns1004 START - running authdns-update [17:28:25] (03CR) 10Dzahn: [C:03+1] Revert "wmnet: remove mwmaint discovery aliases since turning down production servers [0]" [dns] - 10https://gerrit.wikimedia.org/r/1193471 (owner: 10Jasmine) [17:28:42] jasmine_: run it after merging the second change.. 
so that there is no actual change [17:29:14] like change 1 and change 2 will cancel each other out in the diff [17:30:21] (03CR) 10Jasmine: [C:03+2] Revert "wmnet: remove mwmaint discovery aliases since turning down production servers [0]" [dns] - 10https://gerrit.wikimedia.org/r/1193471 (owner: 10Jasmine) [17:30:43] yeah and I am guessing the current run will need to be cancelled [17:30:56] !log jasmine@dns1004 START - running authdns-update [17:31:39] !log jasmine@dns1004 START - running authdns-update [17:32:46] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:33:04] there you go:) [17:33:05] !log jasmine@dns1004 END - running authdns-update [17:33:18] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:33:18] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:33:22] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:33:28] (03PS3) 10Cathal Mooney: Nokia: adjust how we load static YAML configs [homer/public] - 10https://gerrit.wikimedia.org/r/1193467 (https://phabricator.wikimedia.org/T402577) [17:34:03] :D [17:34:06] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:34:14] I do feel this alert is very 
noisy [17:34:20] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:34:36] we should move it to alertmanager [17:34:56] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:35:42] still better than "number of DNS servers where authdns-update was not run is over threshold" ? :p [17:36:00] yep :) [17:36:02] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:36:04] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:36:06] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:36:06] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:36:20] problem is that there may actually be a case where one DNS server was not pooled for authdns-update [17:36:21] phew, silly mistake on my part!) 
thanks mutante and sukhe - much appreciate [17:36:22] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:36:30] so it may not be in sync with the rest of the pool [17:36:30] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:36:32] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:36:40] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:36:42] so there is certainly value in alerting per host but when all of them are not, perhaps one alert is better [17:36:48] jasmine_: no worries, sorry this is noisy [17:36:52] I will take up the fix next week [17:36:54] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [17:37:56] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:39:12] jasmine_: no worries. maybe you could do patches to remove those from scap config and tcpircbot, hieradata.. I would do separate patches and worry about them next week [17:39:13] +1 to moving it to prometheus-based alerts. 
I feel like there should be a not-too-wild way to make this "equivalently informative" (tells you which hosts) but "less noisy" (only one alert) [17:39:27] yeah [17:39:42] when we first this, this spamming was somewhat intentional, though a bad decision on hindsight [17:39:46] since we are now at 16 DNS boxes [17:40:00] * sukhe adds it to his list for next week [17:40:03] still definitely better to have than not at all! :) [17:40:20] imho it's more important to make those alert using a different method that isn't "IRC only" [17:41:00] mutante: we can do email alertmanager as well. though I doubt we should make it paging [17:41:08] I would say we should but I know I am probably wrong on that :P [17:41:41] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:42:49] email (to the right list), automatic task (with the right tags). maybe the biggest question is what list and what tags [17:43:14] list cannot be sre-traffic since well, others run authdns-update too and perhaps more than we do [17:43:22] but if it is root@, no one will read it there [17:43:23] :) [17:43:25] but either way it seems first step is to not have 10 times the same one [17:43:38] yep, 16 times since 16 hosts but yeah [17:43:48] a total of 32 lines including recovery [17:44:43] that being said, it does seem possible to have a situation where one of the DNS servers was not updated but the others were [17:44:50] if the script gets interrupted [17:44:56] yep [17:45:15] that's a big no-no from an operational point of view [17:45:26] (03CR) 10Aude: [C:04-1] Add ReadingList Stream to EventStreamConfig (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999) (owner: 10LorenMora) [17:46:17] indeed... 
I'd be somewhat agreeable to this being a paging alert fwiw sukhe. but only cos I love dns (reverse v6) so much! [17:46:27] well.. as long is it's still in Icinga .. it's easier because we can just put any logic in a bash script. including "alert if this is not the same for all 16 servers" [17:46:32] topranks: stop chasing that dopamine! [17:46:38] hahaha [17:46:47] once you move it to alertmanager.. not sure [17:47:40] mutante: yeah I hear you [17:51:11] ironically the fact that it's noisy with so many lines makes it more likely that it gets a swift reaction like that [17:51:47] a good way to say it's important but slightly under the level of p.aging :p [17:52:28] that's what I meant by [17:52:29] > 13:39:41 < sukhe> when we first this, this spamming was somewhat intentional, [17:52:32] :P [17:53:00] :) yea. a wall of text does not get overlooked, hah [17:54:29] (03PS3) 10LorenMora: Add ReadingList Stream to EventStreamConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999) [17:56:07] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [17:56:12] (03CR) 10LorenMora: Add ReadingList Stream to EventStreamConfig (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999) (owner: 10LorenMora) [17:56:50] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [17:57:16] (03CR) 10LorenMora: Add ReadingList Stream to EventStreamConfig (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999) (owner: 10LorenMora) [17:58:50] I dunno mutante, if there's anything I've learned from lurking here it's that a wall of text yelling about some piece of infrastructure having GONE CRITICAL for hours at end actually being the equivalent severity of a "didn't wash hands" sign at a 
bathroom, while a single line saying something isn't "replicating" actually means the internet is about to explode. [17:59:15] perryprog: yeah we do have serious alert apathy in some ways [17:59:26] only when it's OSM :P [17:59:27] I wouldn't deny that at all [17:59:40] this one though is pretty critical IMO but I may be biased [17:59:57] no +1 if the dns somehow gets borked that's it [17:59:58] we did have a case in which we assumed a patch was merged and authdns-update was run but we only discovered it wasn't days after it [18:00:00] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [18:00:21] I'm just sad I didn't get the volunteer cred for noticing the patch that caused it and mentioning it fast enough 😔 [18:00:27] (that's all that matters, after all) [18:01:00] lol [18:01:30] perryprog: <3 [18:02:23] Maybe get some fancy new physical alert p.agers that give you a shock every time icinga fires a message - that'd reduce the alert apathy pretty fast. :) [18:02:48] (03PS4) 10LorenMora: Add ReadingList Stream to EventStreamConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999) [18:02:49] haha [18:03:15] (03CR) 10LorenMora: Add ReadingList Stream to EventStreamConfig (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999) (owner: 10LorenMora) [18:03:58] perryprog: tbh I think that the whole approach of "SREs look at an IRC channel for lines from a bot" didn't scale with the growth of SRE into many channels and subteams [18:05:38] would suggest to replace the alerting method with something async [18:06:10] I mean, I guess there's the pro of having alerts being sent in the same place that people talk in where it effectively forces you to look at it even if you were going to do something unrelated. [18:06:15] speaking just for my subteam.. 
everything is a ticket and I like it [18:07:46] perryprog: that was true and I used the same argument. but then channels were made specifically to "have a place where we can talk" (without the bots) [18:07:50] this is a classic cycle [18:09:00] first a channel.. then bots get added.. we like that it's the same place.. then bots are considered noisy.. more channels are made to have a place without the noise.. then conversations are split between multiple places [18:10:29] ML model that heuristically chooses the most conversationally disruptive location to post monitoring alerts.... [18:10:36] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [18:11:53] hehe. yea.. I mean.. it can't be both "noise" and "alert" at the same time [18:12:03] if it's noise then it should not try to alert in the first place [18:12:24] aka. "if this can be ignored then remove it" ? [18:13:52] Yeah I always assumed that was part of it, but then you sort of have a tragedy of the commons thing. An individual alert that isn't really that important isn't disruptive enough on its own that any one person would go out of their way to disable it, and it's easier to just assume someone else will manage cleaning it up. [18:14:53] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [18:14:54] yea, and for that type of alert that is basically why I would advocate to change alerts to emails and forget the realtime chat [18:15:23] Speaking of OSM [18:15:31] also..we used to have emails from icinga. it just stopped. [18:17:48] emails are async anyway, critical issues don't belong imo [18:17:52] perryprog: seems like the runbook says it's a known issue and to ignore it [18:18:13] we should demote it to warning then. 
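(Annotation, not part of the channel log.) The idea floated above, an alert that is "equivalently informative" (tells you which hosts) but "less noisy" (only one alert), could be sketched in bash roughly as follows. This is only an illustration under stated assumptions: `summarize_sync`, its `host=sha` argument format, and the message wording are hypothetical and are not taken from the real Icinga check; the only borrowed convention is the Nagios-style exit code (0 for OK, 2 for CRITICAL).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the real check):
#   summarize_sync REMOTE_SHA host=sha [host=sha ...]
# Prints ONE line naming every host whose local SHA differs from REMOTE_SHA,
# instead of one alert line per host.
summarize_sync() {
    local remote_sha=$1
    shift
    local stale=()
    local pair host sha
    for pair in "$@"; do
        host=${pair%%=*}   # text before the first '='
        sha=${pair#*=}     # text after the first '='
        if [[ $sha != "$remote_sha" ]]; then
            stale+=("$host")
        fi
    done
    if (( ${#stale[@]} == 0 )); then
        echo "OK: all $# hosts in sync with operations/dns.git"
        return 0
    fi
    echo "CRITICAL: ${#stale[@]}/$# hosts NOT in sync: ${stale[*]}"
    return 2   # Nagios/Icinga convention for CRITICAL
}
```

In such a sketch the remote SHA would presumably come from something like the `git ls-remote $ORIGIN HEAD` call quoted earlier in the log, and the per-host SHAs from whatever each DNS box reports; how those values are collected is left open here.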
[18:19:21] brett: I think part of the issue is that there is an entire class of issues that are not considered critical enough for paging.. but also not unimportant enough to remove monitoring.. which are likely to not get any response if the only alerting method is a line in a fast-moving chat [18:19:50] yea, that makes them actually just warnings [18:20:19] for warnings it might be appropriate to async inform a team [18:23:10] mutante: https://phabricator.wikimedia.org/p/phaultfinder/ [18:23:14] demote to warnings .. yes.. but the immediate follow-up question is "how should warnings alert people" [18:23:58] cdanis: yea! that's what we are using in my subteam [18:24:47] email or task (personally I think task is even better), +1 [18:24:53] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:37:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 06 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [18:42:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:et-1/0/5 (Core: ssw1-d8-eqiad:ethernet-1/32 {#B00392}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:55:25] (03CR) 10Jdlrobson: [C:03+1] Remove old, unused ArticleSummaries Stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193447 (https://phabricator.wikimedia.org/T406361) (owner: 10LorenMora) [19:06:18] (03PS2) 10Dzahn: zuul: fix typo in template, 
add zookeeper_server param to executor class [puppet] - 10https://gerrit.wikimedia.org/r/1193469 (https://phabricator.wikimedia.org/T395938) [19:07:10] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul1001.eqiad.wmnet with reason: WIP [19:07:20] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul1002.eqiad.wmnet with reason: WIP [19:07:36] (03CR) 10Dr0ptp4kt: profile::thanos: fix xlab SLI's recording rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193437 (https://phabricator.wikimedia.org/T398869) (owner: 10Elukey) [19:09:21] (03CR) 10Dzahn: [C:03+2] zuul: fix typo in template, add zookeeper_server param to executor class [puppet] - 10https://gerrit.wikimedia.org/r/1193469 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [19:10:46] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [19:12:11] 10ops-eqiad, 06SRE, 06DC-Ops: asw2-a4-eqiad:PEM 1 is not powered - https://phabricator.wikimedia.org/T401886#11242664 (10VRiley-WMF) Hey @wiki_willy I just tried to login to no avail. 
I have already emailed the support team again as well [19:15:30] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 3 others: decommission druid100[7-8].eqiad.wmnet - https://phabricator.wikimedia.org/T403801#11242668 (10VRiley-WMF) a:03VRiley-WMF [19:29:41] (03PS1) 10Dzahn: zuul: use zuul_main_nodes to determine zookeeper server [puppet] - 10https://gerrit.wikimedia.org/r/1193487 (https://phabricator.wikimedia.org/T395938) [19:29:57] (03CR) 10CI reject: [V:04-1] zuul: use zuul_main_nodes to determine zookeeper server [puppet] - 10https://gerrit.wikimedia.org/r/1193487 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [19:32:35] (03PS2) 10Dzahn: zuul: use zuul_main_nodes to determine zookeeper server [puppet] - 10https://gerrit.wikimedia.org/r/1193487 (https://phabricator.wikimedia.org/T395938) [19:33:01] (03CR) 10CI reject: [V:04-1] zuul: use zuul_main_nodes to determine zookeeper server [puppet] - 10https://gerrit.wikimedia.org/r/1193487 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [19:33:42] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11242738 (10BTracy-WMF) 05Resolved→03Open When trying to access https://superset.wikimedia.org, I'm receiving the error message "Service access denied d... 
[19:37:56] !log LDAP added user btracy to group wmf T405366 [19:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:59] T405366: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366 [19:39:53] FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:41:44] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11242771 (10Dzahn) Yes, it seems like that is the case. This goes back to T405366#11210719. I think you also needed to request the LDAP group "wmf". Which... [19:44:35] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11242774 (10BTracy-WMF) 05Open→03Resolved I have access now, thanks @Dzahn ! [19:44:41] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11242776 (10Dzahn) 05Resolved→03Open [19:45:20] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11242778 (10Dzahn) 05Open→03Resolved ah:) cool. I did not mean to open it again. that was just because I had the tab alrea... 
[19:48:45] (03PS3) 10Dzahn: zuul: use zuul_main_nodes to determine zookeeper server [puppet] - 10https://gerrit.wikimedia.org/r/1193487 (https://phabricator.wikimedia.org/T395938) [20:04:22] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1193487/7194/" [puppet] - 10https://gerrit.wikimedia.org/r/1193487 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [20:56:56] Hi all, Ahmon Dancy from the SRE team pointed me to this channel as I'm seeking help with a GitLab CI/CD job trying to publish the Docker image to our registry. It keeps failing with the following message: error: failed to solve: failed to push docker-registry.discovery.wmnet/repos/data-engineering/airflow-dags:airflow-2.10.5-py3.11-2025-10-03-192132-3003d4328df66a0086a350fdd2ba1dbd80a235c5: unknown: blob upload invalid [21:09:55] amastilovic: I'm going to file a phab ticket for that issue [21:09:55] (03PS1) 10Hamish: Allow AbuseFilter to block on ganwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193502 (https://phabricator.wikimedia.org/T406220) [21:10:04] amastilovic: I would suggest asking in #wikimedia-releng to see if anything has changed recently involving the gitlab runners [21:10:49] dancy: this is `err.code="blob upload invalid" err.detail="blob invalid length"` which suggests the client is not uploading all blob chunks before closing [21:11:03] oh hey Scott. I sent amastilovic here to ask if someone could help see the registry system related logs around the time of these failed pushes. [21:11:07] @swfrench-wmf I already did, they've seen the error message before but never got down to the root of it [21:11:09] (03CR) 10CI reject: [V:04-1] Allow AbuseFilter to block on ganwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193502 (https://phabricator.wikimedia.org/T406220) (owner: 10Hamish) [21:12:59] dancy: thanks for opening a task! 
when you have one, I'll add the logs I'm seeing on the registry hosts, and we can go from there [21:13:10] amastilovic: ah, got it - thanks for the context [21:14:27] OK it sounds like we've got the investigation going, tyvm all [21:17:26] swfrench-wmf: https://phabricator.wikimedia.org/T406392 filed [21:17:53] dancy: thanks! [21:39:02] (03CR) 10JHathaway: [C:03+1] "Looks good, it would be nice to not preclude adapting this for Supermicro, but perhaps that is wasted effort now?" [cookbooks] - 10https://gerrit.wikimedia.org/r/1192898 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [21:39:46] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:41:15] (03CR) 10BCornwall: "I would suggest using pathlib instead of os for path operations to avoid numerous os pitfalls." [puppet] - 10https://gerrit.wikimedia.org/r/1193287 (owner: 10Krinkle) [21:41:41] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:45:14] What's been going on with cr2-eqiad? I see a lot of activity with that device in the backscroll [21:46:12] Ongoing work in https://phabricator.wikimedia.org/T402588 ? 
[21:46:41] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:49:46] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:53:20] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:54:14] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 05 Dec 2025 08:25:21 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:54:22] PROBLEM - MD RAID on druid1011 is CRITICAL: CRITICAL: State: degraded, Active: 7, Working: 7, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [21:54:23] ACKNOWLEDGEMENT - MD RAID on druid1011 is CRITICAL: CRITICAL: State: degraded, Active: 7, Working: 7, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T406394 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [21:54:28] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on druid1011 - https://phabricator.wikimedia.org/T406394 (10ops-monitoring-bot) 03NEW [21:55:52] brett: probably? 
looks like ssw1-d8-eqiad is indeed part of the C/D switch refresh (new device) [22:00:05] (03PS1) 10JHathaway: provision: ensure CSMSupport is enabled in MBR mode [cookbooks] - 10https://gerrit.wikimedia.org/r/1193511 [22:04:14] (03CR) 10BCornwall: [C:03+1] wmnet: remove wikikube-ctrl1001 from etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/1193266 (https://phabricator.wikimedia.org/T383227) (owner: 10Jasmine) [22:14:53] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [22:15:34] (03PS5) 10Krinkle: varnish: misc VTC quality of life improvements [puppet] - 10https://gerrit.wikimedia.org/r/1193287 [22:17:01] (03CR) 10Krinkle: varnish: misc VTC quality of life improvements (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1193287 (owner: 10Krinkle) [22:19:08] (03PS6) 10Krinkle: varnish: misc VTC quality of life improvements [puppet] - 10https://gerrit.wikimedia.org/r/1193287 [22:21:54] (03PS7) 10Krinkle: varnish: misc VTC quality of life improvements [puppet] - 10https://gerrit.wikimedia.org/r/1193287 [22:23:27] (03PS8) 10Krinkle: varnish: Refactor 08-mobile vtc to pair req/resp assertions [puppet] - 10https://gerrit.wikimedia.org/r/1193285 [22:23:34] (03PS11) 10Krinkle: varnish: Enable unified mobile routing on all except en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192271 (https://phabricator.wikimedia.org/T403510) [22:24:53] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [22:27:39] (03CR) 10BCornwall: varnish: misc VTC quality of life improvements (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193287 (owner: 
10Krinkle) [22:42:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:et-1/0/5 (Core: ssw1-d8-eqiad:ethernet-1/32 {#B00392}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:43:33] (03PS8) 10Krinkle: varnish: misc VTC quality of life improvements [puppet] - 10https://gerrit.wikimedia.org/r/1193287 [22:44:23] (03CR) 10Krinkle: varnish: misc VTC quality of life improvements (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193287 (owner: 10Krinkle) [22:48:00] (03CR) 10BCornwall: varnish: misc VTC quality of life improvements (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193287 (owner: 10Krinkle) [23:10:20] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:20:10] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 05 Dec 2025 08:25:21 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:39:53] FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired