[00:38:08] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1099392
[00:38:09] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1099392 (owner: TrainBranchBot)
[00:56:22] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1099392 (owner: TrainBranchBot)
[01:02:07] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:08:27] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1099395
[01:08:27] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1099395 (owner: TrainBranchBot)
[01:09:27] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:28:11] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1099395 (owner: TrainBranchBot)
[01:41:30] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Idle https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:01:26] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:20:58] PROBLEM - SSH on bast7001 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:21:58] RECOVERY - SSH on bast7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:01:22] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:09:27] FIRING: [6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:30:55] (PS1) Tim Starling: Add messages for Indonesian Wikivoyage (idwikivoyage) [extensions/WikimediaMessages] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1099405 (https://phabricator.wikimedia.org/T380726)
[03:37:30] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1098918 (https://phabricator.wikimedia.org/T380726) (owner: Tim Starling)
[03:37:30] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [extensions/WikimediaMessages] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1099405 (https://phabricator.wikimedia.org/T380726) (owner: Tim Starling)
[03:38:12] (Merged) jenkins-bot: Create id.wikivoyage.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1098918 (https://phabricator.wikimedia.org/T380726) (owner: Tim Starling)
[03:39:40] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[03:54:43] (Merged) jenkins-bot: Add messages for Indonesian Wikivoyage (idwikivoyage) [extensions/WikimediaMessages] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1099405 (https://phabricator.wikimedia.org/T380726) (owner: Tim Starling)
[03:55:25] !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1098918|Create id.wikivoyage.org (T380726 T352113)]], [[gerrit:1099405|Add messages for Indonesian Wikivoyage (idwikivoyage) (T380726)]]
[03:55:28] T380726: Create Wikivoyage Indonesian - https://phabricator.wikimedia.org/T380726
[03:55:29] T352113: Move the addWiki.php maintenance script from WikimediaMaintenance into MediaWiki core - https://phabricator.wikimedia.org/T352113
[04:12:28] !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1098918|Create id.wikivoyage.org (T380726 T352113)]], [[gerrit:1099405|Add messages for Indonesian Wikivoyage (idwikivoyage) (T380726)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[04:12:31] T380726: Create Wikivoyage Indonesian - https://phabricator.wikimedia.org/T380726
[04:12:32] T352113: Move the addWiki.php maintenance script from WikimediaMaintenance into MediaWiki core - https://phabricator.wikimedia.org/T352113
[04:13:13] !log tstarling@deploy2002 tstarling: Continuing with sync
[04:26:30] !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098918|Create id.wikivoyage.org (T380726 T352113)]], [[gerrit:1099405|Add messages for Indonesian Wikivoyage (idwikivoyage) (T380726)]] (duration: 31m 05s)
[04:26:40] T380726: Create Wikivoyage Indonesian - https://phabricator.wikimedia.org/T380726
[04:26:40] T352113: Move the addWiki.php maintenance script from WikimediaMaintenance into MediaWiki core - https://phabricator.wikimedia.org/T352113
[04:26:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:36:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:39:26] !log on db2123: grant alter ON `%wik%`.* TO `wikiadmin2023`@`10.%`
[04:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:41:26] !log installed id.wikivoyage.org
[04:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:50:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:55:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:02:07] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:09:27] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:14:46] !log on mwmaint2002: mwscript extensions/Wikibase/lib/maintenance/populateSitesTable.php --wiki=idwikivoyage --force-protocol=https
[05:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:22] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:20:28] !log foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol=https
[05:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:33:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:34:13] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:35:50] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:36:06] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:38:58] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:40:50] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 9.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:40:56] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53069 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:55:09] (PS1) Abijeet Patro: Translate: Enable message group subscription feature for some wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1099410 (https://phabricator.wikimedia.org/T372386)
[05:55:39] (PS2) Abijeet Patro: Translate: Enable message group subscription feature for some wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1099410 (https://phabricator.wikimedia.org/T372386)
[05:55:54] (PS3) Abijeet Patro: Translate: Enable message group subscription feature for some wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1099410 (https://phabricator.wikimedia.org/T372386)
[05:58:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[06:00:44] (CR) Marostegui: "What's pending to get this merged?" [cookbooks] - https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146) (owner: Arnaudb)
[06:01:22] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:02:31] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1099410 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[06:08:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[06:14:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[06:16:22] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:19:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[06:20:13] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[06:21:06] (CR) Wangombe: [C:+1] Translate: Enable message group subscription feature for some wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1099410 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[06:30:13] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[06:37:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[06:46:24] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:54:04] Lucas_WMDE: Due to issues, I never started the schema change on s8 that will affect wiki replicas (https://phabricator.wikimedia.org/T367856) I'd like to start it today or tomorrow if you can coordinate the reminder on tech news?
[07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:07:20] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:07:22] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:09:27] FIRING: [6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:14:22] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:14:22] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:18:42] (PS1) Marostegui: Revert "db1233: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1099525
[07:19:38] (CR) Marostegui: [C:+2] Revert "db1233: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1099525 (owner: Marostegui)
[07:27:56] (PS1) Muehlenhoff: Remove access for eoghan [puppet] - https://gerrit.wikimedia.org/r/1099636
[07:28:39] (CR) Nikerabbit: [C:+1] Translate: Enable message group subscription feature for some wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1099410 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[07:28:47] (CR) CI reject: [V:-1] Remove access for eoghan [puppet] - https://gerrit.wikimedia.org/r/1099636 (owner: Muehlenhoff)
[07:30:31] (PS2) Muehlenhoff: Remove access for eoghan [puppet] - https://gerrit.wikimedia.org/r/1099636
[07:31:22] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.196 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:34:05] (PS1) Brouberol: airflow: use an httpGet check for the scheduler liveness probe [deployment-charts] - https://gerrit.wikimedia.org/r/1099637 (https://phabricator.wikimedia.org/T381234)
[07:35:12] (CR) CI reject: [V:-1] airflow: use an httpGet check for the scheduler liveness probe [deployment-charts] - https://gerrit.wikimedia.org/r/1099637 (https://phabricator.wikimedia.org/T381234) (owner: Brouberol)
[07:36:03] (PS2) Brouberol: airflow: use an httpGet check for the scheduler liveness probe [deployment-charts] - https://gerrit.wikimedia.org/r/1099637 (https://phabricator.wikimedia.org/T381234)
[07:37:58] PROBLEM - MariaDB Replica SQL: s3 on db1150 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: mswiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:39:37] will fix
[07:39:40] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:40:41] fixed
[07:40:58] RECOVERY - MariaDB Replica SQL: s3 on db1150 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:46:25] (CR) Slyngshede: [C:+1] "LGTM :-(" [puppet] - https://gerrit.wikimedia.org/r/1099636 (owner: Muehlenhoff)
[07:47:46] (CR) Muehlenhoff: [C:+2] Remove access for eoghan [puppet] - https://gerrit.wikimedia.org/r/1099636 (owner: Muehlenhoff)
[07:51:30] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:56:38] SRE, Ganeti, Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10369997 (MoritzMuehlenhoff)
[07:57:14] (CR) Brouberol: [C:+2] Move hourly gobblin event start-time later [puppet] - https://gerrit.wikimedia.org/r/1099010 (https://phabricator.wikimedia.org/T376144) (owner: Joal)
[07:59:46] (CR) Arnaudb: "The depends-on comment of the CR links to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1090859 which is the remaining dependency o" [cookbooks] - https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146) (owner: Arnaudb)
[08:00:05] Amir1, Urbanecm, and awight: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241202T0800).
[08:00:05] sd0001 and abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:18] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on pc1013.eqiad.wmnet with reason: T373037, host is not pooled
[08:00:20] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037
[08:00:31] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on pc1013.eqiad.wmnet with reason: T373037, host is not pooled
[08:00:39] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled
[08:00:41] T378068: pc1017 crashed - https://phabricator.wikimedia.org/T378068
[08:00:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled
[08:01:03] (CR) Muehlenhoff: [C:+2] cloudweb/codfw1dev: Use firewall::service for firewall definitions [puppet] - https://gerrit.wikimedia.org/r/1098952 (owner: Muehlenhoff)
[08:01:26] I can deploy abijeet's patch.
[08:02:11] hi kart_ , thanks
[08:07:37] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1009.eqiad.wmnet
[08:07:51] SRE, Ganeti, Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10370027 (ops-monitoring-bot) Draining ganeti1009.eqiad.wmnet of running VMs
[08:09:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1009.eqiad.wmnet
[08:09:38] kart_: I have a patch to sync when you're finished https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1099221
[08:09:52] Sure.
[08:09:57] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1099221 (https://phabricator.wikimedia.org/T381178) (owner: Máté Szabó)
[08:11:09] sd0001 isn't around, so I'll start..
[08:11:35] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1099410 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[08:11:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1020.eqiad.wmnet
[08:11:58] SRE, Ganeti, Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10370040 (ops-monitoring-bot) Draining ganeti1020.eqiad.wmnet of running VMs
[08:12:20] (Merged) jenkins-bot: Translate: Enable message group subscription feature for some wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1099410 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[08:12:22] (CR) Marostegui: "Let's get that one out of the way then and we can merge. This is a very useful cookbook" [cookbooks] - https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146) (owner: Arnaudb)
[08:12:42] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1099410|Translate: Enable message group subscription feature for some wikis (T372386)]]
[08:12:44] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386
[08:14:59] (CR) Btullis: [C:+1] airflow: use an httpGet check for the scheduler liveness probe [deployment-charts] - https://gerrit.wikimedia.org/r/1099637 (https://phabricator.wikimedia.org/T381234) (owner: Brouberol)
[08:15:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1020.eqiad.wmnet
[08:15:34] (CR) Brouberol: [C:+2] airflow: use an httpGet check for the scheduler liveness probe [deployment-charts] - https://gerrit.wikimedia.org/r/1099637 (https://phabricator.wikimedia.org/T381234) (owner: Brouberol)
[08:16:24] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:18:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[08:18:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[08:19:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[08:19:42] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[08:19:55] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[08:20:12] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[08:20:37] abijeet: We got some issue with legalteamwiki: https://pastebin.com/JkFyUZ4u
[08:21:10] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[08:21:14] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[08:21:15] kart_: I think this is the same as T380958, maybe
[08:21:16] T380958: httpb fails upon deployment of 1.44.0-wmf.5 - https://phabricator.wikimedia.org/T380958
[08:21:33] kart_, checking
[08:23:33] kostajh: is that blocker? Should I revert or continue? Do you know more details?
[08:23:36] kart_, I can't immediately see what the issue is.
[08:24:44] (CR) Slyngshede: [C:+2] C:apereo_cas disable registry cleaner [puppet] - https://gerrit.wikimedia.org/r/1089638 (owner: Slyngshede)
[08:25:21] abijeet: let me continue and see.
[08:25:24] !log kartik@deploy2002 abi, kartik: Backport for [[gerrit:1099410|Translate: Enable message group subscription feature for some wikis (T372386)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:25:27] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386
[08:25:36] abijeet: can you test?
[08:25:48] kart_, ok checking
[08:27:04] (CR) Elukey: [C:+1] Blacklist erofs [puppet] - https://gerrit.wikimedia.org/r/1099204 (owner: Muehlenhoff)
[08:28:13] (PS1) Slyngshede: data.yaml Remove access for drochford [puppet] - https://gerrit.wikimedia.org/r/1099639
[08:29:06] (CR) CI reject: [V:-1] data.yaml Remove access for drochford [puppet] - https://gerrit.wikimedia.org/r/1099639 (owner: Slyngshede)
[08:29:12] kart_, looks ok
[08:29:22] ops-codfw, SRE-swift-storage, Ceph, DC-Ops: Disk (sdg) failed on moss-be2002 - https://phabricator.wikimedia.org/T381239 (MatthewVernon) NEW
[08:29:35] ops-codfw, SRE, SRE-swift-storage, Ceph, DC-Ops: Disk (sdg) failed on moss-be2002 - https://phabricator.wikimedia.org/T381239#10370103 (MatthewVernon) p:Triage→Medium
[08:29:42] !log kartik@deploy2002 abi, kartik: Continuing with sync
[08:30:22] kart_: it's not a blocker, aiui
[08:31:49] FIRING: HelmReleaseBadStatus: Helm release airflow-analytics-test/production on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-analytics-test - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[08:32:10] kostajh: Thanks. I continued deployment :)
[08:33:00] (PS5) Slyngshede: P:idp Add blackbox probe to production IDP [puppet] - https://gerrit.wikimedia.org/r/1097935 (https://phabricator.wikimedia.org/T380402)
[08:33:48] (PS2) Volans: puppetdb: optimize query [software/cumin] - https://gerrit.wikimedia.org/r/953295 (owner: Jbond)
[08:34:51] (CR) Volans: "I took the liberty to resume this CR, adapt it a bit and I think is now ready to be merged. Thanks for the original submission John, it's " [software/cumin] - https://gerrit.wikimedia.org/r/953295 (owner: Jbond)
[08:34:52] (PS2) Slyngshede: data.yaml Remove access for drochford [puppet] - https://gerrit.wikimedia.org/r/1099639
[08:35:26] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2086.codfw.wmnet with OS bullseye
[08:36:21] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1099410|Translate: Enable message group subscription feature for some wikis (T372386)]] (duration: 23m 39s)
[08:36:23] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386
[08:37:51] kart_, thanks!
[08:39:11] kostajh: you can go ahead. Sorry for delay.
[08:40:04] kart_: thx
[08:40:26] (CR) Muehlenhoff: data.yaml Remove access for drochford (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1099639 (owner: Slyngshede)
[08:41:10] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1099221 (https://phabricator.wikimedia.org/T381178) (owner: Máté Szabó)
[08:41:52] (Merged) jenkins-bot: Remove unused IRS config variables [mediawiki-config] - https://gerrit.wikimedia.org/r/1099221 (https://phabricator.wikimedia.org/T381178) (owner: Máté Szabó)
[08:41:52] (CR) Muehlenhoff: [C:+2] Blacklist erofs [puppet] - https://gerrit.wikimedia.org/r/1099204 (owner: Muehlenhoff)
[08:42:35] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[08:42:46] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[08:43:18] Ok, I'm done. It was labs only, so no sync is necessary
[08:43:44] anything else to deploy?
[08:44:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [08:45:02] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [08:45:03] (03CR) 10Slyngshede: [C:03+2] P:idp Add blackbox probe to production IDP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1097935 (https://phabricator.wikimedia.org/T380402) (owner: 10Slyngshede) [08:46:08] Nothing from me. [08:46:49] RESOLVED: HelmReleaseBadStatus: Helm release airflow-analytics-test/production on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-analytics-test - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:47:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [08:47:38] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [08:49:21] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [08:49:33] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [08:50:44] (03CR) 10Elukey: "Leaving some random comments, hope they make sense :) Nice work!" 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1099225 (https://phabricator.wikimedia.org/T381175) (owner: 10Cathal Mooney) [08:52:56] !log restarting blazegraph on wdqs1019 (BlazegraphFreeAllocatorsDecreasingRapidly) [08:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:13] (03CR) 10Muehlenhoff: [C:03+2] ci: Set Envoy firewall config in nftables-compatible syntax [puppet] - 10https://gerrit.wikimedia.org/r/1092863 (owner: 10Muehlenhoff) [08:58:42] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled ht [08:58:42] kitech.wikimedia.org/wiki/PyBal [08:58:44] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs10 [08:58:44] .wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:59:33] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2086.codfw.wmnet with OS bullseye [09:00:23] looking at wdqs, but (as usual) most likely a client abusing the service [09:01:49] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:02:07] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes 
(icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:02:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [09:02:47] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:04:29] FIRING: [5x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:07:59] (03PS2) 10Brouberol: airflow: gather all egress rules under a single network policyt for each component [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099642 (https://phabricator.wikimedia.org/T377926) [09:09:16] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2086.codfw.wmnet with OS bullseye [09:09:27] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:09:29] RESOLVED: [6x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[09:13:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1020.eqiad.wmnet [09:13:24] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10370220 (10ops-monitoring-bot) Draining ganeti1020.eqiad.wmnet of running VMs [09:16:48] (03PS3) 10Slyngshede: data.yaml Remove access for drochford [puppet] - 10https://gerrit.wikimedia.org/r/1099639 [09:17:00] (03CR) 10Slyngshede: data.yaml Remove access for drochford (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1099639 (owner: 10Slyngshede) [09:20:37] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:20:56] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2086.codfw.wmnet with reason: host reimage [09:22:48] FIRING: PuppetFailure: Puppet has failed on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:23:48] (03PS1) 10Mvolz: Increment chart for citoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099646 (https://phabricator.wikimedia.org/T369084) [09:23:57] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:24:11] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:24:18] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1093323 (owner: 10PipelineBot) [09:24:25] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093325 (owner: 10PipelineBot) [09:24:27] FIRING: [5x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:24:38] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098079 (owner: 10PipelineBot) [09:24:47] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2086.codfw.wmnet with reason: host reimage [09:26:51] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:27:29] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:28:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db1198.eqiad.wmnet with reason: testing [09:28:39] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:28:41] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:28:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
1:00:00 on db1198.eqiad.wmnet with reason: testing [09:28:56] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depoll db1198 to install 10.6.20', diff saved to https://phabricator.wikimedia.org/P71460 and previous config saved to /var/cache/conftool/dbconfig/20241202-092854-marostegui.json [09:29:27] FIRING: [9x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:30:16] (03PS1) 10Elukey: services: add tcp health checks to Tegola's eqiad/codfw configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099649 (https://phabricator.wikimedia.org/T322647) [09:34:25] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:35:18] !log Installing mariadb 10.6.20 on db1198 T378940 [09:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:21] T378940: Compile and package MariaDB 10.11.10 and 10.6.20 - https://phabricator.wikimedia.org/T378940 [09:35:28] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:35:37] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:36:29] (03CR) 10JMeybohm: [C:03+1] services: add tcp health checks to Tegola's eqiad/codfw configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099649 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey) [09:39:23] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:39:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[09:39:38] (03CR) 10Elukey: [C:03+1] "Not 100% familiar with the code but IIUC this is needed to avoid returning a ton of unnecessary data from puppetdb, making easier for it t" [software/cumin] - 10https://gerrit.wikimedia.org/r/953295 (owner: 10Jbond) [09:39:45] (03CR) 10Jforrester: [C:03+1] Add myself to releasers-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1099034 (https://phabricator.wikimedia.org/T381123) (owner: 10Arlolra) [09:39:49] (03CR) 10Elukey: [C:03+2] services: add tcp health checks to Tegola's eqiad/codfw configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099649 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey) [09:41:54] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [09:42:20] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [09:42:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099334 (https://phabricator.wikimedia.org/T381095) (owner: 10Gergő Tisza) [09:42:35] jouncebot: nowandnext [09:42:36] No deployments scheduled for the next 1 hour(s) and 17 minute(s) [09:42:36] In 1 hour(s) and 17 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241202T1100) [09:43:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: optimizing [09:43:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: optimizing [09:44:33] (03CR) 10Btullis: [C:03+1] airflow: gather all egress rules under a single network policyt for each component [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099642 (https://phabricator.wikimedia.org/T377926) (owner: 
10Brouberol) [09:45:00] (03PS1) 10Brouberol: ceph-csi: fix label selector for kubernetes calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099650 [09:45:18] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync [09:45:36] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync [09:45:39] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2086.codfw.wmnet with OS bullseye [09:45:48] 10ops-codfw, 06DC-Ops, 06serviceops: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T381244 (10JMeybohm) 03NEW [09:45:50] (03CR) 10Volans: [C:03+2] puppetdb: optimize query [software/cumin] - 10https://gerrit.wikimedia.org/r/953295 (owner: 10Jbond) [09:46:58] (03CR) 10Btullis: [C:03+1] "LGTM. Thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099650 (owner: 10Brouberol) [09:47:49] !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[2436-2437].codfw.wmnet [09:48:05] 06SRE, 06serviceops, 13Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10370345 (10JMeybohm) [09:48:12] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1099639 (owner: 10Slyngshede) [09:48:38] anyone mind if I deploy citoid to staging only (eqiad)? [09:48:55] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[2436-2437].codfw.wmnet [09:49:07] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team, 10Recommendation-API: Access to deploy recommendation API ML service for Stephane - https://phabricator.wikimedia.org/T381108#10370350 (10elukey) @klausman Hi! Could ML take care of this request? 
[09:50:16] (03CR) 10Stevemunene: [C:03+1] ceph-csi: fix label selector for kubernetes calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099650 (owner: 10Brouberol) [09:50:53] (03CR) 10Brouberol: [C:03+2] ceph-csi: fix label selector for kubernetes calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099650 (owner: 10Brouberol) [09:51:46] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:51:55] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:52:25] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2087.codfw.wmnet with OS bullseye [09:52:32] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on mw[2436-2437].codfw.wmnet with reason: rename/reimage [09:52:49] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on mw[2436-2437].codfw.wmnet with reason: rename/reimage [09:54:03] (03CR) 10Elukey: "Approval from thcipriani@ still missing on the task, we'll be able to merge after it. Thanks for the patch!" [puppet] - 10https://gerrit.wikimedia.org/r/1099034 (https://phabricator.wikimedia.org/T381123) (owner: 10Arlolra) [09:56:17] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host pc1017.eqiad.wmnet with OS bookworm [09:57:21] 06SRE, 10SRE-Access-Requests, 10cloud-services-team (FY2024/2025-Q1-Q2), 13Patch-For-Review: Add permissions for Komla to run WMCS cookbooks - https://phabricator.wikimedia.org/T379159#10370412 (10elukey) 05Stalled→03Declined Setting it to declined for the moment, please re-open when a final decisi... [09:58:24] 06SRE, 10SRE-Access-Requests, 06MediaWiki-Engineering, 13Patch-For-Review: Requesting access to releasers-mediawiki group for ABreault (WMF) - https://phabricator.wikimedia.org/T381123#10370380 (10elukey) @thcipriani Hi! Could you review this request? 
Lemme know if we can move forward :) [09:59:19] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:59:37] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:02:09] (03CR) 10Slyngshede: [C:03+2] data.yaml Remove access for drochford [puppet] - 10https://gerrit.wikimedia.org/r/1099639 (owner: 10Slyngshede) [10:02:43] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Suzanne Wood (WMDE) - https://phabricator.wikimedia.org/T380994#10370438 (10elukey) This task needs T380487 to be completed first, otherwise the `analytics-privatedata-users` settings will not be useful (since the user will not... 
[10:03:02] !log jayme@cumin2002 START - Cookbook sre.hosts.provision for host mw2436.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [10:03:37] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 322, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:04:08] !log jayme@cumin2002 START - Cookbook sre.hosts.provision for host mw2437.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [10:04:15] (03CR) 10Alexandros Kosiaris: [C:03+1] services: add tcp health checks to Tegola's eqiad/codfw configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099649 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey) [10:04:34] marostegui: I’ve added it back to the weekly summary (next edition should go out today) https://www.wikidata.org/w/index.php?title=Wikidata:Status_updates/Next&diff=prev&oldid=2282646493 [10:04:49] (03Merged) 10jenkins-bot: puppetdb: optimize query [software/cumin] - 10https://gerrit.wikimedia.org/r/953295 (owner: 10Jbond) [10:05:01] I don’t think we announced it anywhere else so I’d say you’re probably good to go ahead [10:05:13] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2087.codfw.wmnet with reason: host reimage [10:06:37] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:08:44] Lucas_WMDE: thank you very much! 
I will start it then :) [10:09:10] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2087.codfw.wmnet with reason: host reimage [10:09:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 3 days, 8:00:00 on db1167.eqiad.wmnet with reason: alter [10:09:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 8:00:00 on db1167.eqiad.wmnet with reason: alter [10:10:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: alter [10:10:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: alter [10:10:38] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 322, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:11:26] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 238, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:12:27] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depool db1167 for an alter table', diff saved to https://phabricator.wikimedia.org/P71461 and previous config saved to /var/cache/conftool/dbconfig/20241202-101225-marostegui.json [10:12:50] !log Deploy schema change on db1167 - s8 sanitarium master, there will be days of lag in wikireplicas in s8 T367856 [10:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:53] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [10:12:57] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1017.eqiad.wmnet with reason: host reimage [10:14:03] (03CR) 10Volans: [C:03+2] mysql: remove unused module [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087855 
(owner: 10Volans) [10:16:00] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1017.eqiad.wmnet with reason: host reimage [10:16:42] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2436.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [10:16:45] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2437.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [10:17:22] (03PS1) 10Ladsgroup: Enable new ParserCache key schema on every page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099653 (https://phabricator.wikimedia.org/T373037) [10:21:32] (03CR) 10Jforrester: [C:03+1] UcfirstOverrides: Fix indenting of comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093387 (owner: 10Reedy) [10:25:20] jouncebot: nowandnext [10:25:20] No deployments scheduled for the next 0 hour(s) and 34 minute(s) [10:25:20] In 0 hour(s) and 34 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241202T1100) [10:25:28] (03Abandoned) 10Volans: tox: add python 3.11 [software/cumin] - 10https://gerrit.wikimedia.org/r/953294 (owner: 10Jbond) [10:25:29] (03CR) 10Ladsgroup: [C:03+2] Enable new ParserCache key schema on every page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099653 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup) [10:25:39] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:26:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099653 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup) [10:26:27] (03Merged) 10jenkins-bot: Enable new 
ParserCache key schema on every page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099653 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup) [10:26:43] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1099653|Enable new ParserCache key schema on every page (T373037)]] [10:26:45] this patch is going to inflict pain on the infrastructure but there is no way around it [10:26:46] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [10:27:58] (03CR) 10KCVelaga: Add Metrics Platform stream configuration for translate_extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097499 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [10:28:00] (03Merged) 10jenkins-bot: mysql: remove unused module [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087855 (owner: 10Volans) [10:28:28] (03CR) 10KCVelaga: Add Metrics Platform stream configuration for translate_extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097499 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [10:31:12] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2087.codfw.wmnet with OS bullseye [10:32:50] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [10:32:55] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[10:33:22] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1099653|Enable new ParserCache key schema on every page (T373037)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:33:24] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [10:35:28] (03CR) 10Volans: [C:03+2] mysql_legacy: rename to mysql [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087856 (owner: 10Volans) [10:35:35] (03CR) 10CI reject: [V:04-1] mysql_legacy: rename to mysql [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087856 (owner: 10Volans) [10:35:54] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2088.codfw.wmnet with OS bullseye [10:36:23] (03PS7) 10Volans: mysql_legacy: rename to mysql [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087856 [10:36:23] (03PS4) 10Volans: mysql: make fetch_one_row return always a dict [software/spicerack] - 10https://gerrit.wikimedia.org/r/1091278 [10:36:46] (03PS1) 10Jelto: Rename kubernetes1017 to wikikube-worker1005 [puppet] - 10https://gerrit.wikimedia.org/r/1099654 (https://phabricator.wikimedia.org/T377876) [10:37:25] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [10:37:48] (03PS1) 10JMeybohm: Rename mw243[67] to wikikube-worker200[56] [puppet] - 10https://gerrit.wikimedia.org/r/1099655 (https://phabricator.wikimedia.org/T377877) [10:38:04] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc1017.eqiad.wmnet with OS bookworm [10:41:54] (03CR) 10JMeybohm: [C:03+1] Rename kubernetes1017 to wikikube-worker1005 [puppet] - 10https://gerrit.wikimedia.org/r/1099654 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [10:42:06] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1099655 (https://phabricator.wikimedia.org/T377877) (owner: 10JMeybohm) [10:42:18] (03CR) 10JMeybohm: [C:03+2] Rename mw243[67] to wikikube-worker200[56] [puppet] - 
10https://gerrit.wikimedia.org/r/1099655 (https://phabricator.wikimedia.org/T377877) (owner: 10JMeybohm) [10:43:21] PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 13986MiB (3% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [10:44:09] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1099653|Enable new ParserCache key schema on every page (T373037)]] (duration: 17m 25s) [10:44:12] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [10:44:12] (03CR) 10JMeybohm: [C:03+2] Rename kubernetes1017 to wikikube-worker1005 [puppet] - 10https://gerrit.wikimedia.org/r/1099654 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [10:45:11] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes1017.eqiad.wmnet [10:45:44] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes1017.eqiad.wmnet [10:46:08] !log jayme@cumin2002 START - Cookbook sre.hosts.rename from mw2436 to wikikube-worker2005 [10:46:19] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [10:46:28] !log jayme@cumin2002 START - Cookbook sre.hosts.rename from mw2437 to wikikube-worker2006 [10:46:46] (03PS1) 10Hashar: ci: add WikimediaMessages to git cache [puppet] - 10https://gerrit.wikimedia.org/r/1099657 (https://phabricator.wikimedia.org/T374717) [10:47:58] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099658 [10:48:33] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:48:37] PROBLEM - BGP 
status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:48:39] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:48:41] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:48:52] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2088.codfw.wmnet with reason: host reimage [10:50:02] (03CR) 10Effie Mouzeli: [C:03+1] O:openstack: cloudweb: Drop nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/1094492 (https://phabricator.wikimedia.org/T371378) (owner: 10Majavah) [10:50:06] (03CR) 10Kosta Harlan: "Caused T381250" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099410 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [10:50:42] !log jayme@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2436 to wikikube-worker2005 - jayme@cumin2002" [10:51:09] !log jayme@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2436 to wikikube-worker2005 - jayme@cumin2002" [10:51:09] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:51:10] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2005 [10:51:22] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [10:51:23] !log jayme@cumin2002 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2005 [10:52:02] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2088.codfw.wmnet with reason: host reimage [10:52:04] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2436 to wikikube-worker2005 [10:53:21] https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1&from=now-6h&to=now&viewPanel=25 [10:53:24] I'm scared [10:53:52] Amir1: expected? [10:54:00] yeah [10:54:46] !log jayme@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2437 to wikikube-worker2006 - jayme@cumin2002" [10:54:51] !log jayme@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2437 to wikikube-worker2006 - jayme@cumin2002" [10:54:51] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:54:52] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2006 [10:55:05] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2006 [10:55:12] (03PS1) 10Marostegui: db1167: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1099660 [10:55:30] (03CR) 10Marostegui: [C:03+1] mariadb: productionize db2226 [puppet] - 10https://gerrit.wikimedia.org/r/1087902 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [10:55:37] (03PS3) 10Abijeet Patro: Translate: Disable message group subscription feature for legalteamwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099659 (https://phabricator.wikimedia.org/T372386) [10:55:45] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2437 to wikikube-worker2006 [10:55:59] (03CR) 10Marostegui: [C:03+2] db1167: Disable notifications 
[puppet] - 10https://gerrit.wikimedia.org/r/1099660 (owner: 10Marostegui) [10:56:24] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T381244#10370706 (10JMeybohm) [10:57:07] (03CR) 10Kosta Harlan: Translate: Disable message group subscription feature for legalteamwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099659 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [10:59:48] (03CR) 10Clément Goubert: Translate: Disable message group subscription feature for legalteamwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099659 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [11:00:00] (03PS4) 10Abijeet Patro: Translate: Disable message group subscription feature for legalteamwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099659 (https://phabricator.wikimedia.org/T372386) [11:00:03] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1017 to wikikube-worker1005 [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241202T1100) [11:00:23] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [11:00:36] (03CR) 10Abijeet Patro: Translate: Disable message group subscription feature for legalteamwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099659 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [11:01:08] (03CR) 10Jforrester: [C:03+1] "Oh, good idea, very frequently referenced repo for CI jobs and lots of noisy commit activity." 
[puppet] - 10https://gerrit.wikimedia.org/r/1099657 (https://phabricator.wikimedia.org/T374717) (owner: 10Hashar) [11:01:14] !log jayme@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2005.codfw.wmnet wikikube-worker2006.codfw.wmnet on all recursors [11:01:17] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2005.codfw.wmnet wikikube-worker2006.codfw.wmnet on all recursors [11:02:06] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2006.codfw.wmnet with OS bookworm [11:02:17] !log jayme@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2006 [11:02:41] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2005.codfw.wmnet with OS bookworm [11:04:17] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1017 to wikikube-worker1005 - jelto@cumin1002" [11:05:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1017 to wikikube-worker1005 - jelto@cumin1002" [11:05:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:05:02] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1005 [11:05:29] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [11:06:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1005 [11:06:49] 06SRE, 06serviceops, 13Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10370772 (10JMeybohm) [11:07:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1017 to wikikube-worker1005 [11:07:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] 
START helmfile.d/admin 'apply'. [11:08:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099659 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [11:08:10] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:08:45] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:08:48] (03Merged) 10jenkins-bot: Translate: Disable message group subscription feature for legalteamwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099659 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [11:09:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:09:06] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1099659|Translate: Disable message group subscription feature for legalteamwiki (T372386 T381250)]] [11:09:10] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386 [11:09:11] T381250: Uncaught MediaWiki\Config\ConfigException: Translate: Message group subscriptions (TranslateEnableMessageGroupSubscription) are enabled but Echo extension is not installed - https://phabricator.wikimedia.org/T381250 [11:09:27] FIRING: [6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:13:14] !log jayme@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2006 - jayme@cumin2002" [11:13:20] !log jayme@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host 
wikikube-worker2006 - jayme@cumin2002" [11:13:20] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:13:21] !log jayme@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2006.codfw.wmnet 141.32.192.10.in-addr.arpa 1.4.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:13:24] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2006.codfw.wmnet 141.32.192.10.in-addr.arpa 1.4.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:13:25] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2006 [11:13:35] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2006 [11:13:36] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2006 [11:14:20] !log jayme@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2005 [11:14:25] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [11:14:55] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache kubernetes1017.eqiad.wmnet wikikube-worker1005.eqiad.wmnet on all recursors [11:14:58] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1099659|Translate: Disable message group subscription feature for legalteamwiki (T372386 T381250)]] [11:14:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubernetes1017.eqiad.wmnet wikikube-worker1005.eqiad.wmnet on all recursors [11:15:01] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386 [11:15:02] T381250: Uncaught MediaWiki\Config\ConfigException: Translate: Message group subscriptions (TranslateEnableMessageGroupSubscription) are enabled but Echo extension is not installed - https://phabricator.wikimedia.org/T381250 [11:15:25] !log elukey@cumin1002 END 
(PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2088.codfw.wmnet with OS bullseye [11:15:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:15:56] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:17:38] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10370833 (10Marostegui) Thank you Jenn [11:19:26] !log jayme@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2005 - jayme@cumin2002" [11:19:30] !log ladsgroup@deploy2002 abi, ladsgroup: Backport for [[gerrit:1099659|Translate: Disable message group subscription feature for legalteamwiki (T372386 T381250)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:19:31] !log jayme@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2005 - jayme@cumin2002" [11:19:32] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:19:32] !log jayme@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2005.codfw.wmnet 40.32.192.10.in-addr.arpa 0.4.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:19:35] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2005.codfw.wmnet 40.32.192.10.in-addr.arpa 0.4.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:19:36] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2005 [11:19:37] !log ladsgroup@deploy2002 abi, ladsgroup: Continuing with sync [11:21:06] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depool es2020 T381259', diff saved to 
https://phabricator.wikimedia.org/P71463 and previous config saved to /var/cache/conftool/dbconfig/20241202-112105-marostegui.json [11:21:09] T381259: Productionize es204[1-6] - https://phabricator.wikimedia.org/T381259 [11:22:39] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1005.eqiad.wmnet with OS bookworm [11:23:01] (03CR) 10Hashar: [C:03+1] "I have applied the patch on the integration project Puppet server and I have confirmed CI clones WikimediaMessages extension in half a sec" [puppet] - 10https://gerrit.wikimedia.org/r/1099657 (https://phabricator.wikimedia.org/T374717) (owner: 10Hashar) [11:23:22] !log rollback OSPF metric change on cr4-ulsfo to place all codfw to eqsin traffic back on primary transport link [11:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:30] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2005 [11:23:30] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2005 [11:24:21] (03PS1) 10Marostegui: es2020: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1099666 [11:24:26] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:26:19] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1099659|Translate: Disable message group subscription feature for legalteamwiki (T372386 T381250)]] (duration: 11m 21s) [11:26:23] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386 [11:26:23] T381250: Uncaught MediaWiki\Config\ConfigException: Translate: Message group subscriptions (TranslateEnableMessageGroupSubscription) are enabled but Echo extension is not installed - 
https://phabricator.wikimedia.org/T381250 [11:26:52] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:28:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:28:44] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:29:27] FIRING: [9x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:30:39] FIRING: KubernetesAPILatency: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-dse&var-latency_percentile=0.95&var-verb=GET - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:33:11] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2006.codfw.wmnet with reason: host reimage [11:35:39] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-dse&var-latency_percentile=0.95&var-verb=GET - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:36:50] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
wikikube-worker2006.codfw.wmnet with reason: host reimage [11:37:50] (03CR) 10Marostegui: [C:03+2] es2020: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1099666 (owner: 10Marostegui) [11:38:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on es2020.codfw.wmnet with reason: cloning [11:38:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es2020.codfw.wmnet with reason: cloning [11:38:44] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1005.eqiad.wmnet with reason: host reimage [11:39:40] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:40:20] (03PS2) 10ArielGlenn: systemd job to create missing local accounts on loginwiki/metawiki [puppet] - 10https://gerrit.wikimedia.org/r/1088552 (https://phabricator.wikimedia.org/T378401) [11:42:35] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2005.codfw.wmnet with reason: host reimage [11:42:40] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1005.eqiad.wmnet with reason: host reimage [11:42:56] (03PS1) 10Marostegui: mariadb: Productionize es2041 [puppet] - 10https://gerrit.wikimedia.org/r/1099667 (https://phabricator.wikimedia.org/T381259) [11:43:51] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize es2041 [puppet] - 10https://gerrit.wikimedia.org/r/1099667 (https://phabricator.wikimedia.org/T381259) (owner: 10Marostegui) [11:44:12] (03PS1) 10Ladsgroup: mysql: Allow for multiinstance clone [cookbooks] - 10https://gerrit.wikimedia.org/r/1099668 
[11:45:43] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ms-be1070.eqiad.wmnet with reason: vacuum two overlarge container dbs [11:46:10] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be1070.eqiad.wmnet with reason: vacuum two overlarge container dbs [11:46:16] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10370940 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7d420931-1c23-4bd9-986f-551d7aa8fefc) set by mvernon@cumin... [11:46:20] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2005.codfw.wmnet with reason: host reimage [11:46:25] (03CR) 10ArielGlenn: [C:03+2] systemd job to create missing local accounts on loginwiki/metawiki [puppet] - 10https://gerrit.wikimedia.org/r/1088552 (https://phabricator.wikimedia.org/T378401) (owner: 10ArielGlenn) [11:50:17] (03CR) 10CI reject: [V:04-1] mysql: Allow for multiinstance clone [cookbooks] - 10https://gerrit.wikimedia.org/r/1099668 (owner: 10Ladsgroup) [11:51:25] 06SRE, 06Commons, 06Data-Persistence (work done), 10MediaWiki-extensions-WikibaseClient, and 7 others: [C-DIS][SW] Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730#10370943 (10karapayneWMDE) [11:52:42] !log mvernon@cumin2002 START - Cookbook sre.hosts.remove-downtime for ms-be1070.eqiad.wmnet [11:52:43] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be1070.eqiad.wmnet [11:54:08] (03PS4) 10Cathal Mooney: Adjust how we build list of server BGP peerings for CRs [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1099225 (https://phabricator.wikimedia.org/T381175) [11:54:37] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated 
sqlite files - https://phabricator.wikimedia.org/T377827#10371001 (10MatthewVernon) We saw similar with ms-be1070 alerting on disk space today, it had two 7G db files, corresponding to `wikipe... [11:54:37] (03CR) 10Cathal Mooney: "Thanks for the review, updated now." [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1099225 (https://phabricator.wikimedia.org/T381175) (owner: 10Cathal Mooney) [11:55:17] (03CR) 10CI reject: [V:04-1] Adjust how we build list of server BGP peerings for CRs [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1099225 (https://phabricator.wikimedia.org/T381175) (owner: 10Cathal Mooney) [11:55:46] !log Stop mariadb on es2020 to clone es2041 T381259 [11:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:51] T381259: Productionize es204[1-6] - https://phabricator.wikimedia.org/T381259 [11:56:31] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2006.codfw.wmnet with OS bookworm [11:56:58] !log jiji@cumin1002 START - Cookbook sre.hosts.decommission for hosts mc-gp2001.codfw.wmnet [11:57:31] !log upload mapnik 4.0.3+ds-2~wmf12u2 (adding a forward ported mapnik-config script to be consumed by node-mapnik even with the switch of mapnik 4 towards pkg-config) T327396 [11:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:34] T327396: Migrate Kartotherian to node-mapnik v4.2.1 and unfork - https://phabricator.wikimedia.org/T327396 [12:00:12] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10371012 (10MatthewVernon) ...though on the contrary argument (for horizontal expansion), we've also seen nodes filling their relatively modest NICs. I came, here, though, to mention... 
[12:01:07] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1005012568 and 56 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:01:26] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1005.eqiad.wmnet with OS bookworm [12:02:16] !log jiji@cumin1002 START - Cookbook sre.dns.netbox [12:03:07] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 186056 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:03:19] (03PS1) 10Btullis: Revert "Upgrade the remainder of the cephosd cluster to nftables" [puppet] - 10https://gerrit.wikimedia.org/r/1099669 [12:04:47] !log homer 'cr*eqiad*' commit 'T377876' [12:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:50] T377876: Migrate wikikube-eqiad to containerd - https://phabricator.wikimedia.org/T377876 [12:05:41] !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc-gp2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1002" [12:06:24] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2005.codfw.wmnet with OS bookworm [12:07:27] (03PS2) 10Wangombe: Add Metrics Platform stream configuration for translate_extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097499 (https://phabricator.wikimedia.org/T364460) [12:07:31] (03PS2) 10Btullis: Revert "Upgrade the remainder of the cephosd cluster to nftables" [puppet] - 10https://gerrit.wikimedia.org/r/1099669 (https://phabricator.wikimedia.org/T381264) [12:07:39] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1099669 (https://phabricator.wikimedia.org/T381264) (owner: 10Btullis) [12:08:03] (03CR) 
10Brouberol: [C:03+1] Revert "Upgrade the remainder of the cephosd cluster to nftables" [puppet] - 10https://gerrit.wikimedia.org/r/1099669 (https://phabricator.wikimedia.org/T381264) (owner: 10Btullis) [12:08:08] (03CR) 10CI reject: [V:04-1] Add Metrics Platform stream configuration for translate_extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097499 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [12:08:45] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10371048 (10Ladsgroup) If you do a vaccum on all container dbs for `wikipedia-commons-local-thumb.0x` that should save you a decent chunk already. ` root@ms-fe2009:~# swift stat -v -... [12:09:18] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087856 (owner: 10Volans) [12:10:26] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10371055 (10Ladsgroup) ugh, that's eqiad. I'm only cleaning codfw now, shall I start eqiad too? 
[12:11:26] (03PS5) 10Cathal Mooney: Adjust how we build list of server BGP peerings for CRs [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1099225 (https://phabricator.wikimedia.org/T381175) [12:12:39] (03CR) 10CI reject: [V:04-1] Adjust how we build list of server BGP peerings for CRs [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1099225 (https://phabricator.wikimedia.org/T381175) (owner: 10Cathal Mooney) [12:13:53] !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc-gp2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1002" [12:13:53] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:13:55] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc-gp2001.codfw.wmnet [12:14:05] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: Decommission mc-gp200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T381174#10371068 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1002 for hosts: `mc-gp2001.codfw.wmnet` - mc-gp2001.codfw.wmnet (**... 
[12:17:19] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 234, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:18:10] (03PS3) 10Wangombe: Add Metrics Platform stream configuration for translate_extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097499 (https://phabricator.wikimedia.org/T364460) [12:18:15] !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2005-2006].codfw.wmnet [12:18:18] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2005-2006].codfw.wmnet [12:20:37] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:21:33] (03CR) 10Volans: [C:03+2] mysql_legacy: rename to mysql [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087856 (owner: 10Volans) [12:22:09] !log re-routing traffic from Drmrs towards TECHLIB-TCZ - AS2852 - National Library of Technology, Prague, to avoid path via GEANT [12:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:21] (03PS1) 10Michael Große: surfacing: prepare tracking visible highlights [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099681 (https://phabricator.wikimedia.org/T377097) [12:23:28] (03PS1) 10Michael Große: fix(surfacing): change "Yes"-button to look like the "No"-button [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099682 (https://phabricator.wikimedia.org/T380296) [12:23:35] (03PS1) 10Michael Große: Show preview for the article to link [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099683 (https://phabricator.wikimedia.org/T376680) [12:23:42] (03PS1) 10Michael Große: feat(surfacing): track performance metrics 
[extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099684 (https://phabricator.wikimedia.org/T377097) [12:23:51] (03PS1) 10Michael Große: instrumentation(StructuredTaskSurfacer): add read mode interaction tracking [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099685 (https://phabricator.wikimedia.org/T377097) [12:23:57] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:24:11] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:24:27] FIRING: [5x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:25:23] (03PS3) 10Arnaudb: aliases.yaml: add db-clouddb-sanitzation [puppet] - 10https://gerrit.wikimedia.org/r/1090859 (https://phabricator.wikimedia.org/T366146) [12:27:29] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:30:31] 06SRE, 06serviceops, 10Wikimedia-Site-requests, 13Patch-For-Review: Change $wgMaxArticleSize limit from byte-based to character-based - https://phabricator.wikimedia.org/T275319#10371119 (10POMI-OLIYN) We have been waiting for this update for a really long time. It doesn't concern only existing pages (like... 
[12:34:01] (03Merged) 10jenkins-bot: mysql_legacy: rename to mysql [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087856 (owner: 10Volans) [12:42:14] (03Abandoned) 10Michael Große: surfacing: prepare tracking visible highlights [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099681 (https://phabricator.wikimedia.org/T377097) (owner: 10Michael Große) [12:42:20] (03Abandoned) 10Michael Große: fix(surfacing): change "Yes"-button to look like the "No"-button [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099682 (https://phabricator.wikimedia.org/T380296) (owner: 10Michael Große) [12:42:26] (03Abandoned) 10Michael Große: Show preview for the article to link [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099683 (https://phabricator.wikimedia.org/T376680) (owner: 10Michael Große) [12:42:31] (03Abandoned) 10Michael Große: feat(surfacing): track performance metrics [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099684 (https://phabricator.wikimedia.org/T377097) (owner: 10Michael Große) [12:42:36] (03Abandoned) 10Michael Große: instrumentation(StructuredTaskSurfacer): add read mode interaction tracking [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099685 (https://phabricator.wikimedia.org/T377097) (owner: 10Michael Große) [12:45:16] (03PS1) 10Michael Große: Growth: enable temporary Surfacing Alpha on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099690 (https://phabricator.wikimedia.org/T379976) [12:46:52] (03CR) 10Urbanecm: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099690 (https://phabricator.wikimedia.org/T379976) (owner: 10Michael Große) [12:49:03] (03CR) 10Marostegui: [C:03+1] "I am not very keen on having hostnames there, but I think we can go for this for now." 
[puppet] - 10https://gerrit.wikimedia.org/r/1090859 (https://phabricator.wikimedia.org/T366146) (owner: 10Arnaudb) [12:50:01] (03CR) 10Volans: [C:03+2] mysql: make fetch_one_row return always a dict [software/spicerack] - 10https://gerrit.wikimedia.org/r/1091278 (owner: 10Volans) [12:50:18] (03CR) 10Arnaudb: [C:03+2] aliases.yaml: add db-clouddb-sanitzation [puppet] - 10https://gerrit.wikimedia.org/r/1090859 (https://phabricator.wikimedia.org/T366146) (owner: 10Arnaudb) [12:50:54] (03PS1) 10Michael Große: Prepare for surfacing structured tasks (squashed) [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099691 (https://phabricator.wikimedia.org/T379976) [12:52:57] (03PS47) 10Arnaudb: sre.mysql.sanitize-wiki: sanitize wiki cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146) [12:54:53] (03CR) 10Arnaudb: "Ready for review, post I69f86ea1f5fe023116cae1d2fca4ac7565ec9821 and mysql_legacy renaming to mysql" [cookbooks] - 10https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146) (owner: 10Arnaudb) [12:57:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1020.eqiad.wmnet [12:58:24] (03CR) 10CI reject: [V:04-1] sre.mysql.sanitize-wiki: sanitize wiki cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146) (owner: 10Arnaudb) [13:01:07] PROBLEM - Disk space on rpki2003 is CRITICAL: DISK CRITICAL - free space: /var/lib/routinator/repository 163MiB (3% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=rpki2003&var-datasource=codfw+prometheus/ops [13:01:09] (03Merged) 10jenkins-bot: mysql: make fetch_one_row return always a dict [software/spicerack] - 10https://gerrit.wikimedia.org/r/1091278 (owner: 10Volans) [13:01:33] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'db1198 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P71467 and previous config saved to /var/cache/conftool/dbconfig/20241202-130132-root.json [13:01:53] !log jiji@cumin1002 START - Cookbook sre.hosts.decommission for hosts mc-gp[2002-2003].codfw.wmnet [13:02:07] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:05:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:05:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:06:38] !log homer 'cr*eqiad*' commit 'T377876' [13:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:50] T377876: Migrate wikikube-eqiad to containerd - https://phabricator.wikimedia.org/T377876 [13:12:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:13:54] (03PS1) 10Stevemunene: global_config: add dpe cephosd to external services [puppet] - 10https://gerrit.wikimedia.org/r/1099694 (https://phabricator.wikimedia.org/T381264) [13:14:33] (03CR) 10CI reject: [V:04-1] global_config: add dpe cephosd to external services [puppet] - 10https://gerrit.wikimedia.org/r/1099694 (https://phabricator.wikimedia.org/T381264) (owner: 10Stevemunene) [13:16:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1198 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P71469 and previous config 
saved to /var/cache/conftool/dbconfig/20241202-131638-root.json [13:17:16] !log jiji@cumin1002 START - Cookbook sre.dns.netbox [13:17:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:17:33] (03PS2) 10Stevemunene: global_config: add dpe cephosd to external services [puppet] - 10https://gerrit.wikimedia.org/r/1099694 (https://phabricator.wikimedia.org/T381264) [13:18:10] (03CR) 10CI reject: [V:04-1] global_config: add dpe cephosd to external services [puppet] - 10https://gerrit.wikimedia.org/r/1099694 (https://phabricator.wikimedia.org/T381264) (owner: 10Stevemunene) [13:18:53] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1005.eqiad.wmnet [13:18:55] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1005.eqiad.wmnet [13:19:02] (03PS2) 10Slyngshede: P:idp enable JMX exporter [puppet] - 10https://gerrit.wikimedia.org/r/1098023 (https://phabricator.wikimedia.org/T380402) [13:19:13] (03CR) 10CI reject: [V:04-1] P:idp enable JMX exporter [puppet] - 10https://gerrit.wikimedia.org/r/1098023 (https://phabricator.wikimedia.org/T380402) (owner: 10Slyngshede) [13:20:30] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:20:43] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:21:06] (03CR) 10Volans: "@moritz are you ok with this change?" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1089664 (https://phabricator.wikimedia.org/T379351) (owner: 10Volans) [13:21:21] (03PS3) 10Slyngshede: P:idp enable JMX exporter [puppet] - 10https://gerrit.wikimedia.org/r/1098023 (https://phabricator.wikimedia.org/T380402) [13:21:38] (03PS3) 10Stevemunene: global_config: add dpe cephosd to external services [puppet] - 10https://gerrit.wikimedia.org/r/1099694 (https://phabricator.wikimedia.org/T381264) [13:21:46] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:21:53] !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc-gp[2002-2003].codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1002" [13:21:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:22:12] (03CR) 10CI reject: [V:04-1] global_config: add dpe cephosd to external services [puppet] - 10https://gerrit.wikimedia.org/r/1099694 (https://phabricator.wikimedia.org/T381264) (owner: 10Stevemunene) [13:22:44] (03CR) 10Brouberol: [C:03+2] airflow: gather all egress rules under a single network policyt for each component [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099642 (https://phabricator.wikimedia.org/T377926) (owner: 10Brouberol) [13:22:46] 10ops-eqiad, 06collaboration-services, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381268 (10Jelto) 03NEW [13:22:48] FIRING: PuppetFailure: Puppet has failed on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:24:12] !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: 
mc-gp[2002-2003].codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1002" [13:24:12] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:24:13] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc-gp[2002-2003].codfw.wmnet [13:24:27] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: Decommission mc-gp200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T381174#10371258 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1002 for hosts: `mc-gp[2002-2003].codfw.wmnet` - mc-gp2002.codfw.wm... [13:24:47] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4608/co" [puppet] - 10https://gerrit.wikimedia.org/r/1098023 (https://phabricator.wikimedia.org/T380402) (owner: 10Slyngshede) [13:25:48] (03CR) 10Slyngshede: P:idp enable JMX exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1098023 (https://phabricator.wikimedia.org/T380402) (owner: 10Slyngshede) [13:26:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:27:33] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:29:58] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes1018.eqiad.wmnet [13:30:18] (03PS1) 10Gergő Tisza: SUL3: Set $wgCentralAuthSul3SharedDomainRestrictions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099697 (https://phabricator.wikimedia.org/T377142) [13:30:19] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:30:25] (03PS6) 10Cathal Mooney: Adjust how we build list of server BGP peerings for CRs [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1099225 (https://phabricator.wikimedia.org/T381175) 
[13:30:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:30:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes1018.eqiad.wmnet [13:31:04] (03Abandoned) 10Arnaudb: mariadb: add 12 new es hosts [puppet] - 10https://gerrit.wikimedia.org/r/1083758 (https://phabricator.wikimedia.org/T378143) (owner: 10Arnaudb) [13:31:05] (03CR) 10CI reject: [V:04-1] SUL3: Set $wgCentralAuthSul3SharedDomainRestrictions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099697 (https://phabricator.wikimedia.org/T377142) (owner: 10Gergő Tisza) [13:31:30] (03PS4) 10Stevemunene: global_config: add dpe cephosd to external services [puppet] - 10https://gerrit.wikimedia.org/r/1099694 (https://phabricator.wikimedia.org/T381264) [13:31:37] !log repacing kafka-main1003 in production with kafka-main1008 - T363214 [13:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:40] T363214: kafka-main100[6789] and kafka-main1010 implementation tracking - https://phabricator.wikimedia.org/T363214 [13:31:43] (03CR) 10CI reject: [V:04-1] Adjust how we build list of server BGP peerings for CRs [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1099225 (https://phabricator.wikimedia.org/T381175) (owner: 10Cathal Mooney) [13:31:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1198 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P71470 and previous config saved to /var/cache/conftool/dbconfig/20241202-133143-root.json [13:31:44] (03CR) 10Muehlenhoff: [C:03+1] "Looks good to me" [cookbooks] - 10https://gerrit.wikimedia.org/r/1089664 (https://phabricator.wikimedia.org/T379351) (owner: 10Volans) [13:31:56] (03PS1) 10Jelto: Rename kubernetes1018 to wikikube-worker1006 [puppet] - 10https://gerrit.wikimedia.org/r/1099698 (https://phabricator.wikimedia.org/T377876) [13:32:53] (03PS7) 10Cathal Mooney: Adjust how we 
build list of server BGP peerings for CRs [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1099225 (https://phabricator.wikimedia.org/T381175) [13:33:02] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1098023 (https://phabricator.wikimedia.org/T380402) (owner: 10Slyngshede) [13:34:09] (03CR) 10CI reject: [V:04-1] Adjust how we build list of server BGP peerings for CRs [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1099225 (https://phabricator.wikimedia.org/T381175) (owner: 10Cathal Mooney) [13:34:23] (03PS1) 10Brouberol: airflow: grant task pods access to Kerberos [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099699 (https://phabricator.wikimedia.org/T377926) [13:35:23] (03PS8) 10Cathal Mooney: Adjust how we build list of server BGP peerings for CRs [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1099225 (https://phabricator.wikimedia.org/T381175) [13:37:20] (03PS5) 10Stevemunene: global_config: add dpe cephosd to external services [puppet] - 10https://gerrit.wikimedia.org/r/1099694 (https://phabricator.wikimedia.org/T381264) [13:37:37] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kafka-main[1002,1007].eqiad.wmnet with reason: Hardware refresh [13:37:52] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kafka-main[1002,1007].eqiad.wmnet with reason: Hardware refresh [13:38:21] (03CR) 10Stevemunene: "lgtm!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099699 (https://phabricator.wikimedia.org/T377926) (owner: 10Brouberol) [13:38:59] (03CR) 10Btullis: [C:03+1] "Looks good." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1099699 (https://phabricator.wikimedia.org/T377926) (owner: 10Brouberol) [13:39:09] (03CR) 10JMeybohm: Rename kubernetes1018 to wikikube-worker1006 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1099698 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [13:39:33] (03CR) 10Brouberol: [C:03+2] airflow: grant task pods access to Kerberos [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099699 (https://phabricator.wikimedia.org/T377926) (owner: 10Brouberol) [13:39:51] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1099694 (https://phabricator.wikimedia.org/T381264) (owner: 10Stevemunene) [13:40:05] (03PS2) 10Arnaudb: mariadb: add new ES hosts [puppet] - 10https://gerrit.wikimedia.org/r/1099696 (https://phabricator.wikimedia.org/T378143) [13:40:06] (03CR) 10Arnaudb: "following up on your highlight: https://phabricator.wikimedia.org/T378143#10320743" [puppet] - 10https://gerrit.wikimedia.org/r/1099696 (https://phabricator.wikimedia.org/T378143) (owner: 10Arnaudb) [13:40:27] (03PS2) 10Jelto: Rename kubernetes1018 to wikikube-worker1006 [puppet] - 10https://gerrit.wikimedia.org/r/1099698 (https://phabricator.wikimedia.org/T377876) [13:40:47] (03CR) 10Jelto: Rename kubernetes1018 to wikikube-worker1006 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1099698 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [13:41:59] !log isaranto@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[13:42:12] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:42:59] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:44:14] PROBLEM - Kafka broker TLS certificate validity on kafka-main1003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [13:44:18] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka-main1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [13:44:19] PROBLEM - Kafka Broker Server #page on kafka-main1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [13:44:27] FIRING: [2x] SystemdUnitFailed: git_pull_charts.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:44:29] !incidents [13:44:30] 5498 (UNACKED) kafka-main1003/Kafka Broker Server (paged) [13:44:33] !ack 5498 [13:44:33] 5498 (ACKED) kafka-main1003/Kafka Broker Server (paged) [13:44:37] hallo effie :) [13:44:40] that's expected.. 
effie is doing some maintenance AFAIK [13:44:49] that is mee [13:44:53] but wtf alert manager [13:45:01] I added the damn thing [13:45:12] effie: now you owe jayme and me some beer in ATL [13:45:31] 🍻 [13:45:36] I am curious what is that silence I added [13:45:38] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:45:48] (03PS2) 10Gergő Tisza: SUL3: Set $wgCentralAuthSul3SharedDomainRestrictions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099697 (https://phabricator.wikimedia.org/T377142) [13:46:13] (03CR) 10Brouberol: [C:03+1] "Looks good. Let's hold off, as might not need this after all." [puppet] - 10https://gerrit.wikimedia.org/r/1099694 (https://phabricator.wikimedia.org/T381264) (owner: 10Stevemunene) [13:46:22] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [13:46:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1198 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P71471 and previous config saved to /var/cache/conftool/dbconfig/20241202-134648-root.json [13:47:34] effie: heh the kafka broker server pages are sent by icinga not prometheus :( [13:48:23] ok found it, bad copy/pasta for icinga downtime [13:48:35] downtime cookbook should address both? 
[13:48:37] I do own janis and valentin a beer fo [13:49:01] vgutierrez: no, my etherpad had the command for the ones I did on Thu [13:49:09] ack [13:49:20] so yes, pebcak, sorry [13:49:27] FIRING: [4x] SystemdUnitFailed: git_pull_charts.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:51:36] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kafka-main[1003,1008].eqiad.wmnet with reason: Hardware refresh [13:51:51] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kafka-main[1003,1008].eqiad.wmnet with reason: Hardware refresh [13:55:30] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10371355 (10MoritzMuehlenhoff) I have backported https://gitlab.com/mailman/mailman/-/commit/353a2adf55381c09bf7c2bc7ddb532a06e3934f8 and built... [13:55:40] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:57:01] !log jiji@cumin1002 START - Cookbook sre.hosts.remove-downtime for kafka-main1007.eqiad.wmnet [13:57:01] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main1007.eqiad.wmnet [13:59:11] (03CR) 10Pmiazga: [beta] SUL3: Enable by default on a few beta wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099334 (https://phabricator.wikimedia.org/T381095) (owner: 10Gergő Tisza) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241202T1400). 
[14:00:05] Daimona and tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:14] i can deploy today [14:00:19] sergi0: hey, around too? [14:00:21] o/ [14:00:22] Oh my I'd forgotten about the deployment [14:00:30] urbanecm: o/ [14:00:39] Daimona: should we skip yours then? [14:00:40] Can I, uh, go first since I gotta go in like 30 mins? :D [14:00:46] sure [14:00:58] !log removing ganeti1020 from active Ganeti nodes T378921 [14:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:00] T378921: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921 [14:01:03] (let me check what patch it is first though. lol.) [14:01:07] (03CR) 10Urbanecm: [C:03+2] Drop $wgWikimediaCampaignEventsEnableCommunityList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099233 (https://phabricator.wikimedia.org/T380075) (owner: 10Daimona Eaytoy) [14:01:17] (03CR) 10Gergő Tisza: [beta] SUL3: Enable by default on a few beta wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099334 (https://phabricator.wikimedia.org/T381095) (owner: 10Gergő Tisza) [14:01:35] Ah yes. Holy Mondays. 
[14:01:44] (03CR) 10Lucas Werkmeister (WMDE): Drop $wgWikimediaCampaignEventsEnableCommunityList (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099233 (https://phabricator.wikimedia.org/T380075) (owner: 10Daimona Eaytoy) [14:01:56] (03Merged) 10jenkins-bot: Drop $wgWikimediaCampaignEventsEnableCommunityList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099233 (https://phabricator.wikimedia.org/T380075) (owner: 10Daimona Eaytoy) [14:02:19] (03PS1) 10Muehlenhoff: ganeti1020: Update site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1099701 [14:02:37] (03CR) 10Urbanecm: [C:03+2] Prepare for surfacing structured tasks (squashed) [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099691 (https://phabricator.wikimedia.org/T379976) (owner: 10Michael Große) [14:02:45] (03CR) 10Daimona Eaytoy: Drop $wgWikimediaCampaignEventsEnableCommunityList (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099233 (https://phabricator.wikimedia.org/T380075) (owner: 10Daimona Eaytoy) [14:03:04] (03PS2) 10Gergő Tisza: [beta] SUL3: Enable by default on a few beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099334 (https://phabricator.wikimedia.org/T381095) [14:03:18] PROBLEM - ganeti-confd running on ganeti1020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [14:03:18] PROBLEM - ganeti-noded running on ganeti1020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [14:03:20] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1099233|Drop $wgWikimediaCampaignEventsEnableCommunityList (T380075)]] [14:03:22] PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 4590MiB (1% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [14:03:22] T380075: Remove feature flag for the collaboration list - https://phabricator.wikimedia.org/T380075 [14:06:15] (03CR) 10Fabfur: [C:03+2] hiera: add log ring to cp4039 [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [14:06:29] (03CR) 10Fabfur: [C:03+2] benthos: add benthos for haproxy debug functions [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [14:06:52] (03PS18) 10Fabfur: hiera: add log ring to cp4039 [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) [14:06:54] (03CR) 10Fabfur: [V:03+2 C:03+2] hiera: add log ring to cp4039 [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [14:07:07] FIRING: [13x] ProbeDown: Service ganeti1020:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:07:48] !log urbanecm@deploy2002 urbanecm, daimona: Backport for [[gerrit:1099233|Drop $wgWikimediaCampaignEventsEnableCommunityList (T380075)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:07:51] (03CR) 10Abijeet Patro: Translate: Disable message group subscription feature for legalteamwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099659 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [14:07:55] Daimona: can you test? 
[14:08:11] (03CR) 10Muehlenhoff: [C:03+2] ganeti1020: Update site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1099701 (owner: 10Muehlenhoff) [14:08:55] !log jiji@cumin1002 START - Cookbook sre.hosts.decommission for hosts mc-gp[1001-1003].eqiad.wmnet [14:09:50] yup [14:10:36] let me know how it looks :) [14:10:54] jouncebot: nowandnext [14:10:55] For the next 0 hour(s) and 49 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241202T1400) [14:10:55] In 2 hour(s) and 19 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241202T1630) [14:10:59] It works AFAICT! [14:11:05] great! [14:11:07] !log urbanecm@deploy2002 urbanecm, daimona: Continuing with sync [14:11:16] (03CR) 10Urbanecm: [C:03+2] [beta] SUL3: Enable by default on a few beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099334 (https://phabricator.wikimedia.org/T381095) (owner: 10Gergő Tisza) [14:11:57] (03Merged) 10jenkins-bot: [beta] SUL3: Enable by default on a few beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099334 (https://phabricator.wikimedia.org/T381095) (owner: 10Gergő Tisza) [14:13:24] tgr|away: your patch'll be deployed to beta automagically...whenever CI decides :) [14:16:28] thx urbanecm [14:16:32] np [14:17:57] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1099233|Drop $wgWikimediaCampaignEventsEnableCommunityList (T380075)]] (duration: 14m 37s) [14:18:00] T380075: Remove feature flag for the collaboration list - https://phabricator.wikimedia.org/T380075 [14:18:05] Daimona: your patch's live [14:18:51] Nice, thank you :) [14:18:55] sergi0: should we first do an equivalent of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1099690 at testwiki, before the pilots go? 
[14:19:38] Yes, that would be safer [14:19:58] (03Merged) 10jenkins-bot: Prepare for surfacing structured tasks (squashed) [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099691 (https://phabricator.wikimedia.org/T379976) (owner: 10Michael Große) [14:20:22] okay, let's do that [14:21:02] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274 (10RobH) 03NEW [14:21:21] (03PS1) 10Urbanecm: [Growth] testwiki: Enable Surfacing structured tasks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099703 (https://phabricator.wikimedia.org/T379976) [14:21:27] sergi0: uploaded ^^ [14:21:54] wait, is add a link enabled in testwiki? [14:21:59] (03CR) 10Urbanecm: [C:03+2] [Growth] testwiki: Enable Surfacing structured tasks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099703 (https://phabricator.wikimedia.org/T379976) (owner: 10Urbanecm) [14:22:07] FIRING: [13x] ProbeDown: Service ganeti1020:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:22:12] sergi0: yes [14:22:15] but for 2% of users [14:22:15] * sergi0 checking config [14:22:21] (enwiki rollout strikes again) [14:22:28] oh right [14:22:35] we should really not do two releases for the same product at the same time next time :) [14:22:41] (03Merged) 10jenkins-bot: [Growth] testwiki: Enable Surfacing structured tasks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099703 (https://phabricator.wikimedia.org/T379976) (owner: 10Urbanecm) [14:22:43] `ge.utils.setUserVariant( 'control' )` should fix that [14:22:57] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10371490 (10RobH) [14:23:21] (03PS1) 10Volans: CHANGELOG: add changelogs for release 
v9.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1099704 [14:23:27] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10371481 (10RobH) a:03MoritzMuehlenhoff @MoritzMuehlenhoff (or @jhathaway): Please note the workflow for racking tasks has changed this fiscal year, and we now require th... [14:23:51] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1099703|[Growth] testwiki: Enable Surfacing structured tasks (T379976)]], [[gerrit:1099691|Prepare for surfacing structured tasks (squashed) (T379976)]] [14:23:52] (03CR) 10JMeybohm: [C:03+1] Rename kubernetes1018 to wikikube-worker1006 [puppet] - 10https://gerrit.wikimedia.org/r/1099698 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [14:23:54] T379976: Surfacing "Add a link" Structured Tasks: Alpha Release Plan and Release Task (FY24/25 WE1.2.6) - https://phabricator.wikimedia.org/T379976 [14:24:19] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:24:37] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:24:54] (03CR) 10Jelto: [C:03+2] Rename kubernetes1018 to wikikube-worker1006 [puppet] - 10https://gerrit.wikimedia.org/r/1099698 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [14:25:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:25:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[14:26:10] (03PS1) 10Muehlenhoff: Add site.pp entry for puppetserver2004 [puppet] - 10https://gerrit.wikimedia.org/r/1099706 (https://phabricator.wikimedia.org/T381274) [14:27:02] !log jiji@cumin1002 START - Cookbook sre.dns.netbox [14:27:31] !log urbanecm@deploy2002 migr, urbanecm: Backport for [[gerrit:1099703|[Growth] testwiki: Enable Surfacing structured tasks (T379976)]], [[gerrit:1099691|Prepare for surfacing structured tasks (squashed) (T379976)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:27:31] (03CR) 10Muehlenhoff: [C:03+2] Add site.pp entry for puppetserver2004 [puppet] - 10https://gerrit.wikimedia.org/r/1099706 (https://phabricator.wikimedia.org/T381274) (owner: 10Muehlenhoff) [14:27:53] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10371507 (10MatthewVernon) I don't think so just yet (not least because I'm a bit twitchy about impact on frontends); the issue arises because some DBs are way too large (2x their unv... [14:27:53] sergi0: can you test? 
[14:27:59] (at testwiki only for now) [14:28:00] sure [14:28:17] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1018 to wikikube-worker1006 [14:28:34] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:28:34] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:29:05] !log jiji@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:29:05] !log jiji@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mc-gp[1001-1003].eqiad.wmnet [14:29:08] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [14:30:20] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: Decommission mc-gp100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T381173#10371514 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1002 for hosts: `mc-gp[1001-1003].eqiad.wmnet` - mc-gp1001.eqiad.wm... [14:30:34] I'm seeing a client error `TypeError: Failed to construct 'URL':`. The API returns `"//test.wikipedia.org/wiki/Special:Homepage`, no good. Let me try to fix that quickly [14:31:30] that doesn't seem like a good thing... 
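Note on the `TypeError: Failed to construct 'URL'` reported at 14:30:34: the log does not show the actual fix, so what follows is only a minimal sketch of the likely cause. It assumes the client passed the protocol-relative string straight to the `URL` constructor without a base. The helper name `toAbsoluteUrl` and the base `https://test.wikipedia.org/` are hypothetical and not taken from GrowthExperiments; in a browser the base would more plausibly be `window.location.href`.

```javascript
// Hypothetical helper (not from GrowthExperiments): resolve a possibly
// protocol-relative URL against a base before building a URL object.
function toAbsoluteUrl(href, base) {
  // With a base, the parser inherits the scheme from it, so
  // "//host/path" becomes "<base scheme>//host/path".
  return new URL(href, base);
}

// Value quoted in the chat as the API response.
const protocolRelative = '//test.wikipedia.org/wiki/Special:Homepage';

// Without a base, the constructor rejects a string that has no scheme.
let threwTypeError = false;
try {
  new URL(protocolRelative);
} catch (err) {
  threwTypeError = err instanceof TypeError;
}

// Assumed base for illustration only.
const resolved = toAbsoluteUrl(protocolRelative, 'https://test.wikipedia.org/');

console.log(threwTypeError);
console.log(resolved.href);
```

Passing a base is only one option; returning a fully qualified URL from the API would avoid the problem on the client side as well.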
[14:32:05] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10371537 (10MoritzMuehlenhoff) site.pp entry has been added, preseed config was already covered [14:32:38] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10371545 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03None [14:33:55] (03PS1) 10Effie Mouzeli: kafka-main: Replace kafka-main1003 with kafka-main1008 [puppet] - 10https://gerrit.wikimedia.org/r/1099707 (https://phabricator.wikimedia.org/T363214) [14:34:22] !log installing curl security updates [14:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:14] sergi0: i can reproduce that... [14:35:28] ...suggestion: let's move ahead for testwiki, fix, and we can backport the fix in the evening? [14:35:52] alright [14:36:27] !log urbanecm@deploy2002 migr, urbanecm: Continuing with sync [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:43:00] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1099703|[Growth] testwiki: Enable Surfacing structured tasks (T379976)]], [[gerrit:1099691|Prepare for surfacing structured tasks (squashed) (T379976)]] (duration: 19m 08s) [14:43:05] T379976: Surfacing "Add a link" Structured Tasks: Alpha Release Plan and Release Task (FY24/25 WE1.2.6) - https://phabricator.wikimedia.org/T379976 [14:44:31] !log jelto@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:47:19] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10371597 (10elukey) 
05Open→03Resolved Everything reimaged, we are good :) [14:48:40] (03PS1) 10Ssingh: 10.in-addr.arpa: remove empty include for $ORIGIN 8.65.@Z [dns] - 10https://gerrit.wikimedia.org/r/1099713 [14:49:28] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10371623 (10Jhancock.wm) Hey sorry i missed your message last week. I was on vacation. but making a note of the provisioning re-run w/ tags. Thanks! [14:50:17] (03CR) 10Ssingh: [C:03+2] 10.in-addr.arpa: remove empty include for $ORIGIN 8.65.@Z [dns] - 10https://gerrit.wikimedia.org/r/1099713 (owner: 10Ssingh) [14:50:35] (03CR) 10Fabfur: [C:03+1] 10.in-addr.arpa: remove empty include for $ORIGIN 8.65.@Z [dns] - 10https://gerrit.wikimedia.org/r/1099713 (owner: 10Ssingh) [14:50:37] !log running authdns-update for CR 1099713 [14:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:10] !log sukhe@cumin1002 START - Cookbook sre.dns.netbox [14:55:25] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:57:25] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [14:57:38] marostegui: should the s8 maintenance be mentioned at https://wikitech.wikimedia.org/wiki/Map_of_database_maintenance btw? (it’s linked from https://replag.toolforge.org) [14:57:48] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10371638 (10MatthewVernon) >>! In T379942#10371048, @Ladsgroup wrote: > If you do a vaccum on all container dbs for `wikipedia-commons-local-thumb.0x` that should save you a decent ch... 
[14:58:04] Lucas_WMDE: I will make the proper !log, I did log it but probably didn't use the key words [14:58:13] !log Deploy schema change on db1167 dbmaint eqiad - s8 sanitarium master, there will be days of lag in wikireplicas in s8 T367856 [14:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:16] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [14:58:22] Lucas_WMDE: ^ It should be captured in a few minutes [14:58:25] thanks! [14:58:35] Lucas_WMDE: thanks for letting me know [14:59:48] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:59:48] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1006 [15:00:08] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10371642 (10Ladsgroup) oh no, it's still ongoing but 10% have been cleaned up. It'll take a while until it's fully done (also I'm running the clean up on all containers from 00 to 0f... [15:01:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1006 [15:01:10] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10371649 (10MatthewVernon) Oh, OK, cool, sorry I misunderstood you. I think ATM I'd not want to think about vacuuming a container until the deletion of that container has done. 
[15:01:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1018 to wikikube-worker1006 [15:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:26] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v9.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1099704 (owner: 10Volans) [15:03:30] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache kubernetes1018.eqiad.wmnet wikikube-worker1006.eqiad.wmnet on all recursors [15:03:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubernetes1018.eqiad.wmnet wikikube-worker1006.eqiad.wmnet on all recursors [15:03:54] (03PS2) 10Effie Mouzeli: site.pp: decomm mc-gp100[1-3], mc-gp200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/1099212 (https://phabricator.wikimedia.org/T381174) [15:05:15] (03PS2) 10Majavah: O:openstack: Drop codfw1dev db role [puppet] - 10https://gerrit.wikimedia.org/r/1094491 (https://phabricator.wikimedia.org/T369308) [15:05:15] (03PS2) 10Majavah: O:openstack: cloudweb: Drop nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/1094492 (https://phabricator.wikimedia.org/T371378) [15:05:15] (03PS2) 10Majavah: nutcracker: Remove module (and related code) [puppet] - 10https://gerrit.wikimedia.org/r/1094493 [15:05:55] (03CR) 10Effie Mouzeli: [C:03+2] site.pp: decomm mc-gp100[1-3], mc-gp200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/1099212 (https://phabricator.wikimedia.org/T381174) (owner: 10Effie Mouzeli) [15:06:22] (03CR) 10Majavah: [C:03+2] O:openstack: Drop codfw1dev db role [puppet] - 10https://gerrit.wikimedia.org/r/1094491 (https://phabricator.wikimedia.org/T369308) (owner: 10Majavah) [15:06:42] (03CR) 10Majavah: [C:03+2] O:openstack: cloudweb: Drop nutcracker 
[puppet] - 10https://gerrit.wikimedia.org/r/1094492 (https://phabricator.wikimedia.org/T371378) (owner: 10Majavah) [15:08:28] (03CR) 10Majavah: [C:03+2] nutcracker: Remove module (and related code) [puppet] - 10https://gerrit.wikimedia.org/r/1094493 (owner: 10Majavah) [15:08:33] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1006.eqiad.wmnet with OS bookworm [15:09:27] FIRING: [6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:11:33] (03CR) 10Marostegui: [C:03+1] mariadb: add new ES hosts [puppet] - 10https://gerrit.wikimedia.org/r/1099696 (https://phabricator.wikimedia.org/T378143) (owner: 10Arnaudb) [15:12:18] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v9.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1099704 (owner: 10Volans) [15:12:43] (03PS1) 10Krinkle: webperf: set statsv.py --statsd to statsd.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1099720 (https://phabricator.wikimedia.org/T355837) [15:13:27] (03PS1) 10Volans: Upstream release v9.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1099721 [15:14:08] jouncebot: nowandnext [15:14:08] No deployments scheduled for the next 1 hour(s) and 15 minute(s) [15:14:09] In 1 hour(s) and 15 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241202T1630) [15:14:36] (03PS2) 10Krinkle: webperf: set statsv.py --statsd to statsd.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1099720 (https://phabricator.wikimedia.org/T355837) [15:14:51] !log sudo cumin "A:cp" 'disable-puppet "merging CR 1091748"' [trafficserver: remove inbound TLS and related settings] [15:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:32] (03CR) 10Ssingh: 
[V:03+1 C:03+2] trafficserver: remove inbound TLS and related settings [puppet] - 10https://gerrit.wikimedia.org/r/1091748 (owner: 10Ssingh) [15:17:14] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [reason: testing CR 1091748] [15:17:26] (03PS2) 10Majavah: wikitech: Drop contentadmin group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088318 (https://phabricator.wikimedia.org/T375950) [15:17:30] jouncebot: nowandnext [15:17:30] No deployments scheduled for the next 1 hour(s) and 12 minute(s) [15:17:30] In 1 hour(s) and 12 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241202T1630) [15:17:42] mvolz: are you deploying something? [15:18:20] taavi: i was thinking of deploying something on k8 staging only, but people are reimaging today so thought better of it [15:18:38] not sure if they're done? [15:19:23] mvolz: no idea, but I see people deploying above just fine [15:20:01] taavi: did you want to use the window? go ahead and i can go after you're done. [15:20:25] afaik no re-imaging of k8 staging is happening, only for the production wikikube kubernetes clusters. But this should not impact deployments or availability [15:20:33] yea I have a mw config patch. 
thanks [15:20:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088318 (https://phabricator.wikimedia.org/T375950) (owner: 10Majavah) [15:22:40] (03Merged) 10jenkins-bot: wikitech: Drop contentadmin group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088318 (https://phabricator.wikimedia.org/T375950) (owner: 10Majavah) [15:22:57] !log taavi@deploy2002 Started scap sync-world: Backport for [[gerrit:1088318|wikitech: Drop contentadmin group (T375950)]] [15:22:59] T375950: fold contentadmin group to sysop in Wikitech - https://phabricator.wikimedia.org/T375950 [15:24:12] (03CR) 10Andrew Bogott: [C:03+1] "Seems better, although I miss the clarity of _get_project_name_by_id" [puppet] - 10https://gerrit.wikimedia.org/r/1091850 (https://phabricator.wikimedia.org/T380095) (owner: 10Majavah) [15:24:14] (03CR) 10Krinkle: webperf: set statsv.py --statsd to statsd.eqiad.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1099720 (https://phabricator.wikimedia.org/T355837) (owner: 10Krinkle) [15:24:34] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [reason: [done] testing CR 1091748] [15:25:40] (03PS1) 10DCausse: rdf-streaming-updater: add wdqs udpater streams in event stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099727 (https://phabricator.wikimedia.org/T374919) [15:25:54] (03CR) 10Volans: [C:03+2] Upstream release v9.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1099721 (owner: 10Volans) [15:26:01] !log taavi@deploy2002 taavi: Backport for [[gerrit:1088318|wikitech: Drop contentadmin group (T375950)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:26:07] !log taavi@deploy2002 taavi: Continuing with sync [15:28:21] (03CR) 10CDanis: [C:03+1] ml-lab: Allow users to run nvtop and radeontop via sudo [puppet] - 
10https://gerrit.wikimedia.org/r/1098478 (owner: 10Klausman) [15:28:59] (03CR) 10Muehlenhoff: [C:03+1] "Approved in the weekly SRE IF meeting, please go ahead." [puppet] - 10https://gerrit.wikimedia.org/r/1098478 (owner: 10Klausman) [15:29:53] !log sudo cumin -b1 -s10 "A:cp" 'run-puppet-agent --enable "merging CR 1091748"' [15:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:39] !log taavi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1088318|wikitech: Drop contentadmin group (T375950)]] (duration: 09m 42s) [15:32:47] T375950: fold contentadmin group to sysop in Wikitech - https://phabricator.wikimedia.org/T375950 [15:35:49] (03CR) 10Klausman: [V:03+2 C:03+2] ml-lab: Allow users to run nvtop and radeontop via sudo [puppet] - 10https://gerrit.wikimedia.org/r/1098478 (owner: 10Klausman) [15:36:14] (03Merged) 10jenkins-bot: Upstream release v9.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1099721 (owner: 10Volans) [15:36:27] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:36:32] (03CR) 10Neil Shah-Quinn (WMF): "Thank you for the notification @gtisza@wikimedia.org!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098633 (https://phabricator.wikimedia.org/T380646) (owner: 10Gergő Tisza) [15:36:35] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:37:17] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53069 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:37:25] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:39:01] (03CR) 10Mvolz: [C:03+2] Increment chart for citoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099646 (https://phabricator.wikimedia.org/T369084) (owner: 10Mvolz) [15:39:40] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:40:22] (03Merged) 10jenkins-bot: Increment chart for citoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099646 (https://phabricator.wikimedia.org/T369084) (owner: 10Mvolz) [15:42:02] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [15:42:06] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [15:42:22] !log uploaded spicerack_9.0.0 to apt.wikimedia.org bullseye-wikimedia [15:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:13] (03CR) 10Arnaudb: [C:03+2] mariadb: add new ES hosts [puppet] - 10https://gerrit.wikimedia.org/r/1099696 (https://phabricator.wikimedia.org/T378143) (owner: 10Arnaudb) [15:46:49] !log mvolz@deploy2002 helmfile 
[staging] START helmfile.d/services/citoid: apply [15:47:06] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [15:47:38] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 3 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10371973 (10ABran-WMF) Thanks @Jclark-ctr! puppet patch pas been merged [15:50:40] (03PS1) 10Brouberol: Airflow: add comments exlaining the external services requirements [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099735 (https://phabricator.wikimedia.org/T377926) [15:50:55] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye [15:51:06] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10371983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs1025.eqiad.wmnet with OS bullseye [15:52:06] (03CR) 10JHathaway: [C:03+1] vrts: Update mail alias generation script to bail on too many changes [puppet] - 10https://gerrit.wikimedia.org/r/1097556 (https://phabricator.wikimedia.org/T380009) (owner: 10EoghanGaffney) [15:54:27] FIRING: [2x] SystemdUnitFailed: git_pull_charts.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:58:53] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host wdqs1025.eqiad.wmnet with OS bullseye [15:59:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10372010 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs1025.eqiad.wmnet with OS bullseye executed with errors:... 
[15:59:19] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db224[12] - https://phabricator.wikimedia.org/T379757#10372004 (10Jhancock.wm) [15:59:23] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1025.eqiad.wmnet'] [16:00:09] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1025.eqiad.wmnet'] [16:00:20] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1025.eqiad.wmnet'] [16:00:45] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1025.eqiad.wmnet'] [16:04:09] (03PS2) 10Brouberol: Airflow: add comments explaining the external services rationales [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099735 (https://phabricator.wikimedia.org/T377926) [16:06:17] (03PS1) 10Urbanecm: ApiQueryLinkRecommendations: Do not use relative protocol URIs [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099736 (https://phabricator.wikimedia.org/T381277) [16:08:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: wdqs1025 fails to PXE boot - https://phabricator.wikimedia.org/T381283 (10bking) 03NEW [16:13:45] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db224[12] - https://phabricator.wikimedia.org/T379757#10372097 (10Jhancock.wm) @Marostegui we got these servers in today. I'm gonna try to get them ready asap. I wasn't able to confirm that the puppet updates had been added. Could you... 
[16:13:47] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db224[12] - https://phabricator.wikimedia.org/T379757#10372099 (10Jhancock.wm) [16:21:52] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db224[12] - https://phabricator.wikimedia.org/T379757#10372142 (10Jhancock.wm) [16:22:08] (03CR) 10Btullis: [C:03+1] "Thanks" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099735 (https://phabricator.wikimedia.org/T377926) (owner: 10Brouberol) [16:23:59] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10372150 (10Ladsgroup) Sure, early Jan we can vaccum the first 16 containers. [16:24:35] (03PS2) 10Michael Große: Growth: enable temporary Surfacing Alpha on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099690 (https://phabricator.wikimedia.org/T379976) [16:25:57] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1006.eqiad.wmnet with OS bookworm [16:27:31] (03PS1) 10Bking: partman: add recipe for UEFI 4-disk SW RAID-10 [puppet] - 10https://gerrit.wikimedia.org/r/1099740 (https://phabricator.wikimedia.org/T373519) [16:28:07] (03CR) 10Stevemunene: [C:03+1] Airflow: add comments explaining the external services rationales [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099735 (https://phabricator.wikimedia.org/T377926) (owner: 10Brouberol) [16:28:16] (03CR) 10CI reject: [V:04-1] partman: add recipe for UEFI 4-disk SW RAID-10 [puppet] - 10https://gerrit.wikimedia.org/r/1099740 (https://phabricator.wikimedia.org/T373519) (owner: 10Bking) [16:28:21] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10372173 (10MoritzMuehlenhoff) [16:30:05] jan_drewniak: Time to do the Wikimedia Portals Update deploy. Don't look at me like that. You signed up for it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241202T1630). [16:32:57] !log starting portals deploy [16:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:22] (03PS2) 10Bking: partman: add recipe for UEFI 4-disk SW RAID-10 [puppet] - 10https://gerrit.wikimedia.org/r/1099740 (https://phabricator.wikimedia.org/T373519) [16:34:19] !log dancy@deploy2002 Installing scap version "4.130.0" for 207 hosts [16:34:38] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099742 (https://phabricator.wikimedia.org/T128546) [16:34:40] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:35:16] (03CR) 10Michael Große: [C:03+1] ApiQueryLinkRecommendations: Do not use relative protocol URIs [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099736 (https://phabricator.wikimedia.org/T381277) (owner: 10Urbanecm) [16:36:19] (03CR) 10Volans: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1087860 (owner: 10Volans) [16:37:12] (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099742 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:37:56] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099742 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:38:47] !log dancy@deploy2002 Installation of scap version "4.130.0" completed for 207 hosts [16:43:42] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: Decommission mc-gp200[1-3].codfw.wmnet - 
https://phabricator.wikimedia.org/T381174#10372242 (10jijiki) [16:44:52] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: Decommission mc-gp200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T381174#10372252 (10Reedy) [16:47:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: wdqs1025 fails to PXE boot, NIC shows "no link" in DRAC web UI - https://phabricator.wikimedia.org/T381283#10372239 (10bking) [16:51:48] (03CR) 10Volans: [C:03+2] sre.hosts.provision: disable Intel SGX on Dell [cookbooks] - 10https://gerrit.wikimedia.org/r/1089664 (https://phabricator.wikimedia.org/T379351) (owner: 10Volans) [16:52:23] !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1099742| Bumping portals to master (T128546)]] (duration: 10m 36s) [16:52:26] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:52:30] (03CR) 10Volans: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1087861 (owner: 10Volans) [16:54:52] !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:1099742| Bumping portals to master (T128546)]] (duration: 02m 28s) [16:55:37] !log fabfur@cumin1002 START - Cookbook sre.dns.roll-restart rolling restart_daemons on A:dnsbox [16:57:40] (03CR) 10CI reject: [V:04-1] sre.hosts.provision: disable Intel SGX on Dell [cookbooks] - 10https://gerrit.wikimedia.org/r/1089664 (https://phabricator.wikimedia.org/T379351) (owner: 10Volans) [17:00:20] 10ops-codfw, 06SRE, 06DC-Ops: ganeti2042 seems to have a broken CPU? (new Supermicro node) - https://phabricator.wikimedia.org/T378358#10372322 (10Jhancock.wm) cpu shipped and should arrive tomorrow. depending on time of delivery it will be replaced on tuesday or wednesday. 
[17:02:04] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: Decommission mc-gp100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T381173#10372328 (10jijiki) [17:03:21] !log fabfur@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough and A:wikidough [17:04:10] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: Decommission mc-gp100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T381173#10372332 (10jijiki) >>! In T381173#10371514, @ops-monitoring-bot wrote: > cookbooks.sre.hosts.decommission executed by jiji@cumin1002 for hosts: `mc-gp[... [17:06:29] jouncebot: nowandnext [17:06:29] No deployments scheduled for the next 0 hour(s) and 53 minute(s) [17:06:30] In 0 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241202T1800) [17:06:30] In 0 hour(s) and 53 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241202T1800) [17:06:34] (03CR) 10Urbanecm: [C:03+2] ApiQueryLinkRecommendations: Do not use relative protocol URIs [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099736 (https://phabricator.wikimedia.org/T381277) (owner: 10Urbanecm) [17:06:49] * MichaelG_WMF is also here to test [17:07:10] (03CR) 10Ebernhardson: [C:03+1] flink-app: add a component label to the flink-app configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098885 (owner: 10DCausse) [17:07:21] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "missing data for wikikube-worker1006 - jayme@cumin1002" [17:07:28] !log resetting ulsfo->eqsin link to normal metric to put all codfw->eqsin traffic back on Aerlion cct [17:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:55] !log jayme@cumin1002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "missing data for wikikube-worker1006 - jayme@cumin1002" [17:08:49] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T381244#10372349 (10Jhancock.wm) 05Open→03Resolved [17:12:21] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:13:32] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1006.eqiad.wmnet with OS bookworm [17:16:15] !log fabfur@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough and A:wikidough [17:19:09] dancy: i seem to have an infinite loop in scap in front of me. can that be a side effect of your deployment? [17:19:25] 07Puppet, 06cloud-services-team, 10Tools: Too many puppet facts on toolforge k8s workers - https://phabricator.wikimedia.org/T381293 (10Andrew) 03NEW [17:19:30] this did not move for 12 minutes https://www.irccloud.com/pastebin/s3MfgKjG/ [17:19:38] i stopped it, restarted, and it's stuck again [17:20:29] Hmm.. there's no message after that like "waiting for changes to be merged"? [17:20:39] dancy: no. the bot didn't even +2 the change [17:20:48] (03PS1) 10Michael Große: testwiki: no growth experiment anymore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099746 (https://phabricator.wikimedia.org/T380659) [17:20:50] (i did manually tho) [17:20:59] (03CR) 10Urbanecm: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099746 (https://phabricator.wikimedia.org/T380659) (owner: 10Michael Große) [17:21:33] maybe it is waiting for CI without telling me that [17:21:39] but then i'd expect it to +2 [17:22:08] hmm.. 
logging in to poke around [17:22:48] FIRING: PuppetFailure: Puppet has failed on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:23:18] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: Decommission mc-gp100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T381173#10372451 (10jijiki) a:03VRiley-WMF [17:23:25] 07Puppet, 06cloud-services-team, 10Toolforge: Too many puppet facts on toolforge k8s workers - https://phabricator.wikimedia.org/T381293#10372452 (10taavi) [17:23:53] urbanecm: Please control-C your scap run. I want to run it myself. [17:23:54] (03CR) 10Majavah: [C:03+2] opnestack: keystone: Do not provision DNS zones for service projects [puppet] - 10https://gerrit.wikimedia.org/r/1091850 (https://phabricator.wikimedia.org/T380095) (owner: 10Majavah) [17:23:59] dancy: go ahead [17:24:18] thx [17:26:16] (03Merged) 10jenkins-bot: ApiQueryLinkRecommendations: Do not use relative protocol URIs [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099736 (https://phabricator.wikimedia.org/T381277) (owner: 10Urbanecm) [17:30:53] dancy: please let me know once it's time for me to try again [17:31:00] Will do [17:31:10] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1006.eqiad.wmnet with reason: host reimage [17:32:21] (03PS1) 10Andrew Bogott: Puppet agent: allow hiera config of number_of_facts_soft_limit [puppet] - 10https://gerrit.wikimedia.org/r/1099748 (https://phabricator.wikimedia.org/T381293) [17:32:31] FIRING: Traffic bill over quota: Alert for device cr3-ulsfo.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [17:32:43] 10ops-codfw, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Disk (sdg) failed on moss-be2002 - https://phabricator.wikimedia.org/T381239#10372518 
(10Jhancock.wm) a:03Jhancock.wm drive has been replaced. Looks good physically (no amber lights). Let us know if that fixed it, thanks! [17:33:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1006.eqiad.wmnet with reason: host reimage [17:33:56] (03CR) 10JMeybohm: [C:03+1] kafka-main: Replace kafka-main1003 with kafka-main1008 [puppet] - 10https://gerrit.wikimedia.org/r/1099707 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli) [17:37:27] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1099748 (https://phabricator.wikimedia.org/T381293) (owner: 10Andrew Bogott) [17:37:31] FIRING: [3x] Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [17:38:15] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [17:38:23] (03CR) 10Effie Mouzeli: [C:03+2] kafka-main: Replace kafka-main1003 with kafka-main1008 [puppet] - 10https://gerrit.wikimedia.org/r/1099707 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli) [17:41:58] (03PS1) 10DLynch: cawiki: stop Flow being the default for some talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099750 (https://phabricator.wikimedia.org/T381295) [17:42:55] 10ops-codfw, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Disk (sdg) failed on moss-be2002 - https://phabricator.wikimedia.org/T381239#10372560 (10MatthewVernon) 05Open→03Resolved Yes, new disk added, all seems good. 
Thanks for the speedy swap :) [17:43:22] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2241-2 to codfw - jhancock@cumin2002" [17:43:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2241-2 to codfw - jhancock@cumin2002" [17:43:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:43:52] urbanecm: Sorry for the slowness, but I did find the source of the bug. It was related to the last scap release, which I'm just about to revert. [17:44:26] !log dancy@deploy2002 Installing scap version "4.129.0" for 207 hosts [17:44:51] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10372566 (10Dzahn) >>! In T377045#10371355, @MoritzMuehlenhoff wrote: > I have backported https://gitlab.com/mailman/mailman/-/commit/353a2adf55... [17:46:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2241.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:46:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2242.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:46:45] thanks! 
[17:48:44] !log dancy@deploy2002 Installation of scap version "4.129.0" completed for 207 hosts [17:48:51] !log jiji@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-eqiad [17:48:51] (03CR) 10Dzahn: [C:03+2] miscweb: Update Envoy firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1092827 (owner: 10Muehlenhoff) [17:49:54] urbanecm: Try now please [17:49:58] trying [17:50:13] now it works [17:50:25] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1099736|ApiQueryLinkRecommendations: Do not use relative protocol URIs (T381277)]] [17:50:29] T381277: TypeError: Failed to construct 'URL': Invalid URL - https://phabricator.wikimedia.org/T381277 [17:50:51] great! looking forward to testing it [17:52:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1006.eqiad.wmnet with OS bookworm [17:52:31] FIRING: [3x] Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [17:54:21] !log fabfur@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart (exit_code=0) rolling restart_daemons on A:dnsbox [17:54:24] (03CR) 10Dzahn: [C:03+2] scap target: ensure scap is installed on host before it is required [puppet] - 10https://gerrit.wikimedia.org/r/1098933 (https://phabricator.wikimedia.org/T378769) (owner: 10Jaime Nuche) [17:54:50] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1099736|ApiQueryLinkRecommendations: Do not use relative protocol URIs (T381277)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:54:50] !log urbanecm@deploy2002 Sync cancelled. 
[17:56:47] (03PS20) 10Hnowlan: mediawiki: add mercurius features [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) [17:56:47] (03PS1) 10Hnowlan: mediawiki: add multi-job support to mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099752 (https://phabricator.wikimedia.org/T371701) [17:57:27] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1099736|ApiQueryLinkRecommendations: Do not use relative protocol URIs (T381277)]] [17:57:30] T381277: TypeError: Failed to construct 'URL': Invalid URL - https://phabricator.wikimedia.org/T381277 [17:57:31] RESOLVED: [2x] Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [17:58:35] !log jiji@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-eqiad [18:00:03] (03PS1) 10Dzahn: Revert "scap target: ensure scap is installed on host before it is required" [puppet] - 10https://gerrit.wikimedia.org/r/1099754 [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241202T1800) [18:00:05] ryankemper: Time to do the Wikidata Query Service weekly deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241202T1800). 
[18:00:48] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1099736|ApiQueryLinkRecommendations: Do not use relative protocol URIs (T381277)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:00:55] !log urbanecm@deploy2002 urbanecm: Continuing with sync [18:01:04] (03CR) 10Dzahn: [V:03+1 C:03+2] Revert "scap target: ensure scap is installed on host before it is required" [puppet] - 10https://gerrit.wikimedia.org/r/1099754 (owner: 10Dzahn) [18:05:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:06:57] FIRING: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:08:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 3.303% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:09:12] FIRING: [29x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:10:15] FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:10:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web (k8s) 9.594s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:11:42] the errors are "Database servers in cluster27 are overloaded" [18:11:57] RESOLVED: [10x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:12:07] FIRING: [29x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:13:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 3.303% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:14:12] FIRING: [27x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:15:12] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[18:15:15] RESOLVED: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:15:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web (k8s) 2.877s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:15:36] (03PS4) 10Volans: sre.hosts.provision: disable Intel SGX on Dell [cookbooks] - 10https://gerrit.wikimedia.org/r/1089664 (https://phabricator.wikimedia.org/T379351) [18:15:36] (03PS1) 10Volans: setup.py: temporary limit spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/1099762 [18:15:39] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [18:15:41] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [18:15:50] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [18:15:52] !log jiji@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [18:16:26] !log jiji@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [18:16:28] !log jiji@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [18:16:44] !log jiji@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [18:16:46] !log jiji@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [18:17:22] !log jiji@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [18:17:23] !log jiji@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [18:17:41] !log jiji@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [18:17:42] !log jiji@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. 
[18:17:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2241.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:17:59] !log jiji@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [18:18:01] !log jiji@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [18:18:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2242.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:19:16] !log jiji@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [18:23:15] (03CR) 10Volans: [C:03+2] setup.py: temporary limit spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/1099762 (owner: 10Volans) [18:24:11] (03PS1) 10Effie Mouzeli: Update various kafka-main connection strings for kafka-main1008 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099763 (https://phabricator.wikimedia.org/T363214) [18:24:27] FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on wikikube-worker1256:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:26:01] (03PS1) 10Ssingh: sre.roll-restart-reboot-wikimedia-dns: update aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/1099764 [18:28:00] (03CR) 10Scott French: [C:03+1] Update various kafka-main connection strings for kafka-main1008 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099763 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli) [18:28:31] (03CR) 10Effie Mouzeli: [C:03+2] Update various kafka-main connection strings for kafka-main1008 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099763 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli) [18:28:39] !log urbanecm@deploy2002 Started scap sync-world: 
Backport for [[gerrit:1099736|ApiQueryLinkRecommendations: Do not use relative protocol URIs (T381277)]] [18:28:41] T381277: TypeError: Failed to construct 'URL': Invalid URL - https://phabricator.wikimedia.org/T381277 [18:28:52] MichaelG_WMF: i'm trying again ^^ [18:29:06] yay! [18:29:13] (03Merged) 10jenkins-bot: setup.py: temporary limit spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/1099762 (owner: 10Volans) [18:29:14] (03CR) 10Dzahn: [C:03+2] vrts: Update Envoy firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1092838 (owner: 10Muehlenhoff) [18:29:23] (03Merged) 10jenkins-bot: sre.hosts.provision: disable Intel SGX on Dell [cookbooks] - 10https://gerrit.wikimedia.org/r/1089664 (https://phabricator.wikimedia.org/T379351) (owner: 10Volans) [18:29:57] (03PS3) 10Ssingh: LVS: enable do_ipv6_ra_primary in all sites [puppet] - 10https://gerrit.wikimedia.org/r/1093958 (https://phabricator.wikimedia.org/T358260) [18:30:09] (03CR) 10Ssingh: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1093958 (https://phabricator.wikimedia.org/T358260) (owner: 10Ssingh) [18:30:52] (03Merged) 10jenkins-bot: Update various kafka-main connection strings for kafka-main1008 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099763 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli) [18:31:09] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply [18:31:30] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply [18:31:48] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [18:32:05] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [18:32:06] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [18:32:11] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1099736|ApiQueryLinkRecommendations: Do not use relative protocol 
URIs (T381277)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:32:20] !log urbanecm@deploy2002 urbanecm: Continuing with sync [18:32:31] proceeding, tested in the run before [18:32:36] (03CR) 10CI reject: [V:04-1] sre.roll-restart-reboot-wikimedia-dns: update aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/1099764 (owner: 10Ssingh) [18:33:05] (03CR) 10Ssingh: "CI error is ModuleNotFoundError: No module named 'spicerack.mysql_legacy'" [cookbooks] - 10https://gerrit.wikimedia.org/r/1099764 (owner: 10Ssingh) [18:33:05] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [18:33:32] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [18:33:52] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [18:33:53] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [18:34:14] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [18:34:31] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [18:34:32] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [18:34:32] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [18:34:58] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [18:35:07] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [18:35:36] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [18:35:55] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [18:35:56] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [18:36:33] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [18:37:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the 
[Monday, December 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099362 (https://phabricator.wikimedia.org/T381214) (owner: 10NMW03) [18:37:04] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [18:38:46] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2241'] [18:38:47] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2242'] [18:39:14] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1099736|ApiQueryLinkRecommendations: Do not use relative protocol URIs (T381277)]] (duration: 10m 35s) [18:39:17] T381277: TypeError: Failed to construct 'URL': Invalid URL - https://phabricator.wikimedia.org/T381277 [18:39:17] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2241'] [18:39:19] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2242'] [18:39:25] MichaelG_WMF: it's live now! 
[18:39:27] FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on wikikube-worker1256:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:39:36] \o/ [18:39:39] * MichaelG_WMF tests [18:43:10] (03PS1) 10BCornwall: icinga: Remove RSA cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1099768 (https://phabricator.wikimedia.org/T370837) [18:43:12] (03PS1) 10BCornwall: haproxy: Remove RSA certificate [puppet] - 10https://gerrit.wikimedia.org/r/1099769 (https://phabricator.wikimedia.org/T370837) [18:43:32] (03PS4) 10CDobbins: Update geo-maps file's US section [dns] - 10https://gerrit.wikimedia.org/r/1097521 [18:44:04] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/1093958/4610/" [puppet] - 10https://gerrit.wikimedia.org/r/1093958 (https://phabricator.wikimedia.org/T358260) (owner: 10Ssingh) [18:45:18] jouncebot: next [18:45:18] In 2 hour(s) and 14 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241202T2100) [18:46:47] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db224[12] - https://phabricator.wikimedia.org/T379757#10372874 (10Jhancock.wm) [18:47:27] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4611/console" [puppet] - 10https://gerrit.wikimedia.org/r/1099768 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [18:49:03] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4612/co" [puppet] - 10https://gerrit.wikimedia.org/r/1099769 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [18:50:13] MichaelG_WMF: how did your tests go? 
[18:50:22] (03CR) 10Brouberol: [C:03+2] Airflow: add comments explaining the external services rationales [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099735 (https://phabricator.wikimedia.org/T377926) (owner: 10Brouberol) [18:50:37] (03CR) 10Dzahn: tftpboot: squash puppetserver log warning. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1073531 (https://phabricator.wikimedia.org/T374885) (owner: 10JHathaway) [18:53:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [18:55:17] @urbanecm it works, and I can confirm the issue with it redirecting to desktop instead of mobile [18:55:34] MichaelG_WMF: thanks! good enough for now, i guess :) [18:55:54] okay, /me afk now (i'll be back for the next window if needed) [18:56:03] (03CR) 10Ssingh: haproxy: Remove RSA certificate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1099769 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [18:56:42] (03PS2) 10Brouberol: postgresql-airflow-wmde: add helmfiles and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099196 (https://phabricator.wikimedia.org/T380613) [18:56:43] (03PS2) 10Brouberol: airflow-wmde: point to the cloudnative-pg cluster in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099197 (https://phabricator.wikimedia.org/T380613) [18:57:14] (03CR) 10CDobbins: "These are the biggest changes in the diff that I noticed:" [dns] - 10https://gerrit.wikimedia.org/r/1097521 (owner: 10CDobbins) [18:58:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [18:58:52] (03PS2) 10Ssingh: sre.roll-restart-reboot-wikimedia-dns: update aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/1099764 [18:59:07] (03CR) 10Brouberol: [C:03+1] "👍 Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1099190 (owner: 10Muehlenhoff) [18:59:22] (03PS1) 10Xcollazo: Revert "[dumps] - Categorise labswiki (wikitech) as a big wiki" [puppet] - 10https://gerrit.wikimedia.org/r/1099771 (https://phabricator.wikimedia.org/T380729) [19:00:08] (03CR) 10Brouberol: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1099195 (owner: 10Muehlenhoff) [19:01:16] !log volans@cumin1002 START - Cookbook sre.hosts.provision for host sretest1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [19:01:31] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10372959 (10VRiley-WMF) [19:02:23] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1099720 (https://phabricator.wikimedia.org/T355837) (owner: 10Krinkle) [19:03:13] (03PS3) 10Krinkle: webperf: set statsv.py --statsd to statsd.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1099720 (https://phabricator.wikimedia.org/T355837) [19:03:24] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1099720 (https://phabricator.wikimedia.org/T355837) (owner: 10Krinkle) [19:06:40] (03CR) 10Ssingh: "Thank you volans and my bad for not rebasing but I thought I did :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1099764 (owner: 10Ssingh) [19:07:06] !log disable puppet on A:lvs to finish rolling out CR 1093958: T358260 [19:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:08] T358260: Disable acceptance of IPv6 router-advertisement on non-default LVS interface - 
https://phabricator.wikimedia.org/T358260 [19:09:25] (03CR) 10Ssingh: [C:03+2] LVS: enable do_ipv6_ra_primary in all sites [puppet] - 10https://gerrit.wikimedia.org/r/1093958 (https://phabricator.wikimedia.org/T358260) (owner: 10Ssingh) [19:09:27] FIRING: [6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:09:56] (03PS1) 10Volans: sre.hosts.provision: fix disable Intel SGX on Dell [cookbooks] - 10https://gerrit.wikimedia.org/r/1099772 [19:09:58] (03PS1) 10Ebernhardson: cirrus: Configure MLR buckets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099773 (https://phabricator.wikimedia.org/T377128) [19:11:13] jouncebot: nowandnext [19:11:13] No deployments scheduled for the next 1 hour(s) and 48 minute(s) [19:11:13] In 1 hour(s) and 48 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241202T2100) [19:11:27] !log dancy@deploy2002 Installing scap version "4.131.0" for 207 hosts [19:13:57] !log rebooting lvs3010 to test CR 1093958 [19:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:51] !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs3010.esams.wmnet [19:15:59] !log dancy@deploy2002 Installation of scap version "4.131.0" completed for 207 hosts [19:16:03] (03PS2) 10BCornwall: icinga: Remove RSA cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1099768 (https://phabricator.wikimedia.org/T370837) [19:16:03] (03PS2) 10BCornwall: haproxy: Remove RSA certificate support [puppet] - 10https://gerrit.wikimedia.org/r/1099769 (https://phabricator.wikimedia.org/T370837) [19:16:14] (03CR) 10BCornwall: haproxy: Remove RSA certificate support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1099769 
(https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [19:17:09] (03CR) 10BCornwall: "Technically, one could revisit the underlying `check_ssl` perl script and remove the DSS/RSA checks but I'd rather leave that lest a drago" [puppet] - 10https://gerrit.wikimedia.org/r/1099768 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [19:17:14] PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:17:20] ^ expected [19:17:23] lvs3010 reboot [19:19:36] (03CR) 10Volans: [C:03+2] sre.hosts.provision: fix disable Intel SGX on Dell [cookbooks] - 10https://gerrit.wikimedia.org/r/1099772 (owner: 10Volans) [19:20:14] RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:20:19] !log volans@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [19:21:28] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3010.esams.wmnet [19:22:16] !log volans@cumin1002 START - Cookbook sre.hosts.provision for host sretest1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [19:22:26] (03CR) 10Btullis: [C:03+1] postgresql-airflow-wmde: add helmfiles and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099196 (https://phabricator.wikimedia.org/T380613) (owner: 10Brouberol) [19:22:59] (03CR) 10Btullis: [C:03+1] airflow-wmde: point to the cloudnative-pg cluster in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099197 (https://phabricator.wikimedia.org/T380613) (owner: 10Brouberol) [19:22:59] !log volans@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
sretest1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [19:28:30] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4613/console" [puppet] - 10https://gerrit.wikimedia.org/r/1099768 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [19:28:47] 06SRE, 06Traffic, 13Patch-For-Review: Disable acceptance of IPv6 router-advertisement on non-default LVS interface - https://phabricator.wikimedia.org/T358260#10373105 (10cmooney) 05Open→03Resolved a:03cmooney Happy we managed to get this sorted in the end and have the right settings on our LVS now... [19:28:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099750 (https://phabricator.wikimedia.org/T381295) (owner: 10DLynch) [19:29:09] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4614/co" [puppet] - 10https://gerrit.wikimedia.org/r/1099769 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [19:30:43] (03CR) 10Ssingh: [C:03+1] "LGTM! Glad to see this closed." 
[puppet] - 10https://gerrit.wikimedia.org/r/1099769 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [19:40:55] (03CR) 10Ssingh: icinga: Remove RSA cert monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1099768 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [19:42:44] (03PS1) 10BCornwall: icinga: Remove unused check_ssl_unified config [puppet] - 10https://gerrit.wikimedia.org/r/1099782 [19:43:52] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10373213 (10MoritzMuehlenhoff) It's in /home/jmm/import on apt1002 (but not yet imported since untested) [19:44:13] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4615/co" [puppet] - 10https://gerrit.wikimedia.org/r/1099782 (owner: 10BCornwall) [19:51:42] PROBLEM - statsv Varnishkafka log producer on cp3069 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:51:50] PROBLEM - Webrequests Varnishkafka log producer on cp3069 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:52:12] PROBLEM - eventlogging Varnishkafka log producer on cp3069 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:52:25] (03PS3) 10BCornwall: icinga: Remove RSA cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1099768 (https://phabricator.wikimedia.org/T370837) [19:52:25] (03PS3) 10BCornwall: haproxy: Remove RSA certificate support [puppet] - 10https://gerrit.wikimedia.org/r/1099769 
(https://phabricator.wikimedia.org/T370837) [19:52:25] (03PS2) 10BCornwall: icinga: Remove unused check_ssl_unified config [puppet] - 10https://gerrit.wikimedia.org/r/1099782 [19:52:32] huh [19:55:54] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=cp3069.esams.wmnet [reason: checking icinga alerts] [19:56:21] (03PS1) 10Cwhite: prometheus-k8s: drop high cardinality labels [puppet] - 10https://gerrit.wikimedia.org/r/1099784 (https://phabricator.wikimedia.org/T381317) [19:58:44] RECOVERY - eventlogging Varnishkafka log producer on cp3069 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:59:57] (03CR) 10BCornwall: icinga: Remove RSA cert monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1099768 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [20:01:44] RECOVERY - statsv Varnishkafka log producer on cp3069 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [20:02:44] RECOVERY - Webrequests Varnishkafka log producer on cp3069 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [20:10:20] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp3069.esams.wmnet [reason: done: checking icinga alerts] [20:12:51] (03CR) 10Ssingh: [C:03+1] "+1 if PCC passes; probably better to check for PS3." 
[puppet] - 10https://gerrit.wikimedia.org/r/1099768 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [20:13:20] (03PS3) 10BCornwall: icinga: Remove unused check_ssl_unified config [puppet] - 10https://gerrit.wikimedia.org/r/1099782 [20:15:35] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T380487#10373389 (10KFrancis) Hi all, please send me Suzanne Wood's email address and I will process the NDA. Thanks! [20:15:57] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4616/console" [puppet] - 10https://gerrit.wikimedia.org/r/1099768 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [20:18:10] (03PS1) 10Jdrewniak: Rerunning Web browser extension survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099790 (https://phabricator.wikimedia.org/T380778) [20:20:11] (03PS2) 10JHathaway: vrts: Update mail alias generation script to bail on too many changes [puppet] - 10https://gerrit.wikimedia.org/r/1097556 (https://phabricator.wikimedia.org/T380009) (owner: 10EoghanGaffney) [20:22:11] (03CR) 10Gmodena: [C:03+1] "Neat!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098885 (owner: 10DCausse) [20:22:25] (03CR) 10CI reject: [V:04-1] vrts: Update mail alias generation script to bail on too many changes [puppet] - 10https://gerrit.wikimedia.org/r/1097556 (https://phabricator.wikimedia.org/T380009) (owner: 10EoghanGaffney) [20:25:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381230#10373420 (10phaultfinder) [20:27:06] (03CR) 10Bking: [C:03+2] Revert "[dumps] - Categorise labswiki (wikitech) as a big wiki" [puppet] - 10https://gerrit.wikimedia.org/r/1099771 (https://phabricator.wikimedia.org/T380729) (owner: 10Xcollazo) [20:28:02] (03CR) 10Gmodena: [C:03+1] "LGTM." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099727 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [20:28:39] (03PS3) 10JHathaway: vrts: Update mail alias generation script to bail on too many changes [puppet] - 10https://gerrit.wikimedia.org/r/1097556 (https://phabricator.wikimedia.org/T380009) (owner: 10EoghanGaffney) [20:30:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 20.47% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:30:25] (03PS1) 10BCornwall: varnish: Remove RSA deprecation warning page [puppet] - 10https://gerrit.wikimedia.org/r/1099791 (https://phabricator.wikimedia.org/T370837) [20:30:57] (03PS1) 10CDobbins: lvs: add prometheus::node_ferm_mss to ipip.pp [puppet] - 10https://gerrit.wikimedia.org/r/1099792 (https://phabricator.wikimedia.org/T365689) [20:31:35] (03CR) 10CI reject: [V:04-1] lvs: add prometheus::node_ferm_mss to ipip.pp [puppet] - 10https://gerrit.wikimedia.org/r/1099792 (https://phabricator.wikimedia.org/T365689) (owner: 10CDobbins) [20:32:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int (k8s) 801.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:34:02] (03CR) 10Ssingh: [C:03+1] "(if tests pass, looks good!)" [puppet] - 10https://gerrit.wikimedia.org/r/1099791 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [20:35:00] (03CR) 10BCornwall: [V:03+2] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1099791 
(https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [20:35:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 18.33% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:36:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 13.45% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:36:32] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 2168668672 and 111 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:36:51] (03CR) 10JHathaway: [C:03+1] partman: add recipe for UEFI 4-disk SW RAID-10 [puppet] - 10https://gerrit.wikimedia.org/r/1099740 (https://phabricator.wikimedia.org/T373519) (owner: 10Bking) [20:38:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083434 (https://phabricator.wikimedia.org/T377531) (owner: 10SD0001) [20:38:49] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381230#10373481 (10phaultfinder) [20:38:49] (03PS1) 10Urbanecm: fix(surfacing): don't redirect to desktop [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099793 [20:39:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 02 UTC late backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099793 (owner: 10Urbanecm) [20:39:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099746 (https://phabricator.wikimedia.org/T380659) (owner: 10Michael Große) [20:39:28] (03CR) 10JHathaway: "modified the patch a bit, could you give it a review Arnold?" [puppet] - 10https://gerrit.wikimedia.org/r/1097556 (https://phabricator.wikimedia.org/T380009) (owner: 10EoghanGaffney) [20:40:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 22.71% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:41:32] FIRING: Primary outbound port utilisation over 80% #page: Alert for device asw2-c-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [20:41:32] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [20:42:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 23.01% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy 
[20:42:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int (k8s) 908.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:43:59] (03PS2) 10CDobbins: lvs: add prometheus::node_ferm_mss to ipip.pp [puppet] - 10https://gerrit.wikimedia.org/r/1099792 (https://phabricator.wikimedia.org/T365689) [20:44:34] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 896 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:46:32] FIRING: [2x] Primary outbound port utilisation over 80% #page: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [20:47:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 23.01% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:55:04] (03PS1) 10Cwhite: webperf: disable statsd-exporter relaying flag [puppet] - 10https://gerrit.wikimedia.org/r/1099796 (https://phabricator.wikimedia.org/T355837) [20:56:32] FIRING: [2x] Primary outbound port utilisation over 80% #page: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [20:58:17] (03CR) 10Michael Große: [C:03+1] fix(surfacing): don't redirect to desktop [extensions/GrowthExperiments] 
(wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099793 (owner: 10Urbanecm) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241202T2100). [21:00:05] Nemoralis, kemayo, sd0001, and urbanecm: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:11] o/ [21:00:16] o/ [21:00:18] i can deploy today [21:00:24] o/ [21:00:28] (03CR) 10Urbanecm: [C:03+2] fix(surfacing): don't redirect to desktop [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099793 (owner: 10Urbanecm) [21:01:02] Nemoralis: did you please seek a +1 from the Editing team? [21:01:29] or Kemayo if you're willing to +1 VisualEditor by default on Indonesian Wikiquote (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1099362) now :)) [21:01:32] RESOLVED: Primary outbound port utilisation over 80% #page: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [21:01:32] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [21:01:34] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1445946640 and 73 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:01:44] urbanecm: do we need that? I have done this before [21:02:14] https://phabricator.wikimedia.org/T369342 [21:02:55] I think we're fine with it. 
It looks like it was requested, and we don't have any objections to VE on wikiquotes in general. [21:03:03] (03CR) 10DLynch: [C:03+1] Enable VisualEditor by default on Indonesian Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099362 (https://phabricator.wikimedia.org/T381214) (owner: 10NMW03) [21:03:48] Nemoralis: they at least used to want to take a look beforehand, just in case [21:04:02] but in this case, Kemayo +1'ed it, so let's go ahead [21:04:04] (thanks David!) [21:04:19] that was for enabling VE on talk pages, iirc [21:04:21] (03CR) 10Urbanecm: [C:03+2] votewiki, testwiki: add securepoll-edit-poll to electionadmin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083434 (https://phabricator.wikimedia.org/T377531) (owner: 10SD0001) [21:04:30] (03CR) 10Urbanecm: [C:03+2] Enable VisualEditor by default on Indonesian Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099362 (https://phabricator.wikimedia.org/T381214) (owner: 10NMW03) [21:04:34] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3888 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:04:53] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [21:05:07] (03Merged) 10jenkins-bot: votewiki, testwiki: add securepoll-edit-poll to electionadmin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083434 (https://phabricator.wikimedia.org/T377531) (owner: 10SD0001) [21:05:11] (03Merged) 10jenkins-bot: Enable VisualEditor by default on Indonesian Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099362 (https://phabricator.wikimedia.org/T381214) (owner: 10NMW03) [21:05:15] Generally we want to keep an eye on it in case someone wants to turn it on somewhere that it won't work well. Talk pages are a good example, indeed. For wikiquotes, it's fine. 
[21:06:30] I remember this https://phabricator.wikimedia.org/T359815#9622244 [21:06:41] and this https://phabricator.wikimedia.org/T359815#10017307 [21:07:09] (03CR) 10Urbanecm: [C:03+2] cawiki: stop Flow being the default for some talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099750 (https://phabricator.wikimedia.org/T381295) (owner: 10DLynch) [21:07:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099750 (https://phabricator.wikimedia.org/T381295) (owner: 10DLynch) [21:07:54] (03Merged) 10jenkins-bot: cawiki: stop Flow being the default for some talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099750 (https://phabricator.wikimedia.org/T381295) (owner: 10DLynch) [21:08:10] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1099362|Enable VisualEditor by default on Indonesian Wikiquote (T381214)]], [[gerrit:1083434|votewiki, testwiki: add securepoll-edit-poll to electionadmin (T377531)]], [[gerrit:1099750|cawiki: stop Flow being the default for some talk namespaces (T381295)]] [21:08:25] T381214: Enable VisualEditor by default in Indonesian Wikiquote - https://phabricator.wikimedia.org/T381214 [21:08:25] T377531: Ability to edit polls should be unbundled from securepoll-view-voter-pii / electionadmin - https://phabricator.wikimedia.org/T377531 [21:08:25] T381295: Disabling Flow in cawiki 'Viquiprojecte Discussió' namespace - https://phabricator.wikimedia.org/T381295 [21:09:26] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt ms-be1083 - vriley@cumin1002" [21:09:31] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt ms-be1083 - vriley@cumin1002" [21:09:31] !log vriley@cumin1002 END (PASS) - 
Cookbook sre.dns.netbox (exit_code=0) [21:10:45] (03CR) 10Krinkle: [C:03+1] webperf: disable statsd-exporter relaying flag [puppet] - 10https://gerrit.wikimedia.org/r/1099796 (https://phabricator.wikimedia.org/T355837) (owner: 10Cwhite) [21:10:52] Kemayo: by any chance, do you have any thoughts re T381291? [21:10:53] T381291: Using `require` in shell.php fails in production (Error Class 'PhpParser\Node\Scalar\LNumber' not found) - https://phabricator.wikimedia.org/T381291 [21:12:37] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:12:37] !log urbanecm@deploy2002 kemayo, urbanecm, nmw03, sd: Backport for [[gerrit:1099362|Enable VisualEditor by default on Indonesian Wikiquote (T381214)]], [[gerrit:1083434|votewiki, testwiki: add securepoll-edit-poll to electionadmin (T377531)]], [[gerrit:1099750|cawiki: stop Flow being the default for some talk namespaces (T381295)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:12:55] Kemayo: Nemoralis: sd0001: your patches are available for testing at mwdebug, please take a look [21:13:28] LGTM [21:13:56] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10373612 (10Dzahn) I installed the package manually on the passive host, lists2001. [21:13:59] ty [21:14:04] LGTM [21:14:21] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:14:42] ty [21:14:54] urbanecm: looks fine [21:15:02] ty, proceeding [21:15:06] !log urbanecm@deploy2002 kemayo, urbanecm, nmw03, sd: Continuing with sync [21:16:03] urbanecm: No ideas about the shell.php issue, though. 
I'd have thought it was some sort of dependencies issue, but nothing has touched that particular one since Nov 10th, and I'd think this would have been noticed before now if that was it. [21:16:26] (It *was* a whole major-version update then, though. 4.10.2 to 5.3.1. So...) [21:16:49] See: T379508 [21:16:49] T379508: Upgrade nikic/php-parser to ^5 - https://phabricator.wikimedia.org/T379508 [21:17:00] yeah... what is puzzling me is that this is only broken in production. locally, works perfectly fine, and i double checked my versions [21:18:39] (CR) BCornwall: "Indeed. Once Iae8ce48dd7addc8c0d13c85488964173a1033f23 is in these will not be proposed." [puppet] - https://gerrit.wikimedia.org/r/1099313 (owner: Ncmonitor) [21:18:42] (Abandoned) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1099313 (owner: Ncmonitor) [21:18:48] Some transitory glitch in composer update that's now passed, perhaps? I'm not familiar with how we rebuild vendor in deployments. [21:19:16] I know we don't commit composer.lock, so there's plenty of room for there to be different dependency-of-dependency versions in play.
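(Editor's sketch, not part of the channel log.) The "double checked my versions" step above could be done mechanically by comparing the package lists Composer records for two environments. The following is a minimal sketch under stated assumptions: it assumes each environment has a Composer-generated `vendor/composer/installed.json`, and the helper names `load_installed` and `diff_versions` and the package `example/parser` are hypothetical, chosen only for illustration. It is not the procedure anyone in the channel used.

```python
import json


def load_installed(installed_json: str) -> dict[str, str]:
    """Map package name -> version from the text of a Composer installed.json."""
    data = json.loads(installed_json)
    # Composer 2 wraps the package list in {"packages": [...]};
    # Composer 1 wrote a bare list. Accept both shapes.
    packages = data["packages"] if isinstance(data, dict) else data
    return {pkg["name"]: pkg.get("version", "unknown") for pkg in packages}


def diff_versions(local: dict[str, str], prod: dict[str, str]) -> dict[str, tuple]:
    """Return {name: (local_version, prod_version)} for every mismatch.

    A package present on only one side shows up with None on the other.
    """
    names = sorted(set(local) | set(prod))
    return {
        name: (local.get(name), prod.get(name))
        for name in names
        if local.get(name) != prod.get(name)
    }


if __name__ == "__main__":
    # Hypothetical inputs; in practice these strings would be read from the
    # installed.json files of the two environments being compared.
    local_doc = '{"packages": [{"name": "example/parser", "version": "2.0.0"}]}'
    prod_doc = '{"packages": [{"name": "example/parser", "version": "1.9.0"}]}'
    for name, (here, there) in diff_versions(
        load_installed(local_doc), load_installed(prod_doc)
    ).items():
        print(f"{name}: local={here} prod={there}")
```

An empty result would suggest the installed package versions match and the difference lies elsewhere (autoloading, configuration, or how the shell is launched), though that conclusion is only as good as the two files being compared.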
[21:19:39] (03Merged) 10jenkins-bot: fix(surfacing): don't redirect to desktop [extensions/GrowthExperiments] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099793 (owner: 10Urbanecm) [21:19:39] Kemayo: we basically commit `vendor` to git, and use that in prod [21:20:05] see https://github.com/wikimedia/mediawiki-vendor [21:21:02] (03PS2) 10Michael Große: testwiki: no growth experiment anymore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099746 (https://phabricator.wikimedia.org/T380659) [21:21:05] (03CR) 10Urbanecm: [C:03+2] testwiki: no growth experiment anymore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099746 (https://phabricator.wikimedia.org/T380659) (owner: 10Michael Große) [21:21:38] MichaelG_WMF: fyi, if still around [21:21:45] * MichaelG_WMF looks [21:21:47] 06SRE, 06serviceops, 10Wikimedia-Site-requests, 13Patch-For-Review: Change $wgMaxArticleSize limit from byte-based to character-based - https://phabricator.wikimedia.org/T275319#10373647 (10Aklapper) @POMI-OLIYN: See the "Details" box above linking to the last comments in the proposed patches [21:21:50] (03Merged) 10jenkins-bot: testwiki: no growth experiment anymore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099746 (https://phabricator.wikimedia.org/T380659) (owner: 10Michael Große) [21:21:50] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1099362|Enable VisualEditor by default on Indonesian Wikiquote (T381214)]], [[gerrit:1083434|votewiki, testwiki: add securepoll-edit-poll to electionadmin (T377531)]], [[gerrit:1099750|cawiki: stop Flow being the default for some talk namespaces (T381295)]] (duration: 13m 40s) [21:21:55] T381214: Enable VisualEditor by default in Indonesian Wikiquote - https://phabricator.wikimedia.org/T381214 [21:21:55] T377531: Ability to edit polls should be unbundled from securepoll-view-voter-pii / electionadmin - https://phabricator.wikimedia.org/T377531 [21:21:56] T381295: Disabling Flow in cawiki 
'Viquiprojecte Discussió' namespace - https://phabricator.wikimedia.org/T381295 [21:22:06] thanks urbanecm [21:22:16] Kemayo: anyway, live in prod now :) [21:22:19] sd0001: also live [21:22:21] np Nemoralis! [21:22:38] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1099746|testwiki: no growth experiment anymore (T380659)]], [[gerrit:1099793|fix(surfacing): don't redirect to desktop]] [21:22:40] T380659: Surfacing "Add a link" Structured Tasks: Test Wikipedia release (FY24/25 WE1.2.6) - https://phabricator.wikimedia.org/T380659 [21:22:48] FIRING: PuppetFailure: Puppet has failed on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [21:22:49] urbanecm: thanks! [21:22:52] any time [21:23:11] urbanecm: though I'm not sure how to verify the changes disabling the old experiment [21:23:26] (probably not really necessary either) [21:23:43] MichaelG_WMF: if you never fiddled with `ge.utils.setUserVariant`, then add link should suddenly re-appear [21:23:56] but yeah, the redirect's more important [21:27:36] !log urbanecm@deploy2002 migr, urbanecm: Backport for [[gerrit:1099746|testwiki: no growth experiment anymore (T380659)]], [[gerrit:1099793|fix(surfacing): don't redirect to desktop]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:27:43] MichaelG_WMF: available at debug now [21:27:52] * MichaelG_WMF is looking [21:28:14] i verified the experiment is disabled [21:28:43] and redirect also looks good [21:28:50] waiting on MichaelG_WMF's review [21:29:14] seems to work fine for me! [21:29:20] great! 
let's sync then [21:29:23] !log urbanecm@deploy2002 migr, urbanecm: Continuing with sync [21:29:45] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:35:00] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:35:06] (03PS3) 10BCornwall: ncmonitor: Add "main" WMF domains to ignorelist [puppet] - 10https://gerrit.wikimedia.org/r/1092359 (https://phabricator.wikimedia.org/T374640) [21:35:06] (03PS3) 10BCornwall: ncmonitor: Add pywikibot.org to domain ignorelist [puppet] - 10https://gerrit.wikimedia.org/r/1092362 [21:35:06] (03PS5) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1092930 (owner: 10Ncmonitor) [21:36:00] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1099746|testwiki: no growth experiment anymore (T380659)]], [[gerrit:1099793|fix(surfacing): don't redirect to desktop]] (duration: 13m 22s) [21:36:04] T380659: Surfacing "Add a link" Structured Tasks: Test Wikipedia release (FY24/25 WE1.2.6) - https://phabricator.wikimedia.org/T380659 [21:36:07] MichaelG_WMF: and live! [21:36:09] anything else [21:36:10] ? [21:36:36] urbanecm: frwiki, arzwiki and eswiki? [21:36:46] or is that also already live? [21:36:57] MichaelG_WMF: not yet, but i thought we wanted to give Kirsten a chance to review? [21:38:22] * MichaelG_WMF might have missed a message here or two [21:38:27] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4618/co" [puppet] - 10https://gerrit.wikimedia.org/r/1092930 (owner: 10Ncmonitor) [21:38:42] mh, gotcha. I'll ping Kirsten. Though this will maybe push our "release" to tomorrow [21:39:10] true, but that's unlikely to matter much. 
[21:39:35] (03CR) 10BCornwall: "Good idea, thanks! I added that list. The others are indeed not part of the canonical domains but they're still returned by MarkMonitor." [puppet] - 10https://gerrit.wikimedia.org/r/1092359 (https://phabricator.wikimedia.org/T374640) (owner: 10BCornwall) [21:40:16] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:40:27] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4619/co" [puppet] - 10https://gerrit.wikimedia.org/r/1092359 (https://phabricator.wikimedia.org/T374640) (owner: 10BCornwall) [21:41:44] (03CR) 10BCornwall: [C:03+2] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1092359 (https://phabricator.wikimedia.org/T374640) (owner: 10BCornwall) [21:43:07] (03CR) 10BCornwall: [C:03+2] ncmonitor: Add pywikibot.org to domain ignorelist [puppet] - 10https://gerrit.wikimedia.org/r/1092362 (owner: 10BCornwall) [21:43:20] (03CR) 10BCornwall: [V:03+1 C:03+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1092930 (owner: 10Ncmonitor) [21:43:37] (03CR) 10BCornwall: [C:03+2] ncmonitor: Ignore pay-for-edit/scam domains [puppet] - 10https://gerrit.wikimedia.org/r/1092943 (owner: 10BCornwall) [21:45:31] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:49:15] (03PS4) 10BCornwall: ncmonitor: Add pywikibot.org to domain ignorelist [puppet] - 10https://gerrit.wikimedia.org/r/1092362 [21:49:15] (03PS4) 10BCornwall: ncmonitor: Ignore pay-for-edit/scam domains [puppet] - 10https://gerrit.wikimedia.org/r/1092943 [21:49:47] (03PS6) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1092930 (owner: 10Ncmonitor) [21:50:11] (03CR) 10BCornwall: 
[C:03+2] ncmonitor: Ignore pay-for-edit/scam domains [puppet] - 10https://gerrit.wikimedia.org/r/1092943 (owner: 10BCornwall) [21:51:23] (03CR) 10BCornwall: [V:03+2 C:03+2] ncmonitor: Add pywikibot.org to domain ignorelist [puppet] - 10https://gerrit.wikimedia.org/r/1092362 (owner: 10BCornwall) [21:51:54] (03CR) 10BCornwall: [V:03+2 C:03+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1092930 (owner: 10Ncmonitor) [21:55:13] (03CR) 10BCornwall: [C:04-2] Add zh to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1098905 (https://phabricator.wikimedia.org/T380119) (owner: 10Gerrit maintenance bot) [21:55:28] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:55:34] (03PS6) 10Volans: sre.switchdc.databases: use mysql native methods [cookbooks] - 10https://gerrit.wikimedia.org/r/1087860 [21:55:34] (03PS5) 10Volans: Adapt to new Spicerack API renaming mysql_legacy [cookbooks] - 10https://gerrit.wikimedia.org/r/1087861 [21:55:34] (03PS3) 10Volans: cookbooks.sre.switchdc.databases: improve desc [cookbooks] - 10https://gerrit.wikimedia.org/r/1092787 [21:56:41] (03CR) 10Volans: "mypy failure in CI is expected as the fix in I2d5bc3e26c537acc14e282d9ad23c271c2dba5cd is included in spicerack 9.0.0. Will be fixed by th" [cookbooks] - 10https://gerrit.wikimedia.org/r/1087860 (owner: 10Volans) [22:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241202T2200). [22:01:43] (03CR) 10CI reject: [V:04-1] sre.switchdc.databases: use mysql native methods [cookbooks] - 10https://gerrit.wikimedia.org/r/1087860 (owner: 10Volans) [22:04:34] (03CR) 10Kgraessle: [C:03+1] "Looks good to me, thanks for your work." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098574 (https://phabricator.wikimedia.org/T380846) (owner: 10Chlod Alejandro) [22:05:27] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:05:37] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:16:10] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:16:13] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:17:07] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:18:01] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:18:11] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:20:19] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:20:30] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:21:21] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:21:31] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy 
FORCE_RESTART [22:24:55] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:25:05] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:26:59] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:27:02] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:27:33] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:27:43] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1083.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:34:25] (03PS3) 10Bernard Wang: Reenable non-UI experiment quick survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091749 (https://phabricator.wikimedia.org/T379241) [22:34:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381230#10373883 (10phaultfinder) [22:34:38] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4620/co" [puppet] - 10https://gerrit.wikimedia.org/r/1094705 (https://phabricator.wikimedia.org/T380667) (owner: 10Pppery) [22:36:22] (03PS4) 10Bernard Wang: Reenable non-UI experiment quick survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091749 (https://phabricator.wikimedia.org/T379241) [22:36:28] (03CR) 10Bernard Wang: Reenable non-UI experiment quick survey (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091749 (https://phabricator.wikimedia.org/T379241) (owner: 10Bernard Wang) [22:37:08] (03CR) 10BCornwall: [C:03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1094705 (https://phabricator.wikimedia.org/T380667) (owner: 10Pppery) [22:37:48] RESOLVED: PuppetFailure: Puppet has failed on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [22:39:27] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:46:27] 06SRE-OnFire: Discover Phabricator changes needed for using Phabricator as incident response document - https://phabricator.wikimedia.org/T349120#10373944 (10BCornwall) a:05BCornwall→03None [22:55:46] (03CR) 10Jdlrobson: [C:03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091749 (https://phabricator.wikimedia.org/T379241) (owner: 10Bernard Wang) [23:05:53] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Heavy usage (possible scraping) of ceb.wikipedia.org from AS54801 (Dec 2 2024) - https://phabricator.wikimedia.org/T381347 (10cmooney) 03NEW p:05Triage→03High [23:06:26] (03CR) 10Cwhite: "Hi folks. Does this patch look ready to go?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081460 (owner: 10Cwhite) [23:06:28] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Heavy usage (possible scraping) of ceb.wikipedia.org from AS54801 (Dec 2 2024) - https://phabricator.wikimedia.org/T381347#10374057 (10cmooney) [23:09:27] FIRING: [6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:27:55] FIRING: MaxConntrack: Max conntrack at 94.29% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [23:30:26] (03PS11) 10BryanDavis: php: Allow provisioning MediaWiki with PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) [23:30:26] (03PS2) 10BryanDavis: deployment-prep: Add PHP 8.1 appservers [puppet] - 10https://gerrit.wikimedia.org/r/1098647 (https://phabricator.wikimedia.org/T378752) [23:32:55] RESOLVED: MaxConntrack: Max conntrack at 94.29% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [23:34:33] (03Abandoned) 10BryanDavis: deployment-prep: Add PHP 8.1 appservers [puppet] - 10https://gerrit.wikimedia.org/r/1098647 (https://phabricator.wikimedia.org/T378752) (owner: 10BryanDavis) [23:38:59] (03PS1) 10Cwhite: webperf: set statsd exporter timer type to histogram [puppet] - 10https://gerrit.wikimedia.org/r/1099821 (https://phabricator.wikimedia.org/T355837) [23:41:10] (03CR) 10BryanDavis: "mutante: adding you as a reviewer because you were the person who did the Ic48b86117ff9535cf2ca522673492461f7f1f708 for php8.2 that 
this r" [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) (owner: 10BryanDavis) [23:44:59] (03PS1) 10Cwhite: prometheus: restart statsd-exporter on config change [puppet] - 10https://gerrit.wikimedia.org/r/1099822 (https://phabricator.wikimedia.org/T355837) [23:46:41] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [23:49:10] (03CR) 10Cwhite: "I propose we put I3b82901b3b263bec79589bf42cbdb88ca179e899 before this one so that statsd-exporter is restarted by the config change to pr" [puppet] - 10https://gerrit.wikimedia.org/r/1099720 (https://phabricator.wikimedia.org/T355837) (owner: 10Krinkle) [23:50:11] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be - jclark@cumin1002" [23:50:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be - jclark@cumin1002" [23:50:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:50:54] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1091.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:58:20] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1091.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART