[00:07:36] (PS1) Jdlrobson: WIP: Deploy dark mode to all logged-in users on the Vector2022 skin [mediawiki-config] - https://gerrit.wikimedia.org/r/1050082 (https://phabricator.wikimedia.org/T367150)
[00:07:38] (PS1) Jdlrobson: Enable dark mode for anonymous users on ready wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150)
[00:08:17] (CR) CI reject: [V:-1] Enable dark mode for anonymous users on ready wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150) (owner: Jdlrobson)
[00:40:03] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[00:40:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P65500 and previous config saved to /var/cache/conftool/dbconfig/20240627-004042-marostegui.json
[00:54:14] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1105-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[00:55:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T367856)', diff saved to https://phabricator.wikimedia.org/P65501 and previous config saved to /var/cache/conftool/dbconfig/20240627-005549-marostegui.json
[00:55:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance
[00:55:56] T367856: Cleanup revision table schema -
https://phabricator.wikimedia.org/T367856
[00:56:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance
[00:56:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2175 (T367856)', diff saved to https://phabricator.wikimedia.org/P65502 and previous config saved to /var/cache/conftool/dbconfig/20240627-005613-marostegui.json
[01:04:14] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[02:06:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:55:00] (PS2) RLazarus: deployment_server: Add a mwscript-k8s cleanup script [puppet] - https://gerrit.wikimedia.org/r/1037868 (https://phabricator.wikimedia.org/T341553)
[03:00:24] (CR) RLazarus: deployment_server: Add a mwscript-k8s cleanup script (4 comments) [puppet] - https://gerrit.wikimedia.org/r/1037868 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)
[03:06:25] RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:10:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T364069)', diff saved to https://phabricator.wikimedia.org/P65503 and previous config saved to /var/cache/conftool/dbconfig/20240627-031023-marostegui.json
[03:10:39] T364069: Rebuild pagelinks tables -
https://phabricator.wikimedia.org/T364069
[03:25:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P65504 and previous config saved to /var/cache/conftool/dbconfig/20240627-032530-marostegui.json
[03:40:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P65505 and previous config saved to /var/cache/conftool/dbconfig/20240627-034037-marostegui.json
[03:55:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T364069)', diff saved to https://phabricator.wikimedia.org/P65506 and previous config saved to /var/cache/conftool/dbconfig/20240627-035544-marostegui.json
[03:55:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[03:55:50] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[03:55:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[04:03:59] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS1257/IPv4: Connect - Tele2, AS1257/IPv6: Connect - Tele2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:07:45] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1049886 (https://phabricator.wikimedia.org/T368237) (owner: Dreamrimmer)
[04:24:13] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 185.15.59.129, interfaces up: 67, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:32:17] RECOVERY - Router interfaces on cr2-esams is OK: OK: host
185.15.59.129, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:39:05] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 43, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:54:14] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1105-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[04:59:35] (CR) Ryan Kemper: [C:+1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1049873 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[05:04:14] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[05:05:47] (CR) Ryan Kemper: [C:+2] sre.hadoop.reboot-workers: use ceil not floor [cookbooks] - https://gerrit.wikimedia.org/r/1046780 (https://phabricator.wikimedia.org/T367592) (owner: Ryan Kemper)
[05:26:25] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:28:15] (PS3) Ryan Kemper: query_service: Add Access-Control-Allow-Headers [puppet] - https://gerrit.wikimedia.org/r/1024884 (https://phabricator.wikimedia.org/T362570) (owner: Lucas Werkmeister)
[05:28:18] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1024884
(https://phabricator.wikimedia.org/T362570) (owner: Lucas Werkmeister)
[05:32:23] (CR) Ryan Kemper: [C:+1] "LGTM. We can merge this tomorrow." [puppet] - https://gerrit.wikimedia.org/r/1024884 (https://phabricator.wikimedia.org/T362570) (owner: Lucas Werkmeister)
[05:52:12] (PS8) Ryan Kemper: wdqs: add the query main and scholarly role assignments [puppet] - https://gerrit.wikimedia.org/r/1046123 (https://phabricator.wikimedia.org/T364364) (owner: Stevemunene)
[05:53:13] (PS9) Ryan Kemper: wdqs: add the query main and scholarly role assignments [puppet] - https://gerrit.wikimedia.org/r/1046123 (https://phabricator.wikimedia.org/T364364) (owner: Stevemunene)
[06:00:04] (CR) TrainBranchBot: [C:+2] "Approved by arnaudb@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1049555 (https://phabricator.wikimedia.org/T368401) (owner: Arnaudb)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240627T0600)
[06:00:05] marostegui, Amir1, and arnaudb: gettimeofday() says it's time for Primary database switchover.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240627T0600)
[06:00:41] (Merged) jenkins-bot: mariadb: disable writes on es6 [mediawiki-config] - https://gerrit.wikimedia.org/r/1049555 (https://phabricator.wikimedia.org/T368401) (owner: Arnaudb)
[06:01:30] !log arnaudb@deploy1002 Started scap: Backport for [[gerrit:1049555|mariadb: disable writes on es6 (T368401)]]
[06:01:36] T368401: Switchover es6 master (es1037 -> es1038) - https://phabricator.wikimedia.org/T368401
[06:01:54] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1046123 (https://phabricator.wikimedia.org/T364364) (owner: Stevemunene)
[06:04:03] !log arnaudb@deploy1002 arnaudb: Backport for [[gerrit:1049555|mariadb: disable writes on es6 (T368401)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[06:04:08] !log arnaudb@deploy1002 arnaudb: Continuing with sync
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:31] !log arnaudb@deploy1002 Finished scap: Backport for [[gerrit:1049555|mariadb: disable writes on es6 (T368401)]] (duration: 08m 00s)
[06:09:39] T368401: Switchover es6 master (es1037 -> es1038) - https://phabricator.wikimedia.org/T368401
[06:10:34] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason:
Primary switchover es6 T368401
[06:10:40] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es6 T368401
[06:10:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set es1038 with weight 0 T368401', diff saved to https://phabricator.wikimedia.org/P65507 and previous config saved to /var/cache/conftool/dbconfig/20240627-061055-arnaudb.json
[06:15:18] (PS2) Gerrit maintenance bot: mariadb: Promote es1038 to es6 master [puppet] - https://gerrit.wikimedia.org/r/1049550 (https://phabricator.wikimedia.org/T368401)
[06:15:24] (CR) Arnaudb: [C:+2] mariadb: Promote es1038 to es6 master [puppet] - https://gerrit.wikimedia.org/r/1049550 (https://phabricator.wikimedia.org/T368401) (owner: Gerrit maintenance bot)
[06:15:28] (CR) Arnaudb: [V:+2 C:+2] mariadb: Promote es1038 to es6 master [puppet] - https://gerrit.wikimedia.org/r/1049550 (https://phabricator.wikimedia.org/T368401) (owner: Gerrit maintenance bot)
[06:15:55] !log Starting es6 eqiad failover from es1037 to es1038 - T368401
[06:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:01] T368401: Switchover es6 master (es1037 -> es1038) - https://phabricator.wikimedia.org/T368401
[06:16:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote es1038 to es6 primary T368401', diff saved to https://phabricator.wikimedia.org/P65508 and previous config saved to /var/cache/conftool/dbconfig/20240627-061639-arnaudb.json
[06:16:50] (PS10) Ryan Kemper: wdqs: add main and scholarly role assignments [puppet] - https://gerrit.wikimedia.org/r/1046123 (https://phabricator.wikimedia.org/T364364) (owner: Stevemunene)
[06:20:55] (PS1) Arnaudb: wmnet: Update es6-master alias [dns] - https://gerrit.wikimedia.org/r/1050095 (https://phabricator.wikimedia.org/T368401)
[06:21:04] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1046123
(https://phabricator.wikimedia.org/T364364) (owner: Stevemunene)
[06:21:15] (CR) Arnaudb: [C:+2] wmnet: Update es6-master alias [dns] - https://gerrit.wikimedia.org/r/1050095 (https://phabricator.wikimedia.org/T368401) (owner: Arnaudb)
[06:21:50] (Abandoned) Arnaudb: wmnet: Update es6-master alias [dns] - https://gerrit.wikimedia.org/r/1049551 (https://phabricator.wikimedia.org/T368401) (owner: Gerrit maintenance bot)
[06:23:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'weight es1037 T368401', diff saved to https://phabricator.wikimedia.org/P65509 and previous config saved to /var/cache/conftool/dbconfig/20240627-062338-arnaudb.json
[06:23:44] T368401: Switchover es6 master (es1037 -> es1038) - https://phabricator.wikimedia.org/T368401
[06:24:11] (PS1) Arnaudb: Revert "mariadb: disable writes on es6" [mediawiki-config] - https://gerrit.wikimedia.org/r/1050096
[06:25:22] (CR) Ryan Kemper: wdqs: add main and scholarly role assignments (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1046123 (https://phabricator.wikimedia.org/T364364) (owner: Stevemunene)
[06:29:05] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:29:15] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:29:43] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:30:16] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1050069 (https://phabricator.wikimedia.org/T368260) (owner: Dzahn)
[06:31:27] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1050062 (https://phabricator.wikimedia.org/T367747) (owner: Dzahn)
[06:31:36] (CR) TrainBranchBot: [C:+2] "Approved by
arnaudb@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1050096 (owner: Arnaudb)
[06:32:12] (Merged) jenkins-bot: Revert "mariadb: disable writes on es6" [mediawiki-config] - https://gerrit.wikimedia.org/r/1050096 (owner: Arnaudb)
[06:32:43] !log arnaudb@deploy1002 Started scap: Backport for [[gerrit:1050096|Revert "mariadb: disable writes on es6"]]
[06:35:16] !log arnaudb@deploy1002 arnaudb: Backport for [[gerrit:1050096|Revert "mariadb: disable writes on es6"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[06:35:24] !log arnaudb@deploy1002 arnaudb: Continuing with sync
[06:35:59] (CR) Slyngshede: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1050049 (https://phabricator.wikimedia.org/T367872) (owner: Dzahn)
[06:37:05] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52196 bytes in 4.945 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:37:09] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:37:33] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:39:52] (PS2) KartikMistry: Enable MinT for Wikipedia readers MVP on a set of pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1049898 (https://phabricator.wikimedia.org/T363465)
[06:40:13] (PS2) Slyngshede: R:idp_test: Separate testing environment for CAS 7 [puppet] - https://gerrit.wikimedia.org/r/1049883 (https://phabricator.wikimedia.org/T367487)
[06:40:27] !log arnaudb@deploy1002 Finished scap: Backport for [[gerrit:1050096|Revert "mariadb: disable writes on es6"]] (duration: 07m 43s)
[06:40:31] (CR) Marostegui: "Thanks a lot Scott" [software] - https://gerrit.wikimedia.org/r/1049648 (owner: Scott French)
[06:45:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'weight es1038 T368401', diff saved to https://phabricator.wikimedia.org/P65510 and previous config saved to /var/cache/conftool/dbconfig/20240627-064506-arnaudb.json
[06:45:17] T368401: Switchover es6 master (es1037 -> es1038) - https://phabricator.wikimedia.org/T368401
[06:52:30] (CR) Slyngshede: R:idp_test: Separate testing environment for CAS 7 (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1049883 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[07:00:05] Amir1 and Urbanecm: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240627T0700). Please do the needful.
[07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:18] * kart_ is here
[07:01:48] KCVelaga: We will deploy your patch after I finish first config patch.
[07:02:47] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1049898 (https://phabricator.wikimedia.org/T363465) (owner: KartikMistry)
[07:03:27] (Merged) jenkins-bot: Enable MinT for Wikipedia readers MVP on a set of pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1049898 (https://phabricator.wikimedia.org/T363465) (owner: KartikMistry)
[07:03:52] kart_ okay, thank you.
[07:04:07] !log kartik@deploy1002 Started scap: Backport for [[gerrit:1049898|Enable MinT for Wikipedia readers MVP on a set of pilot wikis (T363465)]]
[07:04:13] T363465: Enable MinT for Wikipedia readers MVP on a set of pilot wikis - https://phabricator.wikimedia.org/T363465
[07:06:36] !log kartik@deploy1002 kartik: Backport for [[gerrit:1049898|Enable MinT for Wikipedia readers MVP on a set of pilot wikis (T363465)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:07:54] (PS1) Ayounsi: magru/EdgeUno: don't re advertise anycast in NA and EU [homer/public] - https://gerrit.wikimedia.org/r/1050203
[07:10:17] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1049883 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[07:10:53] (CR) Muehlenhoff: [C:+2] Remove acmechief annotations for remaining Search roles [puppet] - https://gerrit.wikimedia.org/r/1049873 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[07:11:31] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049859 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[07:13:19] !log kartik@deploy1002 kartik: Continuing with sync
[07:16:14] (CR) Slyngshede: [C:+2] R:idp_test: Separate testing environment for CAS 7 [puppet] - https://gerrit.wikimedia.org/r/1049883 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[07:16:25] (CR) Muehlenhoff:
[C:+2] Remove acmechief annotations for swift/ceph [puppet] - https://gerrit.wikimedia.org/r/1049859 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[07:18:00] (PS1) Muehlenhoff: Remove acmechief annotations for prometheus [puppet] - https://gerrit.wikimedia.org/r/1050205 (https://phabricator.wikimedia.org/T365799)
[07:18:27] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1049898|Enable MinT for Wikipedia readers MVP on a set of pilot wikis (T363465)]] (duration: 14m 19s)
[07:18:32] T363465: Enable MinT for Wikipedia readers MVP on a set of pilot wikis - https://phabricator.wikimedia.org/T363465
[07:19:17] (PS2) Muehlenhoff: Switch acmechief1001/2001 to insetup::buster [puppet] - https://gerrit.wikimedia.org/r/1049837 (https://phabricator.wikimedia.org/T365799)
[07:19:18] KCVelaga: let's start deployment.
[07:19:31] KCVelaga: is it possible to test on debug server(s)?
[07:20:53] Okay, I am not very sure about debug servers.
[07:22:03] (PS3) Muehlenhoff: Switch acmechief1001/2001 to insetup::buster [puppet] - https://gerrit.wikimedia.org/r/1049837 (https://phabricator.wikimedia.org/T365799)
[07:22:19] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049837 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[07:22:26] KCVelaga: https://wikitech.wikimedia.org/wiki/WikimediaDebug - you can install Firefox/Chrome extension and if patch can be tested using browsers, it is easy to test with it.
[07:22:28] (CR) Muehlenhoff: [C:+2] Remove acmechief annotations for prometheus [puppet] - https://gerrit.wikimedia.org/r/1050205 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[07:23:41] KCVelaga: In our case, It seems difficult to test on the browser, right?
[07:23:50] Puppet, Data-Persistence, database-backups: Possible weird interaction between es backups and puppet runs leading to failures - https://phabricator.wikimedia.org/T367882#9929348 (jcrespo) p:Low→Medium I got another error at backup2002 (es5): ` 2024-06-26 17:07:31 [ERROR] - Could not read data...
[07:24:15] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.makevm for new host idp-test1004.wikimedia.org
[07:24:16] !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox
[07:24:43] kart_ yes, browser testing might not be possible
[07:25:03] OK. Then, I'll just deploy it.
[07:25:30] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1048393 (https://phabricator.wikimedia.org/T368028) (owner: KCVelaga)
[07:27:00] (Merged) jenkins-bot: Add Metrics Platform stream configuration and registration for MinT for Wikipedia Readers feature by Language and Product Localization team. [mediawiki-config] - https://gerrit.wikimedia.org/r/1048393 (https://phabricator.wikimedia.org/T368028) (owner: KCVelaga)
[07:27:28] !log kartik@deploy1002 Started scap: Backport for [[gerrit:1048393|Add Metrics Platform stream configuration and registration for MinT for Wikipedia Readers feature by Language and Product Localization team.
(T368028)]]
[07:27:34] T368028: MinT for Readers instrumentation: setup stream configuration and registration - https://phabricator.wikimedia.org/T368028
[07:28:46] (PS1) Muehlenhoff: Switch acmechief to Puppet 7 on the role level [puppet] - https://gerrit.wikimedia.org/r/1050244 (https://phabricator.wikimedia.org/T365799)
[07:29:51] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1050244 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[07:30:01] !log kartik@deploy1002 kcvelaga, kartik: Backport for [[gerrit:1048393|Add Metrics Platform stream configuration and registration for MinT for Wikipedia Readers feature by Language and Product Localization team. (T368028)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:30:32] FIRING: SystemdUnitFailed: cfssl-ocsprefresh-aux_front_proxy.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:31:08] !log kartik@deploy1002 kcvelaga, kartik: Continuing with sync
[07:31:44] (CR) Cathal Mooney: [C:+1] "LGTM!"
[homer/public] - https://gerrit.wikimedia.org/r/1050203 (owner: Ayounsi)
[07:32:25] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp-test1004.wikimedia.org - slyngshede@cumin1002"
[07:32:27] SRE, Acme-chief, Infrastructure-Foundations, Puppet-Infrastructure, and 2 others: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7 - https://phabricator.wikimedia.org/T365799#9929377 (MoritzMuehlenhoff)
[07:33:15] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts irc2001.wikimedia.org
[07:33:55] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp-test1004.wikimedia.org - slyngshede@cumin1002"
[07:33:56] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:33:56] !log slyngshede@cumin1002 START - Cookbook sre.dns.wipe-cache idp-test1004.wikimedia.org on all recursors
[07:33:59] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) idp-test1004.wikimedia.org on all recursors
[07:34:27] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp-test1004.wikimedia.org - slyngshede@cumin1002"
[07:35:43] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp-test1004.wikimedia.org - slyngshede@cumin1002"
[07:36:07] !log slyngshede@cumin1002 START - Cookbook sre.hosts.reimage for host idp-test1004.wikimedia.org with OS bookworm
[07:36:11] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1048393|Add Metrics Platform stream configuration and registration for MinT for Wikipedia Readers feature
by Language and Product Localization team. (T368028)]] (duration: 08m 42s)
[07:36:17] T368028: MinT for Readers instrumentation: setup stream configuration and registration - https://phabricator.wikimedia.org/T368028
[07:36:21] SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9929380 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1002 for host idp-test1004.wikimedia.org with OS bookworm
[07:37:37] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[07:39:31] (PS1) Jelto: aptrepo: revert gitlab-ce version to 16.11 [puppet] - https://gerrit.wikimedia.org/r/1050246 (https://phabricator.wikimedia.org/T368565)
[07:39:50] (CR) Ayounsi: [C:+2] magru/EdgeUno: don't re advertise anycast in NA and EU [homer/public] - https://gerrit.wikimedia.org/r/1050203 (owner: Ayounsi)
[07:40:23] KCVelaga: deployed.
[07:40:24] (Merged) jenkins-bot: magru/EdgeUno: don't re advertise anycast in NA and EU [homer/public] - https://gerrit.wikimedia.org/r/1050203 (owner: Ayounsi)
[07:40:44] (CR) Jelto: "Is this possible to go lower the version number of the gitlab-ce package again?" [puppet] - https://gerrit.wikimedia.org/r/1050246 (https://phabricator.wikimedia.org/T368565) (owner: Jelto)
[07:42:10] kart_ thank you. Everything looks good. I just checked on igwiki
[07:42:24] Nice!
[07:45:43] !log jynus@cumin1002 dbctl commit (dc=all): 'Repool es1022 at 10% weight T363812', diff saved to https://phabricator.wikimedia.org/P65512 and previous config saved to /var/cache/conftool/dbconfig/20240627-074542-jynus.json
[07:45:49] T363812: Setup backups for es6, es7 and archive old read only backups - https://phabricator.wikimedia.org/T363812
[07:46:24] (CR) Muehlenhoff: "Only indirectly, reprepro in our version doesn't support downgrades. You'd have to remove it and then re-reimport the other version.
Or al" [puppet] - https://gerrit.wikimedia.org/r/1050246 (https://phabricator.wikimedia.org/T368565) (owner: Jelto)
[07:50:23] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version
[07:51:38] (CR) Muehlenhoff: "It's not an adequate fix to disable this randomly on some roles, this will only cause issues when we at some point need IP filterung for t" [puppet] - https://gerrit.wikimedia.org/r/1050080 (https://phabricator.wikimedia.org/T356296) (owner: Dzahn)
[07:54:48] !log jynus@cumin1002 dbctl commit (dc=all): 'Repool es1025 at 10% weight T363812', diff saved to https://phabricator.wikimedia.org/P65513 and previous config saved to /var/cache/conftool/dbconfig/20240627-075447-jynus.json
[07:54:53] T363812: Setup backups for es6, es7 and archive old read only backups - https://phabricator.wikimedia.org/T363812
[07:56:20] !log jynus@cumin1002 dbctl commit (dc=all): 'Repool es1022 at 50% weight T363812', diff saved to https://phabricator.wikimedia.org/P65514 and previous config saved to /var/cache/conftool/dbconfig/20240627-075620-jynus.json
[07:57:21] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version
[07:57:36] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version
[07:59:25] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:59:45] !log jynus@cumin1002 dbctl commit (dc=all): 'Repool es1025 at 50% weight T363812', diff saved to https://phabricator.wikimedia.org/P65515 and previous config saved to
/var/cache/conftool/dbconfig/20240627-075944-jynus.json
[08:00:04] jeena and jnuche: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240627T0800).
[08:01:27] (PS1) Muehlenhoff: Remove and reclaim GIDs for legacy analytics group [puppet] - https://gerrit.wikimedia.org/r/1050252
[08:04:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version
[08:06:19] (CR) Jelto: [C:+2] aptrepo: revert gitlab-ce version to 16.11 [puppet] - https://gerrit.wikimedia.org/r/1050246 (https://phabricator.wikimedia.org/T368565) (owner: Jelto)
[08:08:14] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: irc2001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:09:20] (PS1) Ayounsi: Tell EdgeUno to not re-advertise anycast from Novaacore to EU/NA [homer/public] - https://gerrit.wikimedia.org/r/1050253
[08:09:53] (CR) Cathal Mooney: [C:+1] Tell EdgeUno to not re-advertise anycast from Novaacore to EU/NA [homer/public] - https://gerrit.wikimedia.org/r/1050253 (owner: Ayounsi)
[08:10:11] (CR) Ayounsi: [C:+2] Tell EdgeUno to not re-advertise anycast from Novaacore to EU/NA [homer/public] - https://gerrit.wikimedia.org/r/1050253 (owner: Ayounsi)
[08:10:16] !log jynus@cumin1002 dbctl commit (dc=all): 'Repool es1022 at 100% weight T363812', diff saved to https://phabricator.wikimedia.org/P65516 and previous config saved to /var/cache/conftool/dbconfig/20240627-081016-jynus.json
[08:10:22] T363812: Setup backups for es6, es7 and archive old read only backups - https://phabricator.wikimedia.org/T363812
[08:10:40] (Merged) jenkins-bot: Tell EdgeUno to not
re-advertise anycast from Novaacore to EU/NA [homer/public] - 10https://gerrit.wikimedia.org/r/1050253 (owner: 10Ayounsi) [08:10:45] !log jynus@cumin1002 dbctl commit (dc=all): 'Repool es1025 at 100% weight T363812', diff saved to https://phabricator.wikimedia.org/P65517 and previous config saved to /var/cache/conftool/dbconfig/20240627-081044-jynus.json [08:17:05] (03PS1) 10Muehlenhoff: Add ripgrep and fd-find to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/1050257 [08:18:18] (03PS1) 10Fabfur: benthos:cache: using tcp socket to send syslog messages from haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1050258 (https://phabricator.wikimedia.org/T365718) [08:20:37] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [08:23:04] (03PS2) 10Muehlenhoff: Add ripgrep and fd-find to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/1050257 [08:23:52] (03PS3) 10Jcrespo: dbbackups: Set sql_mode as loose so that invalid enum values are ok [puppet] - 10https://gerrit.wikimedia.org/r/1048424 (https://phabricator.wikimedia.org/T367162) [08:23:52] (03PS1) 10Jcrespo: dbbackups: Reenable es backups, also enable ro ones for archival [puppet] - 10https://gerrit.wikimedia.org/r/1050259 (https://phabricator.wikimedia.org/T363812) [08:26:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: irc2001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:26:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:26:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts irc2001.wikimedia.org [08:27:00] 06SRE, 07SRE-Unowned, 10Wikimedia-IRC-RC-Server: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702#9929499 (10ops-monitoring-bot) 
cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `irc2001.wikimedia.org` - irc2001.wikimedia.org (**PASS**) - Downtimed host... [08:27:10] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts irc1001.wikimedia.org [08:28:20] (03PS1) 10Muehlenhoff: Remove irc1001/irc2001 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1050260 (https://phabricator.wikimedia.org/T331702) [08:29:15] RESOLVED: SystemdUnitFailed: cfssl-ocsprefresh-aux_front_proxy.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:30:44] (03CR) 10Ayounsi: [C:03+1] Update aggregate route creation policy for network pops (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1043229 (https://phabricator.wikimedia.org/T367439) (owner: 10Cathal Mooney) [08:30:44] (03PS1) 10Muehlenhoff: Remove irc1001/irc2001 from mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050261 (https://phabricator.wikimedia.org/T331702) [08:33:01] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:33:44] (03CR) 10Elukey: [C:03+1] "left a comment, a blast from the past! 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/1050252 (owner: 10Muehlenhoff) [08:35:32] (03CR) 10Muehlenhoff: [C:03+2] Remove irc1001/irc2001 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1050260 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [08:36:12] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: irc1001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:37:28] (03CR) 10Muehlenhoff: Remove and reclaim GIDs for legacy analytics group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1050252 (owner: 10Muehlenhoff) [08:37:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: irc1001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:37:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:37:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts irc1001.wikimedia.org [08:37:55] 06SRE, 07SRE-Unowned, 10Wikimedia-IRC-RC-Server, 13Patch-For-Review: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702#9929520 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `irc1001.wikimedia.org` - irc1001.wikimedia.org (**PASS... [08:38:33] (03CR) 10Elukey: [C:04-1] "Seems not building fine, still WIP" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049587 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [08:39:25] 06SRE, 07SRE-Unowned, 10Wikimedia-IRC-RC-Server, 13Patch-For-Review: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702#9929521 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff The old nodes have been decommissioned, all done. 
[08:40:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance [08:40:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance [08:40:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T364069)', diff saved to https://phabricator.wikimedia.org/P65518 and previous config saved to /var/cache/conftool/dbconfig/20240627-084043-marostegui.json [08:40:50] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [08:45:43] (03PS1) 10Ayounsi: Tox: add python 3.12 support [software/homer] - 10https://gerrit.wikimedia.org/r/1050262 [08:46:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06Infrastructure-Foundations: Decommission ganeti1019 - https://phabricator.wikimedia.org/T368597 (10MoritzMuehlenhoff) 03NEW [08:47:19] (03CR) 10Jcrespo: [C:03+2] dbbackups: Reenable es backups, also enable ro ones for archival [puppet] - 10https://gerrit.wikimedia.org/r/1050259 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo) [08:47:29] (03PS2) 10Jcrespo: dbbackups: Reenable es backups, also enable ro ones for archival [puppet] - 10https://gerrit.wikimedia.org/r/1050259 (https://phabricator.wikimedia.org/T363812) [08:47:45] (03PS1) 10Muehlenhoff: Remove ganeti1019 from list of active Ganeti nodes in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1050263 (https://phabricator.wikimedia.org/T368597) [08:48:21] !log slyngshede@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host idp-test1004.wikimedia.org with OS bookworm [08:48:22] !log slyngshede@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host idp-test1004.wikimedia.org [08:48:33] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 13Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9929557 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by slyngshede@cumin1002 for host idp-test1004.wikimedia.org with OS bookworm executed with errors: -... [08:49:15] (03CR) 10Majavah: [C:03+2] conftool-data: drop labweb pool [puppet] - 10https://gerrit.wikimedia.org/r/941460 (https://phabricator.wikimedia.org/T317463) (owner: 10Majavah) [08:49:42] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1050257 (owner: 10Muehlenhoff) [08:50:01] (03PS10) 10Cathal Mooney: WIP: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) [08:50:09] (03CR) 10Majavah: [C:03+1] Add ripgrep and fd-find to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/1050257 (owner: 10Muehlenhoff) [08:50:23] (03CR) 10CI reject: [V:04-1] WIP: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [08:51:53] (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti1019 from list of active Ganeti nodes in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1050263 (https://phabricator.wikimedia.org/T368597) (owner: 10Muehlenhoff) [08:52:49] (03CR) 10Jcrespo: [V:03+2 C:03+2] dbbackups: Reenable es backups, also enable ro ones for archival [puppet] - 10https://gerrit.wikimedia.org/r/1050259 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo) [08:54:11] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1019.eqiad.wmnet [08:54:14] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1105-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [08:56:02] 
(03CR) 10Hnowlan: [C:03+1] Add ripgrep and fd-find to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/1050257 (owner: 10Muehlenhoff) [08:56:22] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: Decommission ganeti1019 - https://phabricator.wikimedia.org/T368597#9929576 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:58:45] (03PS1) 10Majavah: Replace 'labweb' variable names with 'cloudweb' [puppet] - 10https://gerrit.wikimedia.org/r/1050264 (https://phabricator.wikimedia.org/T317463) [08:59:08] (03CR) 10CI reject: [V:04-1] Replace 'labweb' variable names with 'cloudweb' [puppet] - 10https://gerrit.wikimedia.org/r/1050264 (https://phabricator.wikimedia.org/T317463) (owner: 10Majavah) [08:59:09] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:59:21] (03PS4) 10Jcrespo: dbbackups: Set sql_mode as loose so that invalid enum values are ok [puppet] - 10https://gerrit.wikimedia.org/r/1048424 (https://phabricator.wikimedia.org/T367162) [08:59:21] (03PS1) 10Jcrespo: dbbackups: Fix backup name conflict, disable regular backups [puppet] - 10https://gerrit.wikimedia.org/r/1050265 (https://phabricator.wikimedia.org/T363812) [08:59:21] (03CR) 10Btullis: [C:03+1] Add ripgrep and fd-find to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/1050257 (owner: 10Muehlenhoff) [08:59:34] (03PS2) 10Jcrespo: dbbackups: Fix backup name conflict, disable regular backups [puppet] - 10https://gerrit.wikimedia.org/r/1050265 (https://phabricator.wikimedia.org/T363812) [08:59:48] (03PS2) 10Majavah: Replace 'labweb' variable names with 'cloudweb' [puppet] - 10https://gerrit.wikimedia.org/r/1050264 (https://phabricator.wikimedia.org/T317463) [09:00:28] (03CR) 10JMeybohm: "I think it's because this is a plain file rather than a template and PCC does not or is not able to create diffs for those" [puppet] - 10https://gerrit.wikimedia.org/r/1049876 (owner: 10Elukey) [09:00:59] 10 [09:01:01] (03PS11) 10Cathal Mooney: Add DSCP marking options to current 
firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) [09:01:01] err :) [09:01:39] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1019.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:01:58] (03CR) 10Btullis: [C:03+1] "Nice, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1050252 (owner: 10Muehlenhoff) [09:04:03] (03PS4) 10David Caro: horizon: remove openstack client packages [puppet] - 10https://gerrit.wikimedia.org/r/1035412 [09:04:15] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [09:04:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1019.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:04:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:04:16] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ganeti1019.eqiad.wmnet [09:04:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: Decommission ganeti1019 - https://phabricator.wikimedia.org/T368597#9929588 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti1019.eqiad.wmnet` - ganeti1019.eqiad.wmnet (**FAIL**) - //Ho... 
[09:04:36] (03CR) 10CI reject: [V:04-1] Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [09:04:53] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: Decommission ganeti1019 - https://phabricator.wikimedia.org/T368597#9929589 (10MoritzMuehlenhoff) [09:04:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: Decommission ganeti1019 - https://phabricator.wikimedia.org/T368597#9929590 (10MoritzMuehlenhoff) a:03Jclark-ctr [09:05:35] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 11 CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compil" [puppet] - 10https://gerrit.wikimedia.org/r/1050264 (https://phabricator.wikimedia.org/T317463) (owner: 10Majavah) [09:05:50] (03CR) 10Muehlenhoff: [C:03+2] Remove and reclaim GIDs for legacy analytics group [puppet] - 10https://gerrit.wikimedia.org/r/1050252 (owner: 10Muehlenhoff) [09:06:40] (03PS1) 10Slyngshede: R:idp_test Remove references to host that does not yet exist. [puppet] - 10https://gerrit.wikimedia.org/r/1050266 (https://phabricator.wikimedia.org/T367487) [09:07:08] (03CR) 10Vgutierrez: "given that we are considering TCP here, maybe an UDS (uxst@) could work too?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1050258 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [09:08:13] (03CR) 10Jcrespo: [C:03+2] dbbackups: Fix backup name conflict, disable regular backups [puppet] - 10https://gerrit.wikimedia.org/r/1050265 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo) [09:09:00] (03CR) 10Isabelle Hurbain-Palatin: [C:03+1] pcs: Enable resource change events on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049530 (owner: 10Jgiannelos) [09:09:18] (03CR) 10Jgiannelos: [C:03+2] pcs: Enable resource change events on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049530 (owner: 10Jgiannelos) [09:09:38] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.makevm for new host idp-test2004.wikimedia.org [09:09:40] !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox [09:10:15] (03Abandoned) 10Slyngshede: R:idp_test Remove references to host that does not yet exist. [puppet] - 10https://gerrit.wikimedia.org/r/1050266 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [09:10:20] (03Merged) 10jenkins-bot: pcs: Enable resource change events on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049530 (owner: 10Jgiannelos) [09:12:03] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp-test2004.wikimedia.org - slyngshede@cumin1002" [09:12:04] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 34820776 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:13:06] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp-test2004.wikimedia.org - slyngshede@cumin1002" [09:13:06] !log slyngshede@cumin1002 END (PASS) - Cookbook 
sre.dns.netbox (exit_code=0) [09:13:06] !log slyngshede@cumin1002 START - Cookbook sre.dns.wipe-cache idp-test2004.wikimedia.org on all recursors [09:13:09] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) idp-test2004.wikimedia.org on all recursors [09:13:37] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp-test2004.wikimedia.org - slyngshede@cumin1002" [09:14:38] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp-test2004.wikimedia.org - slyngshede@cumin1002" [09:15:06] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 928 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:16:13] (03PS1) 10Slyngshede: R:idp_test: hardend_tls to false. [puppet] - 10https://gerrit.wikimedia.org/r/1050269 (https://phabricator.wikimedia.org/T367487) [09:18:35] (03CR) 10Slyngshede: [C:03+2] R:idp_test: hardend_tls to false. 
[puppet] - 10https://gerrit.wikimedia.org/r/1050269 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [09:21:07] !log slyngshede@cumin1002 START - Cookbook sre.hosts.reimage for host idp-test2004.wikimedia.org with OS bookworm [09:21:22] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 13Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9929646 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1002 for host idp-test2004.wikimedia.org with OS bookworm [09:21:38] (03CR) 10Kamila Součková: [C:03+1] testwiki: use shellbox-video for scaling video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049970 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [09:23:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049970 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [09:23:53] (03PS12) 10Cathal Mooney: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) [09:24:29] (03PS1) 10Arturo Borrero Gonzalez: toolforge: remove references to PodSecurityPolicy [puppet] - 10https://gerrit.wikimedia.org/r/1050271 (https://phabricator.wikimedia.org/T368142) [09:26:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:27:25] (03PS1) 10Jcrespo: WIP backups [puppet] - 10https://gerrit.wikimedia.org/r/1050273 [09:27:59] (03CR) 10Fabfur: "This will work (I've already tried on a test server) but the socket needs to be created by Benthos in the HAProxy chroot (/var/lib/haproxy" 
[puppet] - 10https://gerrit.wikimedia.org/r/1050258 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [09:33:34] (03CR) 10Hnowlan: [C:03+1] "lgtm mostly. There's a reference to mw-on-k8s.lua in hieradata/common/profile/trafficserver/backend.yaml:381 that needs clarifying in ligh" [puppet] - 10https://gerrit.wikimedia.org/r/1049507 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [09:34:52] (03PS1) 10Vgutierrez: haproxy,varnish: Handle internal only HTTP headers on HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1050275 (https://phabricator.wikimedia.org/T368557) [09:35:57] (03CR) 10Muehlenhoff: "Initial round of comments" [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [09:37:29] (03PS2) 10Vgutierrez: haproxy,varnish: Handle internal only HTTP headers on HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1050275 (https://phabricator.wikimedia.org/T368557) [09:37:36] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] toolforge: remove references to PodSecurityPolicy [puppet] - 10https://gerrit.wikimedia.org/r/1050271 (https://phabricator.wikimedia.org/T368142) (owner: 10Arturo Borrero Gonzalez) [09:38:17] !log slyngshede@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on idp-test2004.wikimedia.org with reason: host reimage [09:38:33] (03CR) 10Vgutierrez: [C:03+1] "Acknowledged" [puppet] - 10https://gerrit.wikimedia.org/r/1050258 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [09:38:39] (03PS5) 10Clément Goubert: envoy: Ensure legacy listeners point to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1047447 (https://phabricator.wikimedia.org/T367949) [09:38:39] (03PS2) 10Clément Goubert: trafficserver: Cleanup mw-on-k8s scripts [puppet] - 10https://gerrit.wikimedia.org/r/1049507 (https://phabricator.wikimedia.org/T367949) [09:39:00] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1050275 
(https://phabricator.wikimedia.org/T368557) (owner: 10Vgutierrez) [09:39:03] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9929683 (10ABran-WMF) [09:40:48] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idp-test2004.wikimedia.org with reason: host reimage [09:41:41] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [09:42:11] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [09:50:56] (03CR) 10Jforrester: "This broke e.g. https://foundation.wikimedia.org/wiki/Fundraising :-(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742413 (owner: 10Urbanecm) [09:51:30] (03PS1) 10Jgiannelos: mobileapps: Fix broken config yaml definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050280 [09:51:37] (03CR) 10CI reject: [V:04-1] mobileapps: Fix broken config yaml definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050280 (owner: 10Jgiannelos) [09:51:43] (03PS2) 10Jgiannelos: mobileapps: Fix broken config yaml definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050280 [09:52:06] (03PS2) 10Hnowlan: testwiki: use shellbox-video for scaling video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049970 (https://phabricator.wikimedia.org/T356241) [09:53:48] (03CR) 10Vgutierrez: "we have two options here, implement `ensure => absent` support on `trafficserver::lua_script` or after puppet runs on the CDN clusters cle" [puppet] - 10https://gerrit.wikimedia.org/r/1049507 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [09:54:54] (03CR) 10Isabelle Hurbain-Palatin: [C:03+1] mobileapps: Fix broken config yaml definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050280 (owner: 10Jgiannelos) [09:55:39] PROBLEM - Backup freshness on backup1001 is 
CRITICAL: Stale: 2 (backup1002, ...), Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [09:55:41] (03PS10) 10Btullis: Add WMF customisations to the upstream ceph-csi-rbd chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028932 (https://phabricator.wikimedia.org/T364472) [09:55:42] (03PS15) 10Btullis: Deploy the ceph-csi-rbd chart to dse-k8s with default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028938 (https://phabricator.wikimedia.org/T364472) [09:55:42] (03PS21) 10Btullis: Add a values file for the ceph-csi plugin on dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031589 (https://phabricator.wikimedia.org/T327259) [09:55:55] (03CR) 10Jgiannelos: [C:03+2] mobileapps: Fix broken config yaml definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050280 (owner: 10Jgiannelos) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240627T1000) [10:00:56] (03CR) 10Btullis: "@ltoscano@wikimedia.org - I have amended this commit by putting additional if-guards around some of the clusterrole RBAC rights." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1028932 (https://phabricator.wikimedia.org/T364472) (owner: 10Btullis) [10:01:38] (03CR) 10Btullis: Add WMF customisations to the upstream ceph-csi-rbd chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028932 (https://phabricator.wikimedia.org/T364472) (owner: 10Btullis) [10:02:06] (03Merged) 10jenkins-bot: mobileapps: Fix broken config yaml definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050280 (owner: 10Jgiannelos) [10:04:22] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [10:04:32] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [10:08:10] (03PS6) 10Clément Goubert: envoy: Ensure legacy listeners point to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1047447 (https://phabricator.wikimedia.org/T367949) [10:09:28] 06SRE, 06Traffic: Perform katran load tests on lvs1013 - https://phabricator.wikimedia.org/T342618#9929835 (10Vgutierrez) 05Open→03Resolved [10:10:19] (03CR) 10Elukey: [C:03+1] Add WMF customisations to the upstream ceph-csi-rbd chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028932 (https://phabricator.wikimedia.org/T364472) (owner: 10Btullis) [10:10:35] (03PS1) 10Ilias Sarantopoulos: ml-services: deploy wiki suffix check for revertrisk models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050286 [10:11:03] (03CR) 10Ilias Sarantopoulos: [C:03+1] "Seems good!" [puppet] - 10https://gerrit.wikimedia.org/r/1049943 (owner: 10Klausman) [10:14:04] (03CR) 10Ilias Sarantopoulos: [C:03+1] "The prefix indeed doesn't add anything (everything is a test) but I'd keep it since all the other httpbb modules follow the same paradigm" [puppet] - 10https://gerrit.wikimedia.org/r/1049943 (owner: 10Klausman) [10:14:24] (03CR) 10FNegri: [C:03+1] "Thanks for this one Taavi! 
I didn't do a thorough search in the repo, but I checked the diff and the PCC and all looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1050264 (https://phabricator.wikimedia.org/T317463) (owner: 10Majavah) [10:14:45] (03CR) 10Klausman: "Acknowledged" [puppet] - 10https://gerrit.wikimedia.org/r/1049943 (owner: 10Klausman) [10:14:55] (03CR) 10Klausman: [C:03+2] httpbb/liftwing: Split up test definitions by k8s NS [puppet] - 10https://gerrit.wikimedia.org/r/1049943 (owner: 10Klausman) [10:16:14] (03CR) 10Majavah: [V:03+1 C:03+2] Replace 'labweb' variable names with 'cloudweb' [puppet] - 10https://gerrit.wikimedia.org/r/1050264 (https://phabricator.wikimedia.org/T317463) (owner: 10Majavah) [10:17:19] (03CR) 10AikoChou: [C:03+1] ml-services: deploy wiki suffix check for revertrisk models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050286 (owner: 10Ilias Sarantopoulos) [10:18:02] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: deploy wiki suffix check for revertrisk models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050286 (owner: 10Ilias Sarantopoulos) [10:19:04] (03Merged) 10jenkins-bot: ml-services: deploy wiki suffix check for revertrisk models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050286 (owner: 10Ilias Sarantopoulos) [10:20:47] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [10:21:04] (03CR) 10Ladsgroup: [C:03+1] "You can deploy this by going to deploy1002 and run:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050261 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [10:24:00] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [10:24:17] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [10:24:21] (03CR) 10Jforrester: [C:03+1] "Should be good to go." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1046710 (https://phabricator.wikimedia.org/T366234) (owner: Clare Ming)
[10:27:15] (PS3) Clément Goubert: trafficserver: Cleanup mw-on-k8s scripts [puppet] - https://gerrit.wikimedia.org/r/1049507 (https://phabricator.wikimedia.org/T367949)
[10:27:31] (CR) Fabfur: [C:+2] benthos:cache: using tcp socket to send syslog messages from haproxy [puppet] - https://gerrit.wikimedia.org/r/1050258 (https://phabricator.wikimedia.org/T365718) (owner: Fabfur)
[10:28:10] !log disable puppet on A:cp-ulsfo to apply selectively https://gerrit.wikimedia.org/r/c/operations/puppet/+/1050258 (T365718)
[10:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:16] T365718: Switch HAProxy/Benthos to rfc5424 - https://phabricator.wikimedia.org/T365718
[10:29:40] jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade.
[10:29:51] (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049507 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert)
[10:30:23] !log correcting previous statement: puppet disabled just on A:cp-text_ulsfo
[10:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:25] (PS4) Clément Goubert: trafficserver: Cleanup mw-on-k8s scripts [puppet] - https://gerrit.wikimedia.org/r/1049507 (https://phabricator.wikimedia.org/T367949)
[10:34:06] (CR) Hnowlan: [C:+1] prometheus-php-fpm-exporter, prometheus-apache-exporter: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/997896 (https://phabricator.wikimedia.org/T283861) (owner: Clément Goubert)
[10:34:51] (PS1) Fabfur: Revert "benthos:cache: using tcp socket to send syslog messages from haproxy" [puppet] - https://gerrit.wikimedia.org/r/1050297
[10:37:14] (CR) Cathal Mooney: Update aggregate route creation policy for network pops (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1043229 (https://phabricator.wikimedia.org/T367439) (owner: Cathal Mooney)
[10:38:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version
[10:38:21] (PS5) Clément Goubert: trafficserver: Cleanup mw-on-k8s scripts [puppet] - https://gerrit.wikimedia.org/r/1049507 (https://phabricator.wikimedia.org/T367949)
[10:38:45] (CR) Fabfur: [C:+2] Revert "benthos:cache: using tcp socket to send syslog messages from haproxy" [puppet] - https://gerrit.wikimedia.org/r/1050297 (owner: Fabfur)
[10:38:46] (PS7) Jforrester: Switch php7.4-cli to bullseye and cascade [docker-images/production-images] - https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T356293)
[10:38:52] (PS5) Elukey: mcrouter: upgrade to Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049587 (https://phabricator.wikimedia.org/T368366)
[10:38:53] (PS5) Elukey: prometheus-exporters: upgrade mcrouter and statsd to Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049588 (https://phabricator.wikimedia.org/T368366)
[10:38:53] (PS5) Elukey: service-checker: upgrade to Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049590 (https://phabricator.wikimedia.org/T368366)
[10:38:54] (PS5) Elukey: nutcracker: upgrade to Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049591 (https://phabricator.wikimedia.org/T368366)
[10:38:55] (PS5) Elukey: echoserver: upgrade to Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049828 (https://phabricator.wikimedia.org/T368366)
[10:38:56] (PS6) Elukey: cfssl-issuer: upgrade to Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049838 (https://phabricator.wikimedia.org/T368366)
[10:39:06] (CR) Hnowlan: [C:+1] rest-gateway: route commons-impact via rest-gateway [deployment-charts] - https://gerrit.wikimedia.org/r/1023958 (https://phabricator.wikimedia.org/T361835) (owner: Scott French)
[10:39:37] (PS1) Clément Goubert: trafficserver::lua_script: Implement ensure param [puppet] - https://gerrit.wikimedia.org/r/1050293 (https://phabricator.wikimedia.org/T367949)
[10:39:45] (CR) Elukey: "Moved to a more predictable uid generation, now the image builds and afaics from a quick docker run everything checks out. Lemme know!" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049587 (https://phabricator.wikimedia.org/T368366) (owner: Elukey)
[10:41:06] (CR) Clément Goubert: "I wrote an `ensure` implementation in I0331cca9f6e263cb3e555c54ee32a9779db20f8a, but I'm not opposed to going back to a manual cleanup if " [puppet] - https://gerrit.wikimedia.org/r/1049507 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert)
[10:41:38] (PS1) Clément Goubert: trafficserver: Final mw-on-k8s cleanup [puppet] - https://gerrit.wikimedia.org/r/1050300 (https://phabricator.wikimedia.org/T367949)
[10:42:40] (PS4) Clément Goubert: prometheus-php-fpm-exporter, prometheus-apache-exporter: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/997896 (https://phabricator.wikimedia.org/T283861)
[10:42:46] (PS1) Jgiannelos: mobileapps: Enable trace logs for debugging purposes [deployment-charts] - https://gerrit.wikimedia.org/r/1050302
[10:43:10] (PS2) Jgiannelos: mobileapps: Enable trace logs for debugging purposes [deployment-charts] - https://gerrit.wikimedia.org/r/1050302
[10:43:52] !log re-enabling puppet on A:cp-text_ulsfo (reverted https://gerrit.wikimedia.org/r/c/operations/puppet/+/1050297) (T365718)
[10:43:53] (CR) Giuseppe Lavagetto: "Generally lgtm; not sure why we're using the UID everywhere now. We actually just need to use it for the USER stanza." [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049587 (https://phabricator.wikimedia.org/T368366) (owner: Elukey)
[10:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:57] T365718: Switch HAProxy/Benthos to rfc5424 - https://phabricator.wikimedia.org/T365718
[10:45:54] (PS2) Clément Goubert: trafficserver: Final mw-on-k8s cleanup [puppet] - https://gerrit.wikimedia.org/r/1050300 (https://phabricator.wikimedia.org/T367949)
[10:45:55] (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1050300 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert)
[10:46:10] (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049507 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert)
[10:46:20] (CR) Clément Goubert: [C:+2] prometheus-php-fpm-exporter, prometheus-apache-exporter: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/997896 (https://phabricator.wikimedia.org/T283861) (owner: Clément Goubert)
[10:47:35] (CR) Vgutierrez: trafficserver::lua_script: Implement ensure param (4 comments) [puppet] - https://gerrit.wikimedia.org/r/1050293 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert)
[10:48:41] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idp-test2004.wikimedia.org with OS bookworm
[10:48:41] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host idp-test2004.wikimedia.org
[10:48:54] (Merged) jenkins-bot: prometheus-php-fpm-exporter, prometheus-apache-exporter: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/997896 (https://phabricator.wikimedia.org/T283861) (owner: Clément Goubert)
[10:49:09] !log slyngshede@cumin1002 START - Cookbook sre.hosts.reimage for host idp-test1004.wikimedia.org with OS bookworm
[10:50:36] SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9929996 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1002 for host idp-test2004.wikimedia.org with OS bookworm completed: - idp-test200...
[10:50:36] (PS2) Clément Goubert: trafficserver::lua_script: Implement ensure param [puppet] - https://gerrit.wikimedia.org/r/1050293 (https://phabricator.wikimedia.org/T367949)
[10:50:48] SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9930000 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1002 for host idp-test1004.wikimedia.org with OS bookworm
[10:51:10] (CR) Vgutierrez: [C:+1] hiera: upgrade haproxy to 2.8 on drmrs [puppet] - https://gerrit.wikimedia.org/r/1049104 (https://phabricator.wikimedia.org/T367756) (owner: Fabfur)
[10:51:27] !log Deploying new prometheus-php-fpm-exporter, prometheus-apache-exporter to mw-on-k8s and shellbox - T283861
[10:51:30] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[10:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:37] T283861: prometheus-apache-exporter in buster does not support -log.format json - https://phabricator.wikimedia.org/T283861
[10:52:55] (PS6) Elukey: mcrouter: upgrade to Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049587 (https://phabricator.wikimedia.org/T368366)
[10:52:55] (PS6) Elukey: prometheus-exporters: upgrade mcrouter and statsd to Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049588 (https://phabricator.wikimedia.org/T368366)
[10:52:56] (PS6) Elukey: service-checker: upgrade to Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049590 (https://phabricator.wikimedia.org/T368366)
[10:52:56] (PS6) Elukey: nutcracker: upgrade to Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049591 (https://phabricator.wikimedia.org/T368366)
[10:52:56] (PS6) Elukey: echoserver: upgrade to Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049828 (https://phabricator.wikimedia.org/T368366)
[10:52:58] (PS7) Elukey: cfssl-issuer: upgrade to Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049838 (https://phabricator.wikimedia.org/T368366)
[10:53:45] (CR) Elukey: mcrouter: upgrade to Bookworm (3 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049587 (https://phabricator.wikimedia.org/T368366) (owner: Elukey)
[10:55:05] (PS1) Arturo Borrero Gonzalez: kubedm: absent psp directory [puppet] - https://gerrit.wikimedia.org/r/1050306 (https://phabricator.wikimedia.org/T368142)
[10:55:40] (CR) Elukey: "If there is a way to just use the uid created by the mcrouter package with the USER stanza I can change my patch, I didn't find it so I ca" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1049587 (https://phabricator.wikimedia.org/T368366) (owner: Elukey)
[10:55:43] (PS2) Arturo Borrero Gonzalez: kubedm: absent psp directory [puppet] - https://gerrit.wikimedia.org/r/1050306 (https://phabricator.wikimedia.org/T368142)
[10:57:07] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1050306 (https://phabricator.wikimedia.org/T368142) (owner: Arturo Borrero Gonzalez)
[11:00:27] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[11:00:36] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[11:00:55] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[11:00:58] (CR) Arturo Borrero Gonzalez: [C:+2] kubedm: absent psp directory [puppet] - https://gerrit.wikimedia.org/r/1050306 (https://phabricator.wikimedia.org/T368142) (owner: Arturo Borrero Gonzalez)
[11:03:28] !log slyngshede@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on idp-test1004.wikimedia.org with reason: host reimage
[11:03:31] (PS1) Clément Goubert: mediawiki, shellbox: Fix prometheus-httpd-exporter flag [deployment-charts] - https://gerrit.wikimedia.org/r/1050307 (https://phabricator.wikimedia.org/T283861)
[11:03:48] (PS2) Clément Goubert: mediawiki, shellbox: Fix prometheus-httpd-exporter flag [deployment-charts] - https://gerrit.wikimedia.org/r/1050307 (https://phabricator.wikimedia.org/T283861)
[11:06:02] (PS1) Btullis: Switch ceph server firewall to nftables and permit access from dse_kubepods [puppet] - https://gerrit.wikimedia.org/r/1050308 (https://phabricator.wikimedia.org/T327259)
[11:06:34] jouncebot: nowandnext
[11:06:34] No deployments scheduled for the next 0 hour(s) and 53 minute(s)
[11:06:34] In 0 hour(s) and 53 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240627T1200)
[11:06:52] (CR) Clément Goubert: [C:+2] mediawiki, shellbox: Fix prometheus-httpd-exporter flag [deployment-charts] - https://gerrit.wikimedia.org/r/1050307 (https://phabricator.wikimedia.org/T283861) (owner: Clément Goubert)
[11:07:07] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idp-test1004.wikimedia.org with reason: host reimage
[11:07:27] (PS1) Urbanecm: ptwiki: Enable CommunityConfiguration [mediawiki-config] - https://gerrit.wikimedia.org/r/1050309 (https://phabricator.wikimedia.org/T368310)
[11:07:59] (PS2) Urbanecm: CommunityConfiguration: Log info and higher [mediawiki-config] - https://gerrit.wikimedia.org/r/1048419
[11:08:15] (PS2) Btullis: Switch ceph server firewall to nftables and permit access from dse_kubepods [puppet] - https://gerrit.wikimedia.org/r/1050308 (https://phabricator.wikimedia.org/T327259)
[11:08:31] (PS1) Arturo Borrero Gonzalez: kubeadm: remove reference to PSP directory [puppet] - https://gerrit.wikimedia.org/r/1050310 (https://phabricator.wikimedia.org/T368142)
[11:08:39] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1050309 (https://phabricator.wikimedia.org/T368310) (owner: Urbanecm)
[11:08:53] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1048419 (owner: Urbanecm)
[11:09:04] (CR) Arturo Borrero Gonzalez: [C:+2] kubeadm: remove reference to PSP directory [puppet] - https://gerrit.wikimedia.org/r/1050310 (https://phabricator.wikimedia.org/T368142) (owner: Arturo Borrero Gonzalez)
[11:10:25] (Merged) jenkins-bot: mediawiki, shellbox: Fix prometheus-httpd-exporter flag [deployment-charts] - https://gerrit.wikimedia.org/r/1050307 (https://phabricator.wikimedia.org/T283861) (owner: Clément Goubert)
[11:10:49] (PS3) Btullis: Switch ceph server firewall to nftables and permit access from dse_kubepods [puppet] - https://gerrit.wikimedia.org/r/1050308 (https://phabricator.wikimedia.org/T327259)
[11:12:57] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[11:13:07] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[11:13:19] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[11:13:26] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[11:13:40] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[11:14:00] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[11:14:24] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[11:16:11] (PS4) Btullis: Switch ceph server firewall to nftables and permit access from dse_kubepods [puppet] - https://gerrit.wikimedia.org/r/1050308 (https://phabricator.wikimedia.org/T327259)
[11:17:01] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3085/co" [puppet] - https://gerrit.wikimedia.org/r/1050308 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[11:17:12] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[11:17:59] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[11:18:10] (CR) Isabelle Hurbain-Palatin: [C:+1] "I raised an eyebrow before seeing it was for staging. Move along 😄" [deployment-charts] - https://gerrit.wikimedia.org/r/1050302 (owner: Jgiannelos)
[11:19:08] !log cgoubert@deploy1002 Started scap: Deploy new prometheus-php-fpm-exporter, prometheus-apache-exporter - T283861
[11:19:14] T283861: prometheus-apache-exporter in buster does not support -log.format json - https://phabricator.wikimedia.org/T283861
[11:20:27] (PS1) Jelto: Revert "aptrepo: revert gitlab-ce version to 16.11" [puppet] - https://gerrit.wikimedia.org/r/1050314 (https://phabricator.wikimedia.org/T365675)
[11:20:46] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1048855 (owner: Gergő Tisza)
[11:22:03] jouncebot: nowandnext
[11:22:03] No deployments scheduled for the next 0 hour(s) and 37 minute(s)
[11:22:03] In 0 hour(s) and 37 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240627T1200)
[11:24:26] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idp-test1004.wikimedia.org with OS bookworm [11:24:38] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 13Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9930092 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1002 for host idp-test1004.wikimedia.org with OS bookworm completed: - idp-test100... [11:24:57] (03CR) 10Btullis: [V:03+1] "This will need a rolling reboot of the ceph cluster, to pick up the change to an nftables based firewall." [puppet] - 10https://gerrit.wikimedia.org/r/1050308 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [11:25:03] !log cgoubert@deploy1002 Finished scap: Deploy new prometheus-php-fpm-exporter, prometheus-apache-exporter - T283861 (duration: 06m 17s) [11:25:08] T283861: prometheus-apache-exporter in buster does not support -log.format json - https://phabricator.wikimedia.org/T283861 [11:26:57] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply [11:27:22] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [11:27:28] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [11:27:42] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [11:27:48] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [11:28:07] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [11:28:13] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [11:28:34] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [11:28:41] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [11:28:57] !log cgoubert@deploy1002 
helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [11:29:03] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-video: apply [11:29:45] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [11:30:42] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox: apply [11:31:52] (03PS1) 10Jgiannelos: mobileapps: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050317 [11:32:04] (03PS2) 10Jgiannelos: mobileapps: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050317 [11:32:20] (03CR) 10Jgiannelos: [C:03+2] mobileapps: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050317 (owner: 10Jgiannelos) [11:32:45] (03CR) 10Jgiannelos: [C:03+2] "FYI the docker image wasn't the latest" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050317 (owner: 10Jgiannelos) [11:32:47] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [11:32:53] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [11:33:13] (03Merged) 10jenkins-bot: mobileapps: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050317 (owner: 10Jgiannelos) [11:33:28] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [11:33:34] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [11:34:12] (03PS1) 10Clément Goubert: Fix prometheus-httpd-exporter flags for latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050321 (https://phabricator.wikimedia.org/T283861) [11:34:14] (03PS1) 10Clément Goubert: Copy httpd module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050322 (https://phabricator.wikimedia.org/T283861) [11:34:15] (03PS1) 10Clément Goubert: lamp.httpd: Fix httpd-exporter flag [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1050323 (https://phabricator.wikimedia.org/T283861) [11:34:17] (03PS1) 10Clément Goubert: machinetranslation: Update httpd.lamp module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050324 (https://phabricator.wikimedia.org/T283861) [11:34:26] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [11:34:32] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [11:34:48] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [11:34:56] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [11:35:02] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [11:35:08] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [11:35:11] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [11:35:39] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [11:35:40] (03PS1) 10Slyngshede: IDP-Test: Switch to CAS7 hosts [dns] - 10https://gerrit.wikimedia.org/r/1050326 [11:35:45] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [11:36:25] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [11:36:48] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [11:38:05] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [11:38:13] (03CR) 10Jgiannelos: [C:03+2] mobileapps: Enable trace logs for debugging purposes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050302 (owner: 10Jgiannelos) [11:38:34] (03CR) 10Kamila Součková: [C:03+1] Fix prometheus-httpd-exporter flags for latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050321 
(https://phabricator.wikimedia.org/T283861) (owner: 10Clément Goubert) [11:38:36] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [11:38:42] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [11:39:04] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [11:39:07] (03Merged) 10jenkins-bot: mobileapps: Enable trace logs for debugging purposes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050302 (owner: 10Jgiannelos) [11:39:09] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [11:39:21] (03CR) 10Muehlenhoff: "Better split this into two patches: First switch to firewall::service (which is fully backwards compatible by emitting ferm::service in th" [puppet] - 10https://gerrit.wikimedia.org/r/1050308 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [11:39:32] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [11:39:38] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [11:39:41] (03CR) 10Kamila Součková: [C:03+1] lamp.httpd: Fix httpd-exporter flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050323 (https://phabricator.wikimedia.org/T283861) (owner: 10Clément Goubert) [11:39:47] (03CR) 10Kamila Součková: [C:03+1] Copy httpd module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050322 (https://phabricator.wikimedia.org/T283861) (owner: 10Clément Goubert) [11:40:01] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [11:40:07] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [11:40:41] (03CR) 10Kamila Součková: [C:03+1] machinetranslation: Update httpd.lamp module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050324 (https://phabricator.wikimedia.org/T283861) (owner: 10Clément 
Goubert) [11:40:49] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [11:40:55] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [11:41:31] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [11:41:36] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [11:41:47] (03CR) 10Clément Goubert: [C:03+2] Fix prometheus-httpd-exporter flags for latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050321 (https://phabricator.wikimedia.org/T283861) (owner: 10Clément Goubert) [11:41:52] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [11:42:00] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [11:42:03] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [11:42:58] (03Merged) 10jenkins-bot: Fix prometheus-httpd-exporter flags for latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050321 (https://phabricator.wikimedia.org/T283861) (owner: 10Clément Goubert) [11:46:16] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [11:46:27] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [11:46:33] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [11:46:49] (03PS1) 10Klausman: httpbb/liftwing: Actually deploy split-out file from change 1049943 [puppet] - 10https://gerrit.wikimedia.org/r/1050328 [11:46:56] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [11:47:01] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [11:47:22] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [11:47:49] FIRING: PuppetFailure: Puppet has failed 
on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:47:49] FIRING: PuppetFailure: Puppet has failed on deploy1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:48:35] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [11:49:11] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [11:49:20] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [11:49:42] (03CR) 10Clément Goubert: [C:03+2] Copy httpd module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050322 (https://phabricator.wikimedia.org/T283861) (owner: 10Clément Goubert) [11:50:10] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 91 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:50:31] (03Merged) 10jenkins-bot: Copy httpd module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050322 (https://phabricator.wikimedia.org/T283861) (owner: 10Clément Goubert) [11:50:33] (03PS5) 10Btullis: Update ceph server firewall and permit access from dse_kubepods [puppet] - 10https://gerrit.wikimedia.org/r/1050308 (https://phabricator.wikimedia.org/T327259) [11:50:33] (03PS1) 10Btullis: Switch cephosd1001 to use the nftables based firewall [puppet] - 10https://gerrit.wikimedia.org/r/1050330 (https://phabricator.wikimedia.org/T327259) [11:50:35] (03PS1) 10Btullis: cephosd: Switch to use nftables instead of iptables [puppet] - 10https://gerrit.wikimedia.org/r/1050331 (https://phabricator.wikimedia.org/T327259) [11:50:45] (03PS2) 10Clément 
Goubert: lamp.httpd: Fix httpd-exporter flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050323 (https://phabricator.wikimedia.org/T283861) [11:51:07] (03PS2) 10Reedy: CommonSettings: Mark REL1_42 as stable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043043 (https://phabricator.wikimedia.org/T359850) [11:51:17] jouncebot: nowandnext [11:51:18] No deployments scheduled for the next 0 hour(s) and 8 minute(s) [11:51:18] In 0 hour(s) and 8 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240627T1200) [11:51:26] Bah. [11:51:29] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3087/co" [puppet] - 10https://gerrit.wikimedia.org/r/1050308 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [11:51:47] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [11:51:52] (03CR) 10Jforrester: "Good to deploy whenever." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043043 (https://phabricator.wikimedia.org/T359850) (owner: 10Reedy) [11:51:57] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [11:52:09] (03CR) 10Clément Goubert: [C:03+2] lamp.httpd: Fix httpd-exporter flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050323 (https://phabricator.wikimedia.org/T283861) (owner: 10Clément Goubert) [11:52:21] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3088/co" [puppet] - 10https://gerrit.wikimedia.org/r/1050330 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [11:52:22] (03PS2) 10Clément Goubert: machinetranslation: Update httpd.lamp module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050324 (https://phabricator.wikimedia.org/T283861) [11:52:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T364069)', diff saved to https://phabricator.wikimedia.org/P65521 and previous config saved to /var/cache/conftool/dbconfig/20240627-115244-marostegui.json [11:52:45] (03CR) 10Ilias Sarantopoulos: [C:03+1] httpbb/liftwing: Actually deploy split-out file from change 1049943 [puppet] - 10https://gerrit.wikimedia.org/r/1050328 (owner: 10Klausman) [11:52:49] FIRING: [2x] PuppetFailure: Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:52:49] FIRING: [2x] PuppetFailure: Puppet has failed on deploy1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:52:50] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [11:52:55] (03Merged) 10jenkins-bot: lamp.httpd: Fix 
httpd-exporter flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050323 (https://phabricator.wikimedia.org/T283861) (owner: 10Clément Goubert) [11:53:19] (03CR) 10Btullis: [V:03+1] "Good thinking, thanks. I have put the canary and the role switch to nftables as two subsequent patches in the stack." [puppet] - 10https://gerrit.wikimedia.org/r/1050308 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [11:53:46] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [11:54:30] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3089/co" [puppet] - 10https://gerrit.wikimedia.org/r/1050331 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [11:55:08] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 40 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:57:12] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1050308 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [11:57:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042430 (https://phabricator.wikimedia.org/T364673) (owner: 10NMW03) [11:57:25] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1050330 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [11:57:45] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1050331 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [11:58:46] !log cgoubert@deploy1002 helmfile [staging] START 
helmfile.d/services/toolhub: apply [11:58:59] (03CR) 10Btullis: [V:03+1 C:03+2] Update ceph server firewall and permit access from dse_kubepods [puppet] - 10https://gerrit.wikimedia.org/r/1050308 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [11:59:00] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [11:59:08] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply [11:59:40] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:59:52] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [12:00:04] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [dns] - 10https://gerrit.wikimedia.org/r/1050326 (owner: 10Slyngshede) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240627T1200) [12:00:14] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [12:00:33] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [12:00:41] (03CR) 10Slyngshede: [C:03+2] IDP-Test: Switch to CAS7 hosts [dns] - 10https://gerrit.wikimedia.org/r/1050326 (owner: 10Slyngshede) [12:04:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T367856)', diff saved to https://phabricator.wikimedia.org/P65522 and previous config saved to /var/cache/conftool/dbconfig/20240627-120435-marostegui.json [12:04:42] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [12:04:42] (03CR) 10Clément Goubert: [C:03+2] machinetranslation: Update httpd.lamp module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050324 (https://phabricator.wikimedia.org/T283861) (owner: 10Clément Goubert) [12:05:08] 
(PS1) Clément Goubert: Bump global prometheus-apache-exporter version [puppet] - https://gerrit.wikimedia.org/r/1050333 (https://phabricator.wikimedia.org/T283861) [12:05:47] (Merged) jenkins-bot: machinetranslation: Update httpd.lamp module [deployment-charts] - https://gerrit.wikimedia.org/r/1050324 (https://phabricator.wikimedia.org/T283861) (owner: Clément Goubert) [12:05:57] (CR) Kamila Součková: [C:+1] Bump global prometheus-apache-exporter version [puppet] - https://gerrit.wikimedia.org/r/1050333 (https://phabricator.wikimedia.org/T283861) (owner: Clément Goubert) [12:06:11] (PS1) Hashar: Add image-diff plugin [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1050334 (https://phabricator.wikimedia.org/T341291) [12:06:52] (CR) Clément Goubert: [C:+2] Bump global prometheus-apache-exporter version [puppet] - https://gerrit.wikimedia.org/r/1050333 (https://phabricator.wikimedia.org/T283861) (owner: Clément Goubert) [12:07:18] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [12:07:25] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [12:07:32] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [12:07:44] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [12:07:49] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [12:07:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P65523 and previous config saved to /var/cache/conftool/dbconfig/20240627-120751-marostegui.json [12:07:54] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [12:10:46] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [12:10:52] !log cmooney@cumin1002 
START - Cookbook sre.dns.netbox [12:10:56] !log cmooney@cumin1002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [12:11:20] ops-eqiad, SRE, DC-Ops, serviceops: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416#9930200 (Jclark-ctr) @akosiaris please update Site.pp file for this server [12:11:26] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [12:12:21] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [12:12:42] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [12:12:43] !log cmooney@cumin1002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [12:12:57] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [12:14:38] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:15:19] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [12:17:31] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [12:18:16] (CR) CI reject: [V:-1] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1050341 (owner: L10n-bot) [12:19:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P65524 and previous config saved to /var/cache/conftool/dbconfig/20240627-121942-marostegui.json [12:22:06] (CR) CI reject: [V:-1] Add image-diff plugin [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1050334 (https://phabricator.wikimedia.org/T341291) (owner: Hashar) [12:22:22] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [12:22:41] (PS1) Clément Goubert: Add deploy1003 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1050345 (https://phabricator.wikimedia.org/T364416) [12:22:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P65525 and previous config saved to /var/cache/conftool/dbconfig/20240627-122258-marostegui.json [12:23:03] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [12:23:26] (PS2) Jcrespo: dbbackups: Disable further es backups until monday while ongoing [puppet] - https://gerrit.wikimedia.org/r/1050273 (https://phabricator.wikimedia.org/T363812) [12:24:03] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:24:07] (PS3) Jcrespo: dbbackups: Disable further es backups until Monday while ongoing [puppet] - https://gerrit.wikimedia.org/r/1050273 (https://phabricator.wikimedia.org/T363812) [12:24:20] (PS4) Jcrespo: dbbackups: Disable further es backups until Monday while ongoing [puppet] - https://gerrit.wikimedia.org/r/1050273 (https://phabricator.wikimedia.org/T363812) [12:24:29] (CR) Vgutierrez: [V:+1 C:+2] prometheus::ops: Pull fifo_log_demux metrics [puppet] - https://gerrit.wikimedia.org/r/1049574 (https://phabricator.wikimedia.org/T364383) (owner: Vgutierrez) [12:24:39] (CR) 
Jcrespo: [V:+2 C:+2] dbbackups: Disable further es backups until Monday while ongoing [puppet] - https://gerrit.wikimedia.org/r/1050273 (https://phabricator.wikimedia.org/T363812) (owner: Jcrespo) [12:25:02] vgutierrez: merge? [12:25:05] go ahead please [12:25:24] ongoing... [12:26:25] vgutierrez: All done! [12:26:34] (PS1) Slyngshede: data.yaml extend Trokhymovych to 2025. [puppet] - https://gerrit.wikimedia.org/r/1050351 [12:26:43] jynus: cool, thanks [12:28:11] (CR) Muehlenhoff: [C:+1] data.yaml extend Trokhymovych to 2025. [puppet] - https://gerrit.wikimedia.org/r/1050351 (owner: Slyngshede) [12:28:37] (CR) Slyngshede: [C:+2] data.yaml extend Trokhymovych to 2025. [puppet] - https://gerrit.wikimedia.org/r/1050351 (owner: Slyngshede) [12:28:49] (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1050293 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert) [12:29:22] (CR) Klausman: [V:+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3093/console" [puppet] - https://gerrit.wikimedia.org/r/1050328 (owner: Klausman) [12:30:00] (PS2) Hashar: Add image-diff plugin [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1050334 (https://phabricator.wikimedia.org/T341291) [12:30:35] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [12:31:50] (CR) Clément Goubert: trafficserver::lua_script: Implement ensure param (4 comments) [puppet] - https://gerrit.wikimedia.org/r/1050293 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert) [12:32:35] SRE, LPL Technical Support, serviceops, Wikimedia-Site-requests, Russian-Sites: Increase $wgMaxArticleSize to 4MB for ruwikisource - https://phabricator.wikimedia.org/T308893#9930305 (MaryMunyoki) [12:32:51] SRE, LPL Technical Support, serviceops, Wikimedia-Site-requests, Patch-For-Review: 
Change $wgMaxArticleSize limit from byte-based to character-based - https://phabricator.wikimedia.org/T275319#9930306 (MaryMunyoki) [12:33:00] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for deploy1003 - jclark@cumin1002" [12:33:01] (PS6) Clément Goubert: trafficserver: Cleanup mw-on-k8s scripts [puppet] - https://gerrit.wikimedia.org/r/1049507 (https://phabricator.wikimedia.org/T367949) [12:33:05] (PS3) Clément Goubert: trafficserver: Final mw-on-k8s cleanup [puppet] - https://gerrit.wikimedia.org/r/1050300 (https://phabricator.wikimedia.org/T367949) [12:33:31] (CR) CDanis: [C:+1] Add ripgrep and fd-find to standard packages [puppet] - https://gerrit.wikimedia.org/r/1050257 (owner: Muehlenhoff) [12:33:35] (PS4) Clément Goubert: trafficserver: Final mw-on-k8s cleanup [puppet] - https://gerrit.wikimedia.org/r/1050300 (https://phabricator.wikimedia.org/T367949) [12:34:06] (PS1) Klausman: Revert "httpbb/liftwing: Split up test definitions by k8s NS" [puppet] - https://gerrit.wikimedia.org/r/1050354 [12:34:19] (CR) Klausman: [V:+2 C:+2] Revert "httpbb/liftwing: Split up test definitions by k8s NS" [puppet] - https://gerrit.wikimedia.org/r/1050354 (owner: Klausman) [12:34:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P65526 and previous config saved to /var/cache/conftool/dbconfig/20240627-123450-marostegui.json [12:35:32] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:35:49] (CR) Vgutierrez: "all varnish tests are happy for both text:" [puppet] - 
https://gerrit.wikimedia.org/r/1050275 (https://phabricator.wikimedia.org/T368557) (owner: Vgutierrez) [12:37:04] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for deploy1003 - jclark@cumin1002" [12:37:04] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:38:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T364069)', diff saved to https://phabricator.wikimedia.org/P65527 and previous config saved to /var/cache/conftool/dbconfig/20240627-123805-marostegui.json [12:38:08] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [12:38:12] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [12:38:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [12:39:00] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:39:15] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:40:05] (CR) CDanis: [C:+1] "thanks!" 
[puppet] - https://gerrit.wikimedia.org/r/1050275 (https://phabricator.wikimedia.org/T368557) (owner: Vgutierrez) [12:41:58] (CR) CI reject: [V:-1] Add image-diff plugin [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1050334 (https://phabricator.wikimedia.org/T341291) (owner: Hashar) [12:42:10] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host deploy1003.eqiad.wmnet with OS bullseye [12:42:36] ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416#9930347 (Jclark-ctr) [12:43:00] ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416#9930352 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host deploy1003.eqiad.wmnet with OS bullseye [12:44:13] SRE, cloud-services-team, Data-Services: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9930357 (Ladsgroup) I personally have no issue with giving root rights to people who have restricted or deployment rights in production (where they... 
[12:46:50] (PS1) Andrew Bogott: Move cloudvirtlocal1001 to ovs [puppet] - https://gerrit.wikimedia.org/r/1050356 (https://phabricator.wikimedia.org/T364457) [12:46:52] (PS1) Andrew Bogott: Move cloudvirtlocal1002 to ovs [puppet] - https://gerrit.wikimedia.org/r/1050357 (https://phabricator.wikimedia.org/T364457) [12:46:53] (PS1) Andrew Bogott: Move cloudvirtlocal1003 to ovs [puppet] - https://gerrit.wikimedia.org/r/1050358 (https://phabricator.wikimedia.org/T364457) [12:47:03] (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1050300 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert) [12:48:37] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ml-serve2007.codfw.wmnet with reason: Hardware maintenance for memory errors [12:48:53] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ml-serve2007.codfw.wmnet with reason: Hardware maintenance for memory errors [12:49:00] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:49:08] ops-codfw, SRE, DC-Ops, Machine-Learning-Team: hw troubleshooting: memory errors for ml-serve2007.codfw.wmnet - https://phabricator.wikimedia.org/T366688#9930380 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=14aea618-f5d5-481c-ab19-4eb0daea0ad6) set by klausman@cumin2002 for 4... 
[12:49:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T367856)', diff saved to https://phabricator.wikimedia.org/P65528 and previous config saved to /var/cache/conftool/dbconfig/20240627-124957-marostegui.json [12:50:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: Maintenance [12:50:02] !log sudo cumin 'A:dnsbox' 'rm /var/lib/dnsbox/ntp.state': remove obsolete ntp.state file [12:50:04] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [12:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: Maintenance [12:50:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2189 (T367856)', diff saved to https://phabricator.wikimedia.org/P65529 and previous config saved to /var/cache/conftool/dbconfig/20240627-125019-marostegui.json [12:52:43] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bookworm [12:53:05] (CR) Andrew Bogott: [C:+2] Move cloudvirtlocal1001 to ovs [puppet] - https://gerrit.wikimedia.org/r/1050356 (https://phabricator.wikimedia.org/T364457) (owner: Andrew Bogott) [12:53:44] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:53:44] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:54:15] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1105-production-search-eqiad is not indexing - 
https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [12:55:36] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [12:57:49] FIRING: [2x] PuppetFailure: Puppet has failed on deploy1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:58:33] (CR) Btullis: [C:+2] Initial import of ceph-csi-rbd chart for inspection [deployment-charts] - https://gerrit.wikimedia.org/r/1028931 (https://phabricator.wikimedia.org/T364472) (owner: Btullis) [12:58:36] jouncebot next [12:58:37] In 0 hour(s) and 1 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240627T1300) [12:58:39] (CR) Btullis: [C:+2] Add WMF customisations to the upstream ceph-csi-rbd chart [deployment-charts] - https://gerrit.wikimedia.org/r/1028932 (https://phabricator.wikimedia.org/T364472) (owner: Btullis) [12:58:43] (CR) Btullis: [C:+2] Deploy the ceph-csi-rbd chart to dse-k8s with default values [deployment-charts] - https://gerrit.wikimedia.org/r/1028938 (https://phabricator.wikimedia.org/T364472) (owner: Btullis) [12:58:46] (CR) Btullis: [C:+2] Add a values file for the ceph-csi plugin on dse-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1031589 (https://phabricator.wikimedia.org/T327259) (owner: Btullis) [12:59:13] DreamRimmer: o/ [12:59:22] (Merged) jenkins-bot: Initial import of ceph-csi-rbd chart for inspection [deployment-charts] - https://gerrit.wikimedia.org/r/1028931 (https://phabricator.wikimedia.org/T364472) (owner: Btullis) [12:59:24] (Merged) jenkins-bot: Add WMF 
customisations to the upstream ceph-csi-rbd chart [deployment-charts] - https://gerrit.wikimedia.org/r/1028932 (https://phabricator.wikimedia.org/T364472) (owner: Btullis) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240627T1300). [13:00:05] Dreamy_Jazz, DreamRimmer, hnowlan, urbanecm, tgr, and Nemoralis: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:13] \o [13:00:58] o/ [13:00:59] o/ [13:01:17] o/ [13:01:25] :P [13:01:30] I can deploy my change now [13:01:43] (Merged) jenkins-bot: Deploy the ceph-csi-rbd chart to dse-k8s with default values [deployment-charts] - https://gerrit.wikimedia.org/r/1028938 (https://phabricator.wikimedia.org/T364472) (owner: Btullis) [13:01:45] i can deploy today [13:01:50] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1038742 (https://phabricator.wikimedia.org/T360685) (owner: Dreamy Jazz) [13:01:56] (Merged) jenkins-bot: Add a values file for the ceph-csi plugin on dse-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1031589 (https://phabricator.wikimedia.org/T327259) (owner: Btullis) [13:02:03] Dreamy_Jazz: can you please abort your scap, or take multiple changes? [13:02:09] to speed up the window a bit :) [13:02:12] It just aborted for the need of a rebase [13:02:21] So it's no longer running. [13:02:28] ok. 
i'll take over then [13:02:30] (PS3) Dreamy Jazz: [CheckUser] Stop writing old for event tables migration on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1038742 (https://phabricator.wikimedia.org/T360685) [13:02:46] !log A:dnsbox: remove 10.3.0.2/32 from /e/n/i [13:02:48] RESOLVED: [2x] PuppetFailure: Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:02:49] RESOLVED: [2x] PuppetFailure: Puppet has failed on deploy1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:02:49] (CR) Urbanecm: [C:+2] [CheckUser] Stop writing old for event tables migration on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1038742 (https://phabricator.wikimedia.org/T360685) (owner: Dreamy Jazz) [13:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:55] (PS2) Dreamrimmer: Add VK namespace alias to Azerbaijani Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1049886 (https://phabricator.wikimedia.org/T368237) [13:03:08] (CR) Urbanecm: [C:+2] Add VK namespace alias to Azerbaijani Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1049886 (https://phabricator.wikimedia.org/T368237) (owner: Dreamrimmer) [13:03:08] (PS3) Hnowlan: testwiki: use shellbox-video for scaling video [mediawiki-config] - https://gerrit.wikimedia.org/r/1049970 (https://phabricator.wikimedia.org/T356241) [13:03:08] (CR) Urbanecm: [C:+2] testwiki: use shellbox-video for scaling video [mediawiki-config] - https://gerrit.wikimedia.org/r/1049970 (https://phabricator.wikimedia.org/T356241) (owner: Hnowlan) [13:03:22] (CR) TrainBranchBot: "Approved by urbanecm@deploy1002 using scap backport" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1038742 (https://phabricator.wikimedia.org/T360685) (owner: Dreamy Jazz) [13:03:22] (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1049970 (https://phabricator.wikimedia.org/T356241) (owner: Hnowlan) [13:03:22] (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1049886 (https://phabricator.wikimedia.org/T368237) (owner: Dreamrimmer) [13:03:35] (Merged) jenkins-bot: [CheckUser] Stop writing old for event tables migration on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1038742 (https://phabricator.wikimedia.org/T360685) (owner: Dreamy Jazz) [13:03:44] (Merged) jenkins-bot: Add VK namespace alias to Azerbaijani Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1049886 (https://phabricator.wikimedia.org/T368237) (owner: Dreamrimmer) [13:03:53] my change can only be tested in prod unfortunately (has to get to the jobrunners) [13:04:00] hnowlan: ack, thanks for the info [13:04:06] I will be able to test my change. 
[13:04:14] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [13:06:18] (PS4) Hnowlan: testwiki: use shellbox-video for scaling video [mediawiki-config] - https://gerrit.wikimedia.org/r/1049970 (https://phabricator.wikimedia.org/T356241) [13:06:21] (CR) Urbanecm: testwiki: use shellbox-video for scaling video [mediawiki-config] - https://gerrit.wikimedia.org/r/1049970 (https://phabricator.wikimedia.org/T356241) (owner: Hnowlan) [13:06:25] (CR) Urbanecm: [C:+2] testwiki: use shellbox-video for scaling video [mediawiki-config] - https://gerrit.wikimedia.org/r/1049970 (https://phabricator.wikimedia.org/T356241) (owner: Hnowlan) [13:06:32] (CR) TrainBranchBot: "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1049970 (https://phabricator.wikimedia.org/T356241) (owner: Hnowlan) [13:06:41] (CR) TrainBranchBot: "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1049970 (https://phabricator.wikimedia.org/T356241) (owner: Hnowlan) [13:07:36] (Merged) jenkins-bot: testwiki: use shellbox-video for scaling video [mediawiki-config] - https://gerrit.wikimedia.org/r/1049970 (https://phabricator.wikimedia.org/T356241) (owner: Hnowlan) [13:08:07] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1038742|[CheckUser] Stop writing old for event tables migration on all wikis (T360685)]], [[gerrit:1049970|testwiki: use shellbox-video for scaling video (T356241)]], [[gerrit:1049886|Add VK namespace alias to Azerbaijani Wikibooks (T368237)]] [13:08:17] T360685: Stop writing old for event table migration on WMF wikis - https://phabricator.wikimedia.org/T360685 [13:08:17] T356241: Move video transcoding to use Shellbox - 
https://phabricator.wikimedia.org/T356241 [13:08:18] T368237: Add "VK" namespace alias to Azerbaijani Wikibooks. - https://phabricator.wikimedia.org/T368237 [13:08:40] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirtlocal1001.eqiad.wmnet with reason: host reimage [13:09:05] (PS1) Fabfur: benthos:cache: using tcp proto for syslog messages from haproxy [puppet] - https://gerrit.wikimedia.org/r/1050361 (https://phabricator.wikimedia.org/T365718) [13:10:46] !log urbanecm@deploy1002 urbanecm, dreamrimmer, hnowlan, dreamyjazz: Backport for [[gerrit:1038742|[CheckUser] Stop writing old for event tables migration on all wikis (T360685)]], [[gerrit:1049970|testwiki: use shellbox-video for scaling video (T356241)]], [[gerrit:1049886|Add VK namespace alias to Azerbaijani Wikibooks (T368237)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:11:04] Dreamy_Jazz: DreamRimmer: can you test your patches at mwdebug, please? [13:11:14] doing [13:11:38] (PS2) Fabfur: benthos:cache: using tcp proto for syslog messages from haproxy [puppet] - https://gerrit.wikimedia.org/r/1050361 (https://phabricator.wikimedia.org/T365718) [13:11:41] doing [13:12:04] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirtlocal1001.eqiad.wmnet with reason: host reimage [13:12:41] looks good to me https://az.wikibooks.org/w/api.php?action=query&format=json&meta=siteinfo&formatversion=2&siprop=namespaces%7Cnamespacealiases [13:13:04] ty [13:13:09] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for daphnesmit/Daphne Smit/DSmit-WMF - https://phabricator.wikimedia.org/T368159#9930437 (Mcastro) Approved. 
[13:13:47] (PS1) Btullis: cephosd: Do not set an encryption password if encryption is disabled [deployment-charts] - https://gerrit.wikimedia.org/r/1050362 (https://phabricator.wikimedia.org/T327259) [13:14:24] (CR) Brouberol: [C:+1] cephosd: Do not set an encryption password if encryption is disabled [deployment-charts] - https://gerrit.wikimedia.org/r/1050362 (https://phabricator.wikimedia.org/T327259) (owner: Btullis) [13:14:36] (CR) Brouberol: cephosd: Do not set an encryption password if encryption is disabled [deployment-charts] - https://gerrit.wikimedia.org/r/1050362 (https://phabricator.wikimedia.org/T327259) (owner: Btullis) [13:15:20] (PS2) Btullis: cephosd: Do not set an encryption password if encryption is disabled [deployment-charts] - https://gerrit.wikimedia.org/r/1050362 (https://phabricator.wikimedia.org/T327259) [13:15:21] Dreamy_Jazz: what about you? [13:15:27] Just finished testing now [13:15:37] Needed to check the database [13:15:54] urbanecm: [13:16:00] ok [13:18:30] (CR) Brouberol: [C:+1] cephosd: Do not set an encryption password if encryption is disabled [deployment-charts] - https://gerrit.wikimedia.org/r/1050362 (https://phabricator.wikimedia.org/T327259) (owner: Btullis) [13:18:53] Is that everything tested then? [13:18:54] (CR) Btullis: [C:+2] cephosd: Do not set an encryption password if encryption is disabled [deployment-charts] - https://gerrit.wikimedia.org/r/1050362 (https://phabricator.wikimedia.org/T327259) (owner: Btullis) [13:19:11] Dreamy_Jazz: once you give me the green light, yes :) [13:19:17] I already did? 
[13:19:29] you said "needed to check the database" [13:19:40] i thought that's yet to happen [13:19:44] but it appears it's not :) [13:19:48] !log urbanecm@deploy1002 urbanecm, dreamrimmer, hnowlan, dreamyjazz: Continuing with sync [13:19:51] let's go ahead then) [13:19:52] hnowlan: ^^ [13:19:53] Oh I thought the "just finished testing now" was the message [13:20:11] That was supposed to be in the past tense. [13:20:16] urbanecm: ack, thanks [13:20:21] ah, sorry. i misunderstood you then [13:22:06] (Merged) jenkins-bot: cephosd: Do not set an encryption password if encryption is disabled [deployment-charts] - https://gerrit.wikimedia.org/r/1050362 (https://phabricator.wikimedia.org/T327259) (owner: Btullis) [13:22:47] (PS6) NMW03: Enable local uploads for Gilaki Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1042430 (https://phabricator.wikimedia.org/T364673) [13:23:20] (CR) Urbanecm: [C:+2] Enable local uploads for Gilaki Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1042430 (https://phabricator.wikimedia.org/T364673) (owner: NMW03) [13:23:24] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for daphnesmit/Daphne Smit/DSmit-WMF - https://phabricator.wikimedia.org/T368159#9930461 (Aklapper) @Mcastro Please also see my previous comment and fix your account - thanks! 
[13:23:25] (PS2) Gergő Tisza: [noop] Remove $wgRedirectScript, not used since MediaWiki 1.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1048855 [13:23:28] (CR) Urbanecm: [C:+2] [noop] Remove $wgRedirectScript, not used since MediaWiki 1.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1048855 (owner: Gergő Tisza) [13:23:34] (PS3) Urbanecm: CommunityConfiguration: Log info and higher [mediawiki-config] - https://gerrit.wikimedia.org/r/1048419 [13:23:35] (PS2) Klausman: httpbb/liftwing: Split up test definitions by k8s NS [puppet] - https://gerrit.wikimedia.org/r/1050355 [13:23:35] (CR) Klausman: "The original change was broken and due to pcc not actually running P5 tests even when told to do so. I reverted that change, which makes m" [puppet] - https://gerrit.wikimedia.org/r/1050355 (owner: Klausman) [13:23:37] (CR) Urbanecm: [C:+2] CommunityConfiguration: Log info and higher [mediawiki-config] - https://gerrit.wikimedia.org/r/1048419 (owner: Urbanecm) [13:24:15] (Merged) jenkins-bot: Enable local uploads for Gilaki Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1042430 (https://phabricator.wikimedia.org/T364673) (owner: NMW03) [13:24:21] (Merged) jenkins-bot: [noop] Remove $wgRedirectScript, not used since MediaWiki 1.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1048855 (owner: Gergő Tisza) [13:24:55] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1038742|[CheckUser] Stop writing old for event tables migration on all wikis (T360685)]], [[gerrit:1049970|testwiki: use shellbox-video for scaling video (T356241)]], [[gerrit:1049886|Add VK namespace alias to Azerbaijani Wikibooks (T368237)]] (duration: 16m 48s) [13:25:01] deployed [13:25:01] (Merged) jenkins-bot: CommunityConfiguration: Log info and higher [mediawiki-config] - https://gerrit.wikimedia.org/r/1048419 (owner: Urbanecm) [13:25:03] T360685: Stop writing old for event table migration on 
WMF wikis - https://phabricator.wikimedia.org/T360685 [13:25:04] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241 [13:25:04] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1050361 (https://phabricator.wikimedia.org/T365718) (owner: Fabfur) [13:25:04] T368237: Add "VK" namespace alias to Azerbaijani Wikibooks. - https://phabricator.wikimedia.org/T368237 [13:25:27] testing [13:25:36] hnowlan: please let me know if it results in issues :) [13:25:54] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1042430|Enable local uploads for Gilaki Wikipedia (T364673)]], [[gerrit:1048855|[noop] Remove $wgRedirectScript, not used since MediaWiki 1.22]], [[gerrit:1048419|CommunityConfiguration: Log info and higher]] [13:25:59] T364673: Allow local uploads on Gilaki Wikipedia - https://phabricator.wikimedia.org/T364673 [13:26:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:26:55] Thanks! [13:27:39] np [13:28:35] !log urbanecm@deploy1002 urbanecm, tgr, nmw03: Backport for [[gerrit:1042430|Enable local uploads for Gilaki Wikipedia (T364673)]], [[gerrit:1048855|[noop] Remove $wgRedirectScript, not used since MediaWiki 1.22]], [[gerrit:1048419|CommunityConfiguration: Log info and higher]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:28:49] testing [13:28:51] urbanecm: looks good, thank you for the deploy! [13:28:59] SRE, Incident Tooling: wikimediastatus.net help popups are unreadable - https://phabricator.wikimedia.org/T327201#9930470 (CDanis) I think the CSS issue with the scrollbars appearing has been fixed, the tooltip boxes should be sized to fit the content now. I've tested on Linux Chromium stable, and Safar... 
[13:29:03] any time! thanks for the confirmation [13:29:07] Nemoralis: let me know how it looks like :) [13:29:22] tgr|away: fyi ^^, but your patch doesn't appear to be testable anyway [13:29:34] LGTM [13:30:01] I can upload files: https://glk.wikipedia.org/wiki/%D8%AE%D8%A7%D8%B5:%D8%A8%D8%A7%D8%B1%DA%AF%D8%B0%D8%A7%D8%B1%DB%8C_%D9%BE%D8%B1%D9%88%D9%86%D8%AF%D9%87 [13:30:02] and user permissions are OK [13:30:02] https://glk.wikipedia.org/wiki/%D8%AE%D8%A7%D8%B5:%D8%A7%D8%AE%D8%AA%DB%8C%D8%A7%D8%B1%D8%A7%D8%AA_%DA%AF%D8%B1%D9%88%D9%87%E2%80%8C%D9%87%D8%A7%DB%8C_%DA%A9%D8%A7%D8%B1%D8%A8%D8%B1%DB%8C [13:31:09] yay! :) [13:31:11] !log urbanecm@deploy1002 urbanecm, tgr, nmw03: Continuing with sync [13:31:13] proceeding [13:31:21] (03PS2) 10Urbanecm: ptwiki: Enable CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050309 (https://phabricator.wikimedia.org/T368310) [13:31:24] (03CR) 10Urbanecm: [C:03+2] ptwiki: Enable CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050309 (https://phabricator.wikimedia.org/T368310) (owner: 10Urbanecm) [13:31:40] urbanecm: I checked some redirect page just in case, and it still works [13:31:49] great! [13:32:04] (03Merged) 10jenkins-bot: ptwiki: Enable CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050309 (https://phabricator.wikimedia.org/T368310) (owner: 10Urbanecm) [13:32:59] (03CR) 10BBlack: [C:03+1] "Thanks! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1050275 (https://phabricator.wikimedia.org/T368557) (owner: 10Vgutierrez) [13:33:42] (03CR) 10Vgutierrez: [C:03+2] haproxy,varnish: Handle internal only HTTP headers on HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1050275 (https://phabricator.wikimedia.org/T368557) (owner: 10Vgutierrez) [13:35:35] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[13:36:16] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1042430|Enable local uploads for Gilaki Wikipedia (T364673)]], [[gerrit:1048855|[noop] Remove $wgRedirectScript, not used since MediaWiki 1.22]], [[gerrit:1048419|CommunityConfiguration: Log info and higher]] (duration: 10m 22s) [13:36:21] T364673: Allow local uploads on Gilaki Wikipedia - https://phabricator.wikimedia.org/T364673 [13:36:27] Nemoralis: tgr|away: should be live [13:36:53] Anything else to deploy? [13:36:56] thanks! [13:36:57] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 479, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:36:57] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 559, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:37:05] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1050309|ptwiki: Enable CommunityConfiguration (T368310)]] [13:37:10] T368310: CommunityConfiguration: Release extension to Portuguese Wikipedia (ptwiki) - https://phabricator.wikimedia.org/T368310 [13:37:45] Reedy: i still have one patch in the window! :) [13:38:14] urbanecm: Any chance you could push https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1043043 out too? 
:) [13:38:21] Can just go straight live [13:38:26] sure [13:38:30] (03PS3) 10Reedy: CommonSettings: Mark REL1_42 as stable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043043 (https://phabricator.wikimedia.org/T359850) [13:38:33] thanks [13:38:35] (03CR) 10Urbanecm: [C:03+2] CommonSettings: Mark REL1_42 as stable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043043 (https://phabricator.wikimedia.org/T359850) (owner: 10Reedy) [13:38:53] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host deploy1003 [13:39:20] (03Merged) 10jenkins-bot: CommonSettings: Mark REL1_42 as stable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043043 (https://phabricator.wikimedia.org/T359850) (owner: 10Reedy) [13:39:40] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:1050309|ptwiki: Enable CommunityConfiguration (T368310)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:40:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host deploy1003 [13:40:48] 06SRE, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#9930538 (10Reedy) Trying to load https://lists.wikimedia.org/postorius/lists/mediawiki-announce.lists.wikimedia.org/ is still slow :( [13:40:59] !log Run `mwscript extensions/GrowthExperiments/maintenance/migrateCommunityConfig.php --wiki=ptwiki --force` via mwdebug1001 (T368310) [13:41:02] !log urbanecm@deploy1002 urbanecm: Continuing with sync [13:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:52] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirtlocal1001.eqiad.wmnet with OS bookworm [13:45:07] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1050076 (https://phabricator.wikimedia.org/T367517) (owner: 10JHathaway) [13:46:03] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1050309|ptwiki: Enable CommunityConfiguration (T368310)]] (duration: 08m 58s) [13:46:05] (03CR) 10Jelto: [V:03+1 C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1050076 (https://phabricator.wikimedia.org/T367517) (owner: 10JHathaway) [13:46:09] T368310: CommunityConfiguration: Release extension to Portuguese Wikipedia (ptwiki) - https://phabricator.wikimedia.org/T368310 [13:46:46] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1043043|CommonSettings: Mark REL1_42 as stable (T359850)]] [13:46:51] T359850: Mark REL1_42 in ExtensionDistributor as a stable release - https://phabricator.wikimedia.org/T359850 [13:47:19] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: memory errors for ml-serve2007.codfw.wmnet - https://phabricator.wikimedia.org/T366688#9930554 (10Jhancock.wm) Swapped A1 and A2 to see if error recurs/moves [13:48:27] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:50:21] PROBLEM - Check whether ferm is active by checking the default input chain on wikikube-worker1004 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:51:43] (03PS13) 10Cathal Mooney: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) [13:53:21] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:54:11] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:54:39] 06SRE, 10Incident Tooling: wikimediastatus.net help popups are mobile-unfriendly and keyboard-inaccessible - https://phabricator.wikimedia.org/T327201#9930569 (10CDanis) p:05Medium→03Low [13:54:56] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1043043|CommonSettings: Mark REL1_42 as stable (T359850)]] (duration: 08m 10s) [13:55:05] T359850: Mark REL1_42 in ExtensionDistributor as a stable release - https://phabricator.wikimedia.org/T359850 [13:55:06] Reedy: done :) [13:55:13] cheers :D [13:55:22] anything else? 
:)) [13:56:52] (03CR) 10Cathal Mooney: Add DSCP marking options to current firewall classes (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [13:57:46] (03PS1) 10Kamila Součková: benthos/mw_accesslog_metrics: more batching [puppet] - 10https://gerrit.wikimedia.org/r/1050367 (https://phabricator.wikimedia.org/T367076) [13:58:18] (03CR) 10Kamila Součková: [C:04-1] "do not merge yet, I want to try something else first" [puppet] - 10https://gerrit.wikimedia.org/r/1050367 (https://phabricator.wikimedia.org/T367076) (owner: 10Kamila Součková) [14:04:26] (03PS1) 10JMeybohm: httpbb: Real for push has changed in nginx [puppet] - 10https://gerrit.wikimedia.org/r/1050371 (https://phabricator.wikimedia.org/T332016) [14:07:06] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9930627 (10Jhancock.wm) @Marostegui was there a preference for 1G or 10G on these servers? 
[14:08:46] (03PS3) 10Fabfur: benthos:cache: using tcp proto for syslog messages from haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1050361 (https://phabricator.wikimedia.org/T365718) [14:08:56] (03PS1) 10Kamila Součková: benthos/mw_accesslog_metrics: test for T367076 [puppet] - 10https://gerrit.wikimedia.org/r/1050373 (https://phabricator.wikimedia.org/T367076) [14:10:02] (03PS2) 10JMeybohm: httpbb: Auth realm for pushing to docker registry has changed [puppet] - 10https://gerrit.wikimedia.org/r/1050371 (https://phabricator.wikimedia.org/T332016) [14:11:47] (03CR) 10JHathaway: [C:03+2] vrts: query for inbound mail servers [puppet] - 10https://gerrit.wikimedia.org/r/1050076 (https://phabricator.wikimedia.org/T367517) (owner: 10JHathaway) [14:12:16] !log dcaro@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1007.eqiad.wmnet with OS bullseye [14:14:01] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1050361 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [14:14:42] (03CR) 10Kamila Součková: [C:03+2] benthos/mw_accesslog_metrics: test for T367076 [puppet] - 10https://gerrit.wikimedia.org/r/1050373 (https://phabricator.wikimedia.org/T367076) (owner: 10Kamila Součková) [14:15:06] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1015.eqiad.wmnet,service=s4 [14:15:42] 10ops-eqiad, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368622 (10phaultfinder) 03NEW [14:15:43] 10ops-eqiad, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368621 (10phaultfinder) 03NEW [14:16:38] (03PS1) 10Hnowlan: shellbox-video: increase replicas, namespace resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050375 (https://phabricator.wikimedia.org/T356241) [14:16:44] (03CR) 10Muehlenhoff: [C:03+2] Add ripgrep and fd-find to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/1050257 (owner: 10Muehlenhoff) [14:17:09] PROBLEM - IPv6 ping to 
eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 296 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:19:17] (03PS3) 10Hashar: Add image-diff plugin [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1050334 (https://phabricator.wikimedia.org/T341291) [14:20:21] RECOVERY - Check whether ferm is active by checking the default input chain on wikikube-worker1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:22:09] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 51 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:24:47] (03PS1) 10Ayounsi: Homer: fix Netbox 4 breaking changes [software/homer] - 10https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275) [14:25:27] (03PS1) 10Hnowlan: group0, group1: enable shellbox-video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050378 (https://phabricator.wikimedia.org/T356241) [14:25:58] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, one final nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [14:26:34] (03CR) 10CI reject: [V:04-1] Homer: fix Netbox 4 breaking changes [software/homer] - 10https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [14:26:42] (03PS1) 10Ayounsi: Homer wmf-netbox: fix Netbox 4 breaking changes [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1050379 (https://phabricator.wikimedia.org/T336275) [14:27:42] 06SRE, 06SRE-OnFire, 10Incident Tooling, 10Sustainability (Incident Followup): Grant slightly broader 
access to Klaxon - https://phabricator.wikimedia.org/T343377#9930702 (10CDanis) @Urbanecm what's the status of the stewards VM / onboarding tool? [14:28:33] (03PS1) 10Kamila Součková: Revert "benthos/mw_accesslog_metrics: test for T367076" [puppet] - 10https://gerrit.wikimedia.org/r/1050380 [14:31:10] (03CR) 10Clément Goubert: [C:03+1] httpbb: Auth realm for pushing to docker registry has changed [puppet] - 10https://gerrit.wikimedia.org/r/1050371 (https://phabricator.wikimedia.org/T332016) (owner: 10JMeybohm) [14:31:28] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 41 probes of 796 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:31:46] 06SRE, 06SRE-OnFire, 10Incident Tooling, 10Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377#9930728 (10Urbanecm) >>! In T343377#9930702, @CDanis wrote: > @Urbanecm what's the status of the stewards VM / onboarding tool? The VM is up an... 
[14:32:04] (03CR) 10Kamila Součková: [C:03+2] Revert "benthos/mw_accesslog_metrics: test for T367076" [puppet] - 10https://gerrit.wikimedia.org/r/1050380 (owner: 10Kamila Součková) [14:32:40] (03PS2) 10Kamila Součková: benthos/mw_accesslog_metrics: more batching [puppet] - 10https://gerrit.wikimedia.org/r/1050367 (https://phabricator.wikimedia.org/T367076) [14:32:44] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 27 probes of 796 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:32:47] (03CR) 10Kamila Součková: benthos/mw_accesslog_metrics: more batching [puppet] - 10https://gerrit.wikimedia.org/r/1050367 (https://phabricator.wikimedia.org/T367076) (owner: 10Kamila Součková) [14:32:59] (03PS2) 10Ayounsi: Homer: fix Netbox 4 breaking changes [software/homer] - 10https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275) [14:34:07] 06SRE, 06SRE-OnFire, 10Incident Tooling, 10Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377#9930734 (10CDanis) That sounds good to me, maybe check in with @SLyngshede-WMF about the LDAP sync (assuming he's taken that on while @MoritzMue... 
[14:34:51] (03CR) 10CI reject: [V:04-1] Homer: fix Netbox 4 breaking changes [software/homer] - 10https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [14:36:04] (03CR) 10Vgutierrez: [C:03+1] P:cache::haproxy: update systemd template for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/1038884 (owner: 10Ssingh) [14:36:21] (03CR) 10Kamila Součková: [C:03+2] benthos/mw_accesslog_metrics: more batching [puppet] - 10https://gerrit.wikimedia.org/r/1050367 (https://phabricator.wikimedia.org/T367076) (owner: 10Kamila Součková) [14:36:48] jouncebot nowandnext [14:36:48] No deployments scheduled for the next 0 hour(s) and 23 minute(s) [14:36:48] In 0 hour(s) and 23 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240627T1500) [14:37:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T365988 - depool es1037', diff saved to https://phabricator.wikimedia.org/P65531 and previous config saved to /var/cache/conftool/dbconfig/20240627-143741-arnaudb.json [14:37:48] T365988: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988 [14:38:02] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on es1037.eqiad.wmnet with reason: T365988 [14:38:15] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1037.eqiad.wmnet with reason: T365988 [14:39:19] (03CR) 10Hashar: "recheck after making the job to capture `*.js` files ( https://gerrit.wikimedia.org/r/c/integration/config/+/1050384 )" [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1050334 (https://phabricator.wikimedia.org/T341291) (owner: 10Hashar) [14:40:39] (03CR) 10Vgutierrez: [C:03+1] "hmm it looks like we could remove cp4038.yaml entirely? 
besides that, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1050361 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [14:41:03] (03PS1) 10Clément Goubert: Reimage 5 appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1050386 (https://phabricator.wikimedia.org/T351074) [14:42:07] (03PS4) 10Fabfur: benthos:cache: using tcp proto for syslog messages from haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1050361 (https://phabricator.wikimedia.org/T365718) [14:43:31] (03CR) 10Kamila Součková: [C:03+1] Reimage 5 appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1050386 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [14:43:50] i'll be doing a minor phabricator update shortly. [14:43:51] 06SRE, 06SRE-OnFire, 10Incident Tooling, 10Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377#9930763 (10Urbanecm) @SLyngshede-WMF curious to hear what possibilities do we have for automatically granting LDAP access from `stewards1001`? W... 
[14:43:54] !log dcaro@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1007.eqiad.wmnet with reason: host reimage [14:44:08] (03CR) 10Clément Goubert: [C:03+2] Reimage 5 appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1050386 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [14:44:51] (03CR) 10Elukey: mcrouter: upgrade to Bookworm (033 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049587 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [14:46:13] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1359 to wikikube-worker1022 [14:46:19] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [14:46:22] (03CR) 10Fabfur: [C:03+2] benthos:cache: using tcp proto for syslog messages from haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1050361 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [14:46:25] (03PS7) 10Elukey: prometheus-exporters: upgrade mcrouter and statsd to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049588 (https://phabricator.wikimedia.org/T368366) [14:46:25] (03PS7) 10Elukey: service-checker: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049590 (https://phabricator.wikimedia.org/T368366) [14:46:25] (03PS7) 10Elukey: nutcracker: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049591 (https://phabricator.wikimedia.org/T368366) [14:46:26] (03PS7) 10Elukey: echoserver: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049828 (https://phabricator.wikimedia.org/T368366) [14:46:29] (03PS8) 10Elukey: cfssl-issuer: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049838 (https://phabricator.wikimedia.org/T368366) [14:46:33] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:50:00 on lsw1-e7-eqiad.mgmt with 
reason: prep JunOS upgrade lsw1-e7-eqiad [14:46:33] (03PS7) 10Elukey: mcrouter: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049587 (https://phabricator.wikimedia.org/T368366) [14:46:43] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1007.eqiad.wmnet with reason: host reimage [14:46:46] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:50:00 on lsw1-e7-eqiad.mgmt with reason: prep JunOS upgrade lsw1-e7-eqiad [14:46:47] (03CR) 10Elukey: [V:03+2 C:03+2] prometheus-exporters: upgrade mcrouter and statsd to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049588 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [14:46:57] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9930774 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=66810f76-0e2d-43f3-8c96-bbfe4e6a7aee) set by cmooney... 
[14:46:59] (03CR) 10Elukey: [V:03+2 C:03+2] service-checker: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049590 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [14:47:08] (03CR) 10Elukey: [V:03+2 C:03+2] nutcracker: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049591 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [14:47:18] (03CR) 10Elukey: [V:03+2 C:03+2] echoserver: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049828 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [14:47:30] (03CR) 10Elukey: [V:03+2 C:03+2] cfssl-issuer: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049838 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [14:48:13] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 10Observability-Logging: decommission logstash200[123] - https://phabricator.wikimedia.org/T368327#9930781 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:48:59] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1359 to wikikube-worker1022 - cgoubert@cumin1002" [14:49:30] (03CR) 10JMeybohm: [C:03+2] httpbb: Auth realm for pushing to docker registry has changed [puppet] - 10https://gerrit.wikimedia.org/r/1050371 (https://phabricator.wikimedia.org/T332016) (owner: 10JMeybohm) [14:50:12] (03CR) 10Hashar: [C:03+2] Add image-diff plugin [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1050334 (https://phabricator.wikimedia.org/T341291) (owner: 10Hashar) [14:50:25] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1359 to wikikube-worker1022 - cgoubert@cumin1002" [14:50:25] !log cgoubert@cumin1002 END (PASS) - 
Cookbook sre.dns.netbox (exit_code=0) [14:50:25] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1022 [14:50:28] !log brennen@deploy1002 Started deploy [phabricator/deployment@0df351e]: test deploy phab2002 [14:51:03] !log brennen@deploy1002 Finished deploy [phabricator/deployment@0df351e]: test deploy phab2002 (duration: 00m 34s) [14:51:31] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1022 [14:51:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1359 to wikikube-worker1022 [14:51:40] (03PS8) 10Elukey: mcrouter: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049587 (https://phabricator.wikimedia.org/T368366) [14:51:40] (03PS1) 10Elukey: prometheus-exporters: fix changelog typo [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1050394 [14:52:00] (03CR) 10Elukey: [V:03+2 C:03+2] prometheus-exporters: fix changelog typo [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1050394 (owner: 10Elukey) [14:52:00] !log brennen@deploy1002 Started deploy [phabricator/deployment@0df351e]: deploy phab1004 for minor update [14:52:06] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1365 to wikikube-worker1023 [14:52:11] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [14:52:33] !log brennen@deploy1002 Finished deploy [phabricator/deployment@0df351e]: deploy phab1004 for minor update (duration: 00m 32s) [14:52:55] (03PS1) 10Hashar: Add image-diff plugin [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1050395 (https://phabricator.wikimedia.org/T341291) [14:53:07] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host deploy1003.eqiad.wmnet with OS bullseye [14:53:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: 
Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416#9930805 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host deploy1003.eqiad.wmnet with OS bullseye executed with error... [14:53:22] (03CR) 10CI reject: [V:04-1] Add image-diff plugin [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1050395 (https://phabricator.wikimedia.org/T341291) (owner: 10Hashar) [14:54:36] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1365 to wikikube-worker1023 - cgoubert@cumin1002" [14:55:36] (03CR) 10BCornwall: [C:03+2] cp5021: update hieradata for dual NVMe disks configuration [puppet] - 10https://gerrit.wikimedia.org/r/1049172 (https://phabricator.wikimedia.org/T365763) (owner: 10Ssingh) [14:55:59] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1365 to wikikube-worker1023 - cgoubert@cumin1002" [14:55:59] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:56:03] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1023 [14:56:40] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@483e8c3] (eqiad): Bump kartotherian src to latest master [14:57:20] (03PS1) 10David Caro: cloudcephosd1007: update interfaces after reimage [puppet] - 10https://gerrit.wikimedia.org/r/1050397 (https://phabricator.wikimedia.org/T309789) [14:57:33] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on lsw1-e7-eqiad,lsw1-e7-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: JunOS upgrade lsw1-e7-eqiad [14:57:44] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5021.eqsin.wmnet with OS bullseye [14:57:49] (03CR) 10David Caro: [C:03+2] cloudcephosd1007: update 
interfaces after reimage [puppet] - 10https://gerrit.wikimedia.org/r/1050397 (https://phabricator.wikimedia.org/T309789) (owner: 10David Caro) [14:57:49] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on lsw1-e7-eqiad,lsw1-e7-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: JunOS upgrade lsw1-e7-eqiad [14:57:50] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1023 [14:57:54] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9930838 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5021.eqsin.wmnet with OS b... [14:57:57] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9930839 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2863d158-d71c-4317-a811-4dd3cb8e6e72) set by cmooney... 
[14:57:58] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1365 to wikikube-worker1023 [14:58:24] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on an-worker[1163-1165].eqiad.wmnet,es1037.eqiad.wmnet,ms-be1078.eqiad.wmnet with reason: JunOS upgrade lsw1-e7-eqiad [14:58:41] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on an-worker[1163-1165].eqiad.wmnet,es1037.eqiad.wmnet,ms-be1078.eqiad.wmnet with reason: JunOS upgrade lsw1-e7-eqiad [14:58:49] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9930845 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=bd008f08-7b85-4b69-ba4e-5d84a9307d79) set by cmooney... [14:59:50] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@483e8c3] (eqiad): Bump kartotherian src to latest master (duration: 03m 10s) [15:00:03] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@483e8c3] (codfw): Bump kartotherian src to latest master [15:00:04] jeena and jnuche: May I have your attention please! Train log triage. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240627T1500) [15:00:05] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw1366.eqiad.wmnet [15:00:13] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mw1366.eqiad.wmnet [15:00:20] !log rebooting lsw1-e7-eqiad to upgrade JunOS on switch T365988 [15:00:32] (03PS2) 10Hashar: Add image-diff JavaScript plugin [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1050395 (https://phabricator.wikimedia.org/T341291) [15:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:38] T365988: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988 [15:01:05] (03Merged) 10jenkins-bot: Add image-diff plugin [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1050334 (https://phabricator.wikimedia.org/T341291) (owner: 10Hashar) [15:01:07] (03PS1) 10Arnaudb: bashrc: adds alias for ripgrep [puppet] - 10https://gerrit.wikimedia.org/r/1050398 [15:01:11] (03CR) 10Ssingh: [V:03+1] "Thanks for the review! I will merge this next week." 
[puppet] - 10https://gerrit.wikimedia.org/r/1038884 (owner: 10Ssingh) [15:02:51] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@483e8c3] (codfw): Bump kartotherian src to latest master (duration: 02m 49s) [15:03:10] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1007.eqiad.wmnet with OS bullseye [15:03:13] (03CR) 10Hashar: [C:03+2] Add image-diff JavaScript plugin [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1050395 (https://phabricator.wikimedia.org/T341291) (owner: 10Hashar) [15:03:47] (03Merged) 10jenkins-bot: Add image-diff JavaScript plugin [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1050395 (https://phabricator.wikimedia.org/T341291) (owner: 10Hashar) [15:04:16] (03CR) 10Arnaudb: mariadb: add monitoring on io pressure for mariadb hosts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1049196 (https://phabricator.wikimedia.org/T367281) (owner: 10Arnaudb) [15:04:32] !log hashar@deploy1002 Started deploy [gerrit/gerrit@9652bc3]: Add image-diff JavaScript plugin - T341291 [15:04:38] T341291: Install gerrit image-diff plugin - https://phabricator.wikimedia.org/T341291 [15:04:40] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@9652bc3]: Add image-diff JavaScript plugin - T341291 (duration: 00m 07s) [15:08:00] (03CR) 10JMeybohm: [C:03+1] opentelemetry: update k8s API IP addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048498 (owner: 10Kamila Součková) [15:08:32] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1404 to wikikube-worker1026 [15:08:37] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [15:08:49] (03PS1) 10Hashar: Revert "Add image-diff JavaScript plugin" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1050399 (https://phabricator.wikimedia.org/T341291) [15:09:00] (03CR) 10Hashar: [C:03+2] Revert "Add image-diff JavaScript plugin" 
[software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1050399 (https://phabricator.wikimedia.org/T341291) (owner: 10Hashar) [15:09:31] (03Merged) 10jenkins-bot: Revert "Add image-diff JavaScript plugin" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1050399 (https://phabricator.wikimedia.org/T341291) (owner: 10Hashar) [15:09:44] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw1373.eqiad.wmnet [15:09:55] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mw1373.eqiad.wmnet [15:09:55] !log hashar@deploy1002 Started deploy [gerrit/gerrit@8c94fee]: Revert "Add image-diff JavaScript plugin" [15:10:02] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@8c94fee]: Revert "Add image-diff JavaScript plugin" (duration: 00m 07s) [15:10:08] !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [15:10:17] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [15:11:36] 06SRE: Download of Azure cloud ranges for requestctl is broken - https://phabricator.wikimedia.org/T367269#9930895 (10Joe) 05Open→03Resolved Uh this task was solved on that day, not sure why I forgot to close it. Sorry Andre if this messes with your UBN stats :) [15:12:46] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1404 to wikikube-worker1026 - cgoubert@cumin1002" [15:13:19] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368622#9930913 (10VRiley-WMF) a:03VRiley-WMF [15:13:34] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368622#9930916 (10VRiley-WMF) Reseated cabled. 
It is now recognized [15:13:42] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368622#9930918 (10VRiley-WMF) 05Open→03Resolved [15:14:06] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1404 to wikikube-worker1026 - cgoubert@cumin1002" [15:14:06] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:14:06] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1026 [15:15:32] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:15:34] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368621#9930933 (10VRiley-WMF) a:03VRiley-WMF [15:15:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1026 [15:15:47] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1404 to wikikube-worker1026 [15:16:00] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1366 to wikikube-worker1024 [15:16:01] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368621#9930939 (10VRiley-WMF) 05Open→03Resolved Reseated cable and it came back on. 
[15:16:05] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [15:17:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06Infrastructure-Foundations: Decommission ganeti1019 - https://phabricator.wikimedia.org/T368597#9930942 (10VRiley-WMF) a:05Jclark-ctr→03VRiley-WMF [15:17:15] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1007.eqiad.wmnet [15:18:41] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1366 to wikikube-worker1024 - cgoubert@cumin1002" [15:19:02] !log T368451 mwmaint1002: Ran `mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki 'Agustín_Antonio_Cardozo' 'Agustín_Cardozo_Cabrera’ [15:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:08] T368451: Unblock stuck global rename of Agustín Cardozo Cabrera - https://phabricator.wikimedia.org/T368451 [15:19:40] (03PS1) 10Fabfur: benthos:cache: apparently linter has lost the ability to skip envvar [puppet] - 10https://gerrit.wikimedia.org/r/1050401 (https://phabricator.wikimedia.org/T365718) [15:20:00] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1366 to wikikube-worker1024 - cgoubert@cumin1002" [15:20:00] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:20:00] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1024 [15:20:11] 10ops-codfw, 06SRE, 06DC-Ops: decommission cloudvirt200[1,2,3]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T368536#9930964 (10Jhancock.wm) 05Open→03Resolved [15:20:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06Infrastructure-Foundations: Decommission ganeti1019 - https://phabricator.wikimedia.org/T368597#9930984 
(10VRiley-WMF) [15:20:49] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9930989 (10cmooney) Upgrade completed, all looking good network-wise. [15:21:04] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1024 [15:21:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 5%: post T365988 repool', diff saved to https://phabricator.wikimedia.org/P65532 and previous config saved to /var/cache/conftool/dbconfig/20240627-152107-arnaudb.json [15:21:13] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1366 to wikikube-worker1024 [15:21:14] T365988: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988 [15:21:34] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06Infrastructure-Foundations: Decommission ganeti1019 - https://phabricator.wikimedia.org/T368597#9931005 (10VRiley-WMF) 05Open→03Resolved Removed the server and ran the decom script. 
[15:21:35] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1373 to wikikube-worker1025 [15:21:41] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [15:22:51] (03PS1) 10Ahmon Dancy: profile::gitlab::runner: Add defaults for some arguments [puppet] - 10https://gerrit.wikimedia.org/r/1050402 (https://phabricator.wikimedia.org/T367352) [15:23:27] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1007.eqiad.wmnet [15:23:32] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1050401 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [15:24:10] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1373 to wikikube-worker1025 - cgoubert@cumin1002" [15:25:00] !log T367901 mwmaint1002: Ran `mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=rowiki --logwiki=metawiki 'Rui_Filipe_Fernandes' '44_Gabriel’` [15:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:09] T367901: Unblock stuck global rename of "Rui Filipe Fernandes" to "44 Gabriel" - https://phabricator.wikimedia.org/T367901 [15:25:37] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1022.eqiad.wmnet on all recursors [15:25:41] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1022.eqiad.wmnet on all recursors [15:25:55] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1022.eqiad.wmnet with OS bullseye [15:26:30] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1023.eqiad.wmnet on all recursors [15:26:34] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1023.eqiad.wmnet on all recursors [15:26:47] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host 
wikikube-worker1023.eqiad.wmnet with OS bullseye [15:27:19] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1024.eqiad.wmnet on all recursors [15:27:22] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1024.eqiad.wmnet on all recursors [15:27:36] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1024.eqiad.wmnet with OS bullseye [15:27:47] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1373 to wikikube-worker1025 - cgoubert@cumin1002" [15:27:48] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:27:48] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1025 [15:28:22] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1026.eqiad.wmnet on all recursors [15:28:25] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1026.eqiad.wmnet on all recursors [15:28:36] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1026.eqiad.wmnet with OS bullseye [15:28:49] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1025 [15:28:53] (03PS2) 10Ahmon Dancy: profile::gitlab::runner: Add defaults for some arguments [puppet] - 10https://gerrit.wikimedia.org/r/1050402 (https://phabricator.wikimedia.org/T367352) [15:28:59] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1373 to wikikube-worker1025 [15:29:18] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1025.eqiad.wmnet on all recursors [15:29:21] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1025.eqiad.wmnet on all recursors [15:29:26] (03CR) 10Fabfur: 
[C:03+2] benthos:cache: apparently linter has lost the ability to skip envvar [puppet] - 10https://gerrit.wikimedia.org/r/1050401 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [15:29:32] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1025.eqiad.wmnet with OS bullseye [15:29:51] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 41 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:32:11] (03CR) 10Ssingh: [C:03+1] benthos:cache: apparently linter has lost the ability to skip envvar [puppet] - 10https://gerrit.wikimedia.org/r/1050401 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [15:32:27] 06SRE, 06SRE-OnFire, 10Incident Tooling, 10Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377#9931101 (10MoritzMuehlenhoff) One thing that we could do is to - Write a script which parses the complete stewards list from $DATASOURCE and r... [15:34:45] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 29 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:35:23] 06SRE, 06SRE-OnFire, 10Incident Tooling, 10Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377#9931116 (10MoritzMuehlenhoff) Using ldap-maint1001 has the benefit that it already does r/w changes to the r/w slapd servers. Currently we don't... 
[15:36:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: post T365988 repool', diff saved to https://phabricator.wikimedia.org/P65533 and previous config saved to /var/cache/conftool/dbconfig/20240627-153613-arnaudb.json [15:36:19] T365988: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988 [15:36:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:37:36] (03Abandoned) 10BCornwall: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1040335 (owner: 10Ncmonitor) [15:38:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:39:10] eventgate? [15:39:19] looks like it [15:40:09] quite a few unhappy ferms atm [15:40:10] bunch of ferm services not correct in codfw [15:40:30] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1022.eqiad.wmnet with reason: host reimage [15:40:53] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9931141 (10Eevans) >>! In T365988#9930989, @cmooney wrote: > Upgrade completed, all looking good network-wise. Thanks @cmooney;... 
[15:41:16] claime: I can restart them [15:41:16] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1023.eqiad.wmnet with reason: host reimage [15:41:21] hnowlan: yeah go ahead [15:41:33] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1002.eqiad.wmnet with OS bookworm [15:41:43] (03CR) 10Andrew Bogott: [C:03+2] Move cloudvirtlocal1002 to ovs [puppet] - 10https://gerrit.wikimedia.org/r/1050357 (https://phabricator.wikimedia.org/T364457) (owner: 10Andrew Bogott) [15:42:08] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1024.eqiad.wmnet with reason: host reimage [15:42:28] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1026.eqiad.wmnet with reason: host reimage [15:43:00] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1022.eqiad.wmnet with reason: host reimage [15:43:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:43:32] !log restarted ferm on 8 failing k8s workers [15:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:06] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1025.eqiad.wmnet with reason: host reimage [15:45:02] (03PS1) 10Bartosz Dziewoński: FixTrailingWhitespaceIds: Don't crash on complex conflicts [extensions/DiscussionTools] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1050406 (https://phabricator.wikimedia.org/T356196) [15:46:02] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
wikikube-worker1024.eqiad.wmnet with reason: host reimage [15:46:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 01 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/DiscussionTools] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1050406 (https://phabricator.wikimedia.org/T356196) (owner: 10Bartosz Dziewoński) [15:47:30] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 196547272 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:48:20] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1023.eqiad.wmnet with reason: host reimage [15:49:31] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 33240 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:51:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 25%: post T365988 repool', diff saved to https://phabricator.wikimedia.org/P65534 and previous config saved to /var/cache/conftool/dbconfig/20240627-155118-arnaudb.json [15:51:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:51:29] T365988: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988 [15:51:58] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1025.eqiad.wmnet with reason: host reimage [15:55:11] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1026.eqiad.wmnet with reason: host reimage 
[15:56:21] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dbproxy2005 [15:56:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbproxy2005 [15:56:24] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5021.eqsin.wmnet with OS bullseye [15:56:37] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9931221 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5021.eqsin.wmnet with OS bulls... [15:56:47] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5021.eqsin.wmnet with OS bullseye [15:56:59] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9931222 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5021.eqsin.wmnet with OS b... [15:58:01] PROBLEM - Host mw1359 is DOWN: PING CRITICAL - Packet loss = 100% [15:58:17] (03Abandoned) 10BCornwall: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1039848 (owner: 10Ncmonitor) [15:58:42] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirtlocal1002.eqiad.wmnet with reason: host reimage [15:59:40] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:00:05] jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240627T1600). 
[16:00:05] tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:21] o/ [16:00:29] RECOVERY - Host mw1359 is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [16:00:35] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T368566#9931236 (10Sharvaniharan) Hi @Dzahn I updated the email on that account [`Sharvaniharan` to be my work email. Thank you for looking into this :) [16:00:39] (03CR) 10Elukey: [C:03+1] cassandra: remove support for 2.x versions [puppet] - 10https://gerrit.wikimedia.org/r/1050041 (owner: 10Eevans) [16:00:42] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: sync [16:00:45] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: sync [16:01:43] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirtlocal1002.eqiad.wmnet with reason: host reimage [16:01:57] tgr|away: if you come back, happy to deploy your patch [16:02:51] jhathaway: present [16:03:03] my away logic is a bit broken [16:03:18] :), okay merging your patch [16:03:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1022.eqiad.wmnet with OS bullseye [16:03:42] (03CR) 10JHathaway: [C:03+2] [beta] Add rewrite rule for sso.wikimedia.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/1036230 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [16:05:12] tgr|away: merged [16:05:26] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1024.eqiad.wmnet with OS bullseye [16:06:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 50%: post T365988 repool', diff saved to https://phabricator.wikimedia.org/P65535 and previous config saved to 
/var/cache/conftool/dbconfig/20240627-160624-arnaudb.json [16:06:30] T365988: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988 [16:06:35] thx! [16:06:58] (03CR) 10Klausman: [V:03+1 C:03+2] httpbb/liftwing: Actually deploy split-out file from change 1049943 [puppet] - 10https://gerrit.wikimedia.org/r/1050328 (owner: 10Klausman) [16:07:40] jhathaway: Would you be willing to process https://gerrit.wikimedia.org/r/c/operations/puppet/+/1050402 ? [16:07:52] (unbreaks puppet for a bunch of gitlab runners) [16:07:55] eyes [16:09:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1023.eqiad.wmnet with OS bullseye [16:10:23] dancy: looks reasonable, does it need a code review from someone in sre-collab? [16:10:26] (03PS1) 10Klausman: Revert "httpbb/liftwing: Actually deploy split-out file from change 1049943" [puppet] - 10https://gerrit.wikimedia.org/r/1050413 [16:10:59] (03CR) 10Klausman: [C:03+2] httpbb/liftwing: Split up test definitions by k8s NS [puppet] - 10https://gerrit.wikimedia.org/r/1050355 (owner: 10Klausman) [16:11:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1025.eqiad.wmnet with OS bullseye [16:11:56] jhathaway: Jelto reviewed and merged the original change (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1049268). I think he'd be okay w/ this update (it is functionally equivalent to the pre-changes behavior). 
[16:12:21] good enough for me, merging [16:12:27] (03CR) 10JHathaway: [C:03+2] profile::gitlab::runner: Add defaults for some arguments [puppet] - 10https://gerrit.wikimedia.org/r/1050402 (https://phabricator.wikimedia.org/T367352) (owner: 10Ahmon Dancy) [16:12:32] 06SRE, 06Infrastructure-Foundations: Update pxelinux in tftpboot environment - https://phabricator.wikimedia.org/T367970#9931277 (10elukey) [16:12:33] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 121664296 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:12:42] Thanks! I'll run the puppet agent on one of the problem nodes. [16:13:33] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:13:45] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1026.eqiad.wmnet with OS bullseye [16:13:46] klausman: let me know when your puppet merge is done [16:13:58] it is now :) [16:14:03] woohoo, thanks [16:14:15] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:14:56] dancy: merged, test away [16:15:00] thx [16:17:50] jhathaway: Hmm.. behavior is unchanged (Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'profile::gitlab::runner::buildkitd_dockerfile_frontend_enabled' (file: /srv/puppet_code/environments/production/modules/profile/manifests/gitlab/runner.pp, line: 43) on node [16:17:51] runner-1029.gitlab-runners.eqiad1.wikimedia.cloud). 
[16:18:15] hrm [16:18:35] !log homer 'cr*eqiad*' commit 'T351074' [16:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:44] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [16:21:18] (03PS1) 10Catrope: Revert "beta: Disable Graph" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050414 [16:21:24] (03PS2) 10Catrope: Revert "beta: Disable Graph" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050414 [16:21:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: post T365988 repool', diff saved to https://phabricator.wikimedia.org/P65536 and previous config saved to /var/cache/conftool/dbconfig/20240627-162129-arnaudb.json [16:21:35] T365988: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988 [16:21:36] (03Abandoned) 10Klausman: Revert "httpbb/liftwing: Actually deploy split-out file from change 1049943" [puppet] - 10https://gerrit.wikimedia.org/r/1050413 (owner: 10Klausman) [16:22:39] (03PS3) 10Catrope: Revert "beta: Disable Graph" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050414 [16:22:43] jhathaway: It's working now! 
[16:22:51] (03CR) 10Catrope: [C:03+2] Revert "beta: Disable Graph" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050414 (owner: 10Catrope) [16:23:17] great [16:23:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050414 (owner: 10Catrope) [16:23:32] (03Merged) 10jenkins-bot: Revert "beta: Disable Graph" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050414 (owner: 10Catrope) [16:26:35] (03PS14) 10Cathal Mooney: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) [16:27:19] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirtlocal1002.eqiad.wmnet with OS bookworm [16:28:30] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9931318 (10dcaro) @CDanis I'm reimaging another osd node, so some more load is being applied, I'm not seeing any iss... 
[16:29:39] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5021.eqsin.wmnet with reason: host reimage [16:30:38] (03PS1) 10CDobbins: purged: set use_pkie to true for all sites [puppet] - 10https://gerrit.wikimedia.org/r/1050417 (https://phabricator.wikimedia.org/T360506) [16:30:39] (03PS1) 10Ahmon Dancy: buildkitd/gitlab-runner: Default allowed gateway sources to empty list [puppet] - 10https://gerrit.wikimedia.org/r/1050418 (https://phabricator.wikimedia.org/T367352) [16:31:09] (03PS2) 10CDobbins: purged: set use_pki to true for all sites [puppet] - 10https://gerrit.wikimedia.org/r/1050417 (https://phabricator.wikimedia.org/T360506) [16:31:27] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1050418 (https://phabricator.wikimedia.org/T367352) (owner: 10Ahmon Dancy) [16:32:39] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5021.eqsin.wmnet with reason: host reimage [16:34:19] 06SRE, 06Infrastructure-Foundations, 06serviceops: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy - https://phabricator.wikimedia.org/T354855#9931347 (10Clement_Goubert) 05Resolved→03Open p:05Medium→03High This issue is biting us again, the time between a pu... 
[16:34:37] (03CR) 10Ahmon Dancy: "PCC results: https://puppet-compiler.wmflabs.org/output/1050418/3846/" [puppet] - 10https://gerrit.wikimedia.org/r/1050418 (https://phabricator.wikimedia.org/T367352) (owner: 10Ahmon Dancy) [16:35:07] !log Pooling and uncordoning wikikube-worker1022.eqiad.wmnet,wikikube-worker1023.eqiad.wmnet,wikikube-worker1024.eqiad.wmnet,wikikube-worker1025.eqiad.wmnet,wikikube-worker1026.eqiad.wmnet - T351074 [16:35:08] jhathaway: One more fix is needed: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1050418 [16:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:13] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [16:35:16] !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(wikikube-worker1022.eqiad.wmnet|wikikube-worker1023.eqiad.wmnet|wikikube-worker1024.eqiad.wmnet|wikikube-worker1025.eqiad.wmnet|wikikube-worker1026.eqiad.wmnet),cluster=kubernetes,service=kubesvc [16:36:02] 10ops-eqiad, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T368639 (10Clement_Goubert) 03NEW [16:36:29] dancy: looking [16:36:32] (03CR) 10JHathaway: [C:03+2] buildkitd/gitlab-runner: Default allowed gateway sources to empty list [puppet] - 10https://gerrit.wikimedia.org/r/1050418 (https://phabricator.wikimedia.org/T367352) (owner: 10Ahmon Dancy) [16:36:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: post T365988 repool', diff saved to https://phabricator.wikimedia.org/P65537 and previous config saved to /var/cache/conftool/dbconfig/20240627-163635-arnaudb.json [16:36:41] T365988: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988 [16:36:52] (03PS1) 10Hashar: Handle image-diff external dependencies [software/gerrit] (wmf/stable-3.10) - 
10https://gerrit.wikimedia.org/r/1050420 (https://phabricator.wikimedia.org/T341291) [16:38:16] dancy: merged [16:38:18] (03CR) 10Dzahn: [C:03+2] admin: convert dmuthuri from ldap_only to analytics-privatedata, no shell [puppet] - 10https://gerrit.wikimedia.org/r/1050049 (https://phabricator.wikimedia.org/T367872) (owner: 10Dzahn) [16:38:37] (03CR) 10Cathal Mooney: Add DSCP marking options to current firewall classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [16:41:12] (03PS15) 10Cathal Mooney: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) [16:42:10] jhathaway: Everything's looking good now. Thanks for your help. [16:42:22] great! [16:42:45] (03PS1) 10Dreamy Jazz: Remove modifications of wgCheckUserLogAdditionalRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050424 (https://phabricator.wikimedia.org/T346022) [16:42:55] (03CR) 10CI reject: [V:04-1] Handle image-diff external dependencies [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1050420 (https://phabricator.wikimedia.org/T341291) (owner: 10Hashar) [16:42:56] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for DMburugu - https://phabricator.wikimedia.org/T367872#9931405 (10Dzahn) [16:43:14] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3103/c" [puppet] - 10https://gerrit.wikimedia.org/r/1050417 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [16:44:19] (03PS2) 10Hashar: Handle image-diff external dependencies [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1050420 (https://phabricator.wikimedia.org/T341291) [16:44:21] (03CR) 10Dzahn: [C:03+2] admin: convert kgraessle 
from ldap_only to analytics-privatedata, no shell [puppet] - 10https://gerrit.wikimedia.org/r/1050062 (https://phabricator.wikimedia.org/T367747) (owner: 10Dzahn) [16:44:41] (03PS3) 10Dzahn: admin: convert kgraessle from ldap_only to analytics-privatedata, no shell [puppet] - 10https://gerrit.wikimedia.org/r/1050062 (https://phabricator.wikimedia.org/T367747) [16:45:28] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for DMburugu - https://phabricator.wikimedia.org/T367872#9931412 (10Dzahn) 05In progress→03Resolved You have now been added to the group as requested. Feel free to try the Superset acc... [16:47:49] jouncebot: nowandnext [16:47:50] For the next 0 hour(s) and 12 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240627T1600) [16:47:50] In 0 hour(s) and 12 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240627T1700) [16:47:50] In 0 hour(s) and 12 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240627T1700) [16:49:13] (03CR) 10Dzahn: [V:03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1050062 (https://phabricator.wikimedia.org/T367747) (owner: 10Dzahn) [16:49:37] (03CR) 10Andrew Bogott: [C:03+2] Move cloudvirtlocal1003 to ovs [puppet] - 10https://gerrit.wikimedia.org/r/1050358 (https://phabricator.wikimedia.org/T364457) (owner: 10Andrew Bogott) [16:50:17] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1003.eqiad.wmnet with OS bookworm [16:54:15] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1105-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - 
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [16:54:59] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_eqiad [16:55:01] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_eqiad [16:55:15] (03PS1) 10JHathaway: wikimedia.org: temporarily add mx-in1001 as an MX server [dns] - 10https://gerrit.wikimedia.org/r/1050426 (https://phabricator.wikimedia.org/T367517) [16:55:51] (03PS2) 10JHathaway: temporarily add mx-in1001 as an MX server [dns] - 10https://gerrit.wikimedia.org/r/1050426 (https://phabricator.wikimedia.org/T367517) [16:56:26] (03PS3) 10JHathaway: temporarily add mx-in1001 as an MX server [dns] - 10https://gerrit.wikimedia.org/r/1050426 (https://phabricator.wikimedia.org/T367517) [16:58:10] (03CR) 10Hashar: [C:03+2] Handle image-diff external dependencies [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1050420 (https://phabricator.wikimedia.org/T341291) (owner: 10Hashar) [16:58:10] (03PS1) 10Hashar: Add image-diff JavaScript plugin (take 2) [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1050428 (https://phabricator.wikimedia.org/T341291) [16:59:53] (03CR) 10Hashar: [C:03+2] Add image-diff JavaScript plugin (take 2) [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1050428 (https://phabricator.wikimedia.org/T341291) (owner: 10Hashar) [17:00:05] bd808: OwO what's this, a deployment window?? Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240627T1700). nyaa~ [17:00:05] swfrench-wmf: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki infrastructure (UTC late) deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240627T1700). [17:01:22] (03CR) 10Dzahn: [C:03+2] admin: add dreamyjazz to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1050069 (https://phabricator.wikimedia.org/T368260) (owner: 10Dzahn) [17:01:28] (03PS2) 10Dzahn: admin: add dreamyjazz to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1050069 (https://phabricator.wikimedia.org/T368260) [17:01:56] (03PS2) 10Scott French: mediawiki: add securityContext to all containers (attempt 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049607 (https://phabricator.wikimedia.org/T362978) [17:01:56] (03PS6) 10Scott French: mediawiki: enable securityContext everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046693 (https://phabricator.wikimedia.org/T362978) [17:02:01] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9931515 (10elukey) @Papaul I tried to use the redfish endpoint for sretest2001, but I get unauthorized for most of the calls: ` >>> spicerack_redfish.request("get", "/r... [17:02:45] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1220811984 and 60 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:02:52] here. rebasing to pick up more recent changes, but should be ready to start soon. 
[17:04:15] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [17:04:21] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for Kgraessle - https://phabricator.wikimedia.org/T367747#9931522 (10Dzahn) 05In progress→03Resolved You have now been added to the group as requested. Feel free to test the Superset a... [17:04:45] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 264 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:05:03] nothing for me to push out this week. [17:05:04] (03CR) 10Muehlenhoff: [C:03+1] "Ship it!" [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [17:06:03] (03CR) 10Dzahn: [V:03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1050069 (https://phabricator.wikimedia.org/T368260) (owner: 10Dzahn) [17:06:08] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5021.eqsin.wmnet with OS bullseye [17:06:19] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9931532 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5021.eqsin.wmnet with OS bulls... 
[17:06:43] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirtlocal1003.eqiad.wmnet with reason: host reimage [17:07:04] (03PS1) 10Andrew Bogott: Move 5 cloudvirts to ovs [puppet] - 10https://gerrit.wikimedia.org/r/1050430 (https://phabricator.wikimedia.org/T364457) [17:07:30] (03CR) 10Scott French: [C:03+2] mediawiki: add securityContext to all containers (attempt 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049607 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [17:09:06] (03Merged) 10jenkins-bot: mediawiki: add securityContext to all containers (attempt 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049607 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [17:09:26] (03CR) 10CI reject: [V:04-1] Handle image-diff external dependencies [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1050420 (https://phabricator.wikimedia.org/T341291) (owner: 10Hashar) [17:09:40] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for Kgraessle - https://phabricator.wikimedia.org/T367747#9931555 (10Dzahn) [17:09:49] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirtlocal1003.eqiad.wmnet with reason: host reimage [17:09:54] (03Merged) 10jenkins-bot: Add image-diff JavaScript plugin (take 2) [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1050428 (https://phabricator.wikimedia.org/T341291) (owner: 10Hashar) [17:10:58] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for WBrown (WMF) - https://phabricator.wikimedia.org/T368260#9931539 (10Dzahn) 05In progress→03Resolved You have now been added to the group as requested. Feel free to test the ne... 
[17:11:06] 06SRE, 06Infrastructure-Foundations, 06serviceops: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy - https://phabricator.wikimedia.org/T354855#9931559 (10MoritzMuehlenhoff) >>! In T354855#9931347, @Clement_Goubert wrote: > This issue is biting us again, the time betw... [17:11:53] (03CR) 10Dzahn: "sorry for removing the ticket. What I wanted to do was to edit the topic but since a recent Gerrit version you can't set the topic on othe" [puppet] - 10https://gerrit.wikimedia.org/r/1049391 (https://phabricator.wikimedia.org/T368159) (owner: 10Slyngshede) [17:12:05] (03CR) 10Dzahn: "s/ticket/topic" [puppet] - 10https://gerrit.wikimedia.org/r/1049391 (https://phabricator.wikimedia.org/T368159) (owner: 10Slyngshede) [17:13:04] (03PS4) 10JHathaway: temporarily add mx-in1001 as an MX server [dns] - 10https://gerrit.wikimedia.org/r/1050426 (https://phabricator.wikimedia.org/T367517) [17:13:34] !log hashar@deploy1002 Started deploy [gerrit/gerrit@8c6ae73]: Add image-diff JavaScript plugin (take 2) - T341291 [17:13:40] T341291: Install gerrit image-diff plugin - https://phabricator.wikimedia.org/T341291 [17:13:41] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@8c6ae73]: Add image-diff JavaScript plugin (take 2) - T341291 (duration: 00m 07s) [17:14:10] (03CR) 10CI reject: [V:04-1] temporarily add mx-in1001 as an MX server [dns] - 10https://gerrit.wikimedia.org/r/1050426 (https://phabricator.wikimedia.org/T367517) (owner: 10JHathaway) [17:14:23] (03CR) 10Dzahn: "Mcastro approved on the ticket but that account is not linked to a WMF SUL account yet as pointed out by Andre" [puppet] - 10https://gerrit.wikimedia.org/r/1049391 (https://phabricator.wikimedia.org/T368159) (owner: 10Slyngshede) [17:14:30] (03CR) 10Hashar: [C:03+2] Handle image-diff external dependencies [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1050420 (https://phabricator.wikimedia.org/T341291) (owner: 10Hashar) [17:14:58] !log 
swfrench@deploy1002 Started scap: (no justification provided) [17:15:34] (03PS1) 10DLynch: Enable DiscussionTools permalinks on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050432 (https://phabricator.wikimedia.org/T365974) [17:16:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050432 (https://phabricator.wikimedia.org/T365974) (owner: 10DLynch) [17:17:44] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1057.eqiad.wmnet with OS bookworm [17:18:51] (03PS1) 10Hashar: Revert "Add image-diff JavaScript plugin (take 2)" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1050433 [17:18:51] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1058.eqiad.wmnet with OS bookworm [17:18:58] (03CR) 10Hashar: [C:03+2] Revert "Add image-diff JavaScript plugin (take 2)" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1050433 (owner: 10Hashar) [17:19:01] (03PS5) 10JHathaway: temporarily add mx-in1001 as an MX server [dns] - 10https://gerrit.wikimedia.org/r/1050426 (https://phabricator.wikimedia.org/T367517) [17:19:15] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1065.eqiad.wmnet with OS bookworm [17:19:26] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1066.eqiad.wmnet with OS bookworm [17:19:27] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1067.eqiad.wmnet with OS bookworm [17:22:15] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T368566#9931607 (10Dzahn) Thank you! That looks all good now. I'll upload the change to get you added to the group. 
[17:22:27] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T368566#9931608 (10Dzahn) a:05Sharvaniharan→03Dzahn [17:23:02] !log swfrench@deploy1002 Finished scap: (no justification provided) (duration: 08m 03s) [17:23:04] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T368566#9931609 (10Dzahn) 05Open→03In progress p:05Triage→03High [17:24:20] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1105-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [17:25:58] (03Merged) 10jenkins-bot: Handle image-diff external dependencies [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1050420 (https://phabricator.wikimedia.org/T341291) (owner: 10Hashar) [17:25:59] (03Merged) 10jenkins-bot: Revert "Add image-diff JavaScript plugin (take 2)" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1050433 (owner: 10Hashar) [17:26:18] !log hashar@deploy1002 Started deploy [gerrit/gerrit@7659481]: Revert Add image-diff JavaScript plugin (take 2) [17:26:19] !log hashar@deploy1002 deploy aborted: Revert Add image-diff JavaScript plugin (take 2) (duration: 00m 00s) [17:26:26] !log hashar@deploy1002 Started deploy [gerrit/gerrit@7659481]: Revert "Add image-diff JavaScript plugin (take 2)" [17:26:33] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@7659481]: Revert "Add image-diff JavaScript plugin (take 2)" (duration: 00m 07s) [17:27:56] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T368566#9931633 (10Dzahn) Hello @Sharvaniharan I noticed you already have shell access so no code change is actually needed. 
Then I was about to add you to the "wmf" LDAP group and noticed you also a... [17:28:51] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 98219200 and 13 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:29:08] (03CR) 10Andrew Bogott: [C:03+2] Move 5 cloudvirts to ovs [puppet] - 10https://gerrit.wikimedia.org/r/1050430 (https://phabricator.wikimedia.org/T364457) (owner: 10Andrew Bogott) [17:29:51] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 116232 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:32:20] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [17:32:22] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [17:32:45] (03CR) 10Ssingh: "Looks good! There are per-host overrides for cp5017 and cp4052 that we can also remove in this commit." 
[puppet] - 10https://gerrit.wikimedia.org/r/1050417 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [17:33:05] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [17:33:06] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [17:33:27] !log canary deployments are healthy, slow-logs still produced, continuing with main deployments for T362978 [17:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:32] T362978: Update all helm modules and charts to be compatible with the restricted PSS - https://phabricator.wikimedia.org/T362978 [17:33:32] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [17:33:32] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1057.eqiad.wmnet with reason: host reimage [17:33:33] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [17:33:59] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1065.eqiad.wmnet with reason: host reimage [17:34:01] (03CR) 10Scott French: [C:03+2] mediawiki: enable securityContext everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046693 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [17:34:09] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1067.eqiad.wmnet with reason: host reimage [17:34:10] (03CR) 10Ssingh: "hieradata/cp4052.yaml, etc" [puppet] - 10https://gerrit.wikimedia.org/r/1050417 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [17:34:22] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1066.eqiad.wmnet with reason: host reimage [17:34:52] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirtlocal1003.eqiad.wmnet with OS bookworm [17:34:55] !log andrew@cumin1002 
START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1058.eqiad.wmnet with reason: host reimage [17:35:46] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [17:35:48] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [17:36:07] (03Merged) 10jenkins-bot: mediawiki: enable securityContext everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046693 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [17:36:59] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1057.eqiad.wmnet with reason: host reimage [17:37:21] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [17:37:23] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [17:39:54] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1058.eqiad.wmnet with reason: host reimage [17:41:30] !log swfrench@deploy1002 Started scap: Deploying securityContext changes for T362978 to main release [17:41:38] T362978: Update all helm modules and charts to be compatible with the restricted PSS - https://phabricator.wikimedia.org/T362978 [17:43:10] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1066.eqiad.wmnet with reason: host reimage [17:43:25] (03CR) 10Eevans: [C:03+2] cassandra: remove support for 2.x versions [puppet] - 10https://gerrit.wikimedia.org/r/1050041 (owner: 10Eevans) [17:45:39] !log swfrench@deploy1002 Finished scap: Deploying securityContext changes for T362978 to main release (duration: 04m 09s) [17:47:25] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1065.eqiad.wmnet with reason: host reimage [17:50:52] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudvirt1067.eqiad.wmnet with reason: host reimage [17:51:44] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5021.eqsin.wmnet [17:58:52] (03PS34) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [17:59:55] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1057.eqiad.wmnet with OS bookworm [18:00:05] jeena and jnuche: Your horoscope predicts another MediaWiki train - Utc-7+Utc-0 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240627T1800). [18:00:16] o/ [18:02:12] 06SRE, 06collaboration-services, 10LDAP-Access-Requests, 10Phabricator: Offboard Lea WMDE (Lea Voget) from the WMF systems - https://phabricator.wikimedia.org/T368139#9931829 (10Dzahn) a:05Dzahn→03SLyngshede-WMF Hi Simon, could you take a look? [18:04:05] (03PS1) 10TrainBranchBot: group2 wikis to 1.43.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050440 (https://phabricator.wikimedia.org/T366956) [18:04:07] (03CR) 10TrainBranchBot: [C:03+2] group2 wikis to 1.43.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050440 (https://phabricator.wikimedia.org/T366956) (owner: 10TrainBranchBot) [18:04:45] (03Merged) 10jenkins-bot: group2 wikis to 1.43.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050440 (https://phabricator.wikimedia.org/T366956) (owner: 10TrainBranchBot) [18:04:55] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1058.eqiad.wmnet with OS bookworm [18:04:56] (03PS1) 10Kosta Harlan: QuickSurveys: Add testing survey configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050441 (https://phabricator.wikimedia.org/T368459) [18:05:36] (03CR) 10CI reject: [V:04-1] QuickSurveys: Add testing survey configuration [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1050441 (https://phabricator.wikimedia.org/T368459) (owner: 10Kosta Harlan) [18:06:20] (03CR) 10Bking: [C:03+1] sre.hosts.reimage: Only print 'starting reimage' when it starts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1046668 (owner: 10Majavah) [18:06:50] 06SRE, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#9931839 (10Dzahn) I can't really confirm that from my side right now. When I click on that it's fast. Maybe it's only slow sometimes while someth... [18:08:04] (03PS2) 10Kosta Harlan: QuickSurveys: Add testing survey configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050441 (https://phabricator.wikimedia.org/T368459) [18:08:36] (03Abandoned) 10Bartosz Dziewoński: Update wgCdnMaxAge value and documentation to match Varnish [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029187 (owner: 10Bartosz Dziewoński) [18:08:56] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1066.eqiad.wmnet with OS bookworm [18:10:24] !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5022.eqsin.wmnet [18:10:29] (03CR) 10BCornwall: [C:03+2] cp5022: update hieradata for dual NVMe disks configuration [puppet] - 10https://gerrit.wikimedia.org/r/1049173 (https://phabricator.wikimedia.org/T365763) (owner: 10Ssingh) [18:11:10] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [18:11:18] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9931845 (10BCornwall) [18:11:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1050441 (https://phabricator.wikimedia.org/T368459) (owner: 10Kosta Harlan) [18:11:44] (03Abandoned) 10Dzahn: peopleweb: set profile::firewall::defs_from_etcd to false [puppet] - 10https://gerrit.wikimedia.org/r/1050080 (https://phabricator.wikimedia.org/T356296) (owner: 10Dzahn) [18:11:50] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [18:12:04] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1065.eqiad.wmnet with OS bookworm [18:12:15] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9931848 (10VRiley-WMF) 05Open→03In progress I will now be proceeding with swapping the entire server again. I will be using a different server in hopes that it should boot up. [18:12:32] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.43.0-wmf.11 refs T366956 [18:12:38] T366956: 1.43.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T366956 [18:12:51] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [18:12:57] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [18:14:25] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1067.eqiad.wmnet with OS bookworm [18:14:49] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9931860 (10mforns) > @SGupta-WMF or @mforns - One additional request: if one o... 
[18:15:34] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [18:15:57] 10ops-eqiad, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368648 (10phaultfinder) 03NEW [18:18:51] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on aqs1013.eqiad.wmnet with reason: Server swap — T362033 [18:18:57] T362033: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033 [18:19:05] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on aqs1013.eqiad.wmnet with reason: Server swap — T362033 [18:19:10] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9931875 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cc0c33c0-ef80-4a74-941e-aab16294505c) set by eevans@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with r... [18:19:11] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5022.eqsin.wmnet with OS bullseye [18:19:22] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9931877 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5022.eqsin.wmnet with OS b... 
[18:19:35] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [18:19:53] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [18:29:15] FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:35:36] (03PS1) 10JHathaway: lower MX TTLs prior to adding mx-in servers [dns] - 10https://gerrit.wikimedia.org/r/1050442 (https://phabricator.wikimedia.org/T367517) [18:36:28] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5022.eqsin.wmnet with OS bullseye [18:36:50] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5022.eqsin.wmnet with OS bullseye [18:36:53] ACKNOWLEDGEMENT - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T368656 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [18:39:14] RESOLVED: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:39:35] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9932017 (10VRiley-WMF) I have swapped the HDD's over to the new server. It looks like it has powered up okay at this point. 
[18:39:42] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9932018 (10VRiley-WMF) 05In progress→03Open [18:41:50] (03CR) 10JHathaway: [C:03+2] lower MX TTLs prior to adding mx-in servers [dns] - 10https://gerrit.wikimedia.org/r/1050442 (https://phabricator.wikimedia.org/T367517) (owner: 10JHathaway) [18:46:17] (03CR) 10Majavah: [C:03+2] sre.hosts.reimage: Only print 'starting reimage' when it starts [cookbooks] - 10https://gerrit.wikimedia.org/r/1046668 (owner: 10Majavah) [18:49:26] (03PS1) 10Andrew Bogott: Two more cloudvirts to ovs [puppet] - 10https://gerrit.wikimedia.org/r/1050444 (https://phabricator.wikimedia.org/T364457) [18:50:03] (03Merged) 10jenkins-bot: sre.hosts.reimage: Only print 'starting reimage' when it starts [cookbooks] - 10https://gerrit.wikimedia.org/r/1046668 (owner: 10Majavah) [18:50:10] (03PS6) 10JHathaway: temporarily add mx-in1001 as an MX server [dns] - 10https://gerrit.wikimedia.org/r/1050426 (https://phabricator.wikimedia.org/T367517) [18:52:23] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bookworm [18:52:57] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9932100 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5022.eqsin.wmnet with OS bulls... [18:52:58] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T368656 (10ops-monitoring-bot) 03NEW [18:53:00] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9932102 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5022.eqsin.wmnet with OS b... 
[18:57:01] (03PS1) 10Ayounsi: Cookbooks: fix Netbox 4 breaking changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1050445 (https://phabricator.wikimedia.org/T336275) [18:57:59] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9932180 (10Eevans) For posterity sake: `lang=sh-session eevans@aqs1013:~$ sudo lshw -class disk *-disk:0 description: ATA Disk product: HFS1T9G32FEH-BA1... [18:58:16] (03CR) 10Scott French: [C:03+1] "Thanks, Reuven!" [puppet] - 10https://gerrit.wikimedia.org/r/1037868 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [18:59:00] (03CR) 10Dzahn: "the style guide used to say we should never have default values originally or maybe it still does, but I think that ship has sailed a long" [puppet] - 10https://gerrit.wikimedia.org/r/1050402 (https://phabricator.wikimedia.org/T367352) (owner: 10Ahmon Dancy) [19:00:05] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [19:00:14] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [19:00:34] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [19:07:02] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1064.eqiad.wmnet with reason: host reimage [19:07:55] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1059.eqiad.wmnet with OS bookworm [19:08:01] (03CR) 10Andrew Bogott: [C:03+2] Two more cloudvirts to ovs [puppet] - 10https://gerrit.wikimedia.org/r/1050444 (https://phabricator.wikimedia.org/T364457) (owner: 10Andrew Bogott) [19:09:15] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5022.eqsin.wmnet with reason: host reimage [19:09:20] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9932231 (10Papaul) @elukey like i 
mentioned on IRC this is a license issue . sretest2001 is using SFT-OOB-LIC for license or kubernetes2054 is using SFT-DCMS-SINGLE for... [19:10:02] (03CR) 10Ahmon Dancy: "Is there a proper hiera file that I can use to set defaults for profile::gitlab::runner::buildkitd_dockerfile_frontend_enabled, profile::g" [puppet] - 10https://gerrit.wikimedia.org/r/1050402 (https://phabricator.wikimedia.org/T367352) (owner: 10Ahmon Dancy) [19:10:27] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1064.eqiad.wmnet with reason: host reimage [19:10:58] (03CR) 10Ryan Kemper: [C:03+2] query_service: Add Access-Control-Allow-Headers [puppet] - 10https://gerrit.wikimedia.org/r/1024884 (https://phabricator.wikimedia.org/T362570) (owner: 10Lucas Werkmeister) [19:13:27] (03CR) 10Dzahn: "hour parameter should be just fine. We added hour and minute and some point, just didn't have it at the start." [puppet] - 10https://gerrit.wikimedia.org/r/1049890 (https://phabricator.wikimedia.org/T368460) (owner: 10Aklapper) [19:14:04] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5022.eqsin.wmnet with reason: host reimage [19:14:28] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9932254 (10Scott_French) Thanks, @mforns! Also, I see you hit retry on the fa... 
[19:16:13] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1042898 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [19:17:33] (03PS3) 10Ayounsi: Homer: fix Netbox 4 breaking changes [software/homer] - 10https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275) [19:22:02] 06SRE, 06SRE-OnFire, 10Incident Tooling, 10Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377#9932277 (10CDanis) Thanks Moritz, that sounds great to me. @Urbanecm are you interested in writing some patches if I do code reviews? [19:23:34] (03CR) 10Dzahn: [C:03+2] Provide weekly Phabricator data for Tech News [puppet] - 10https://gerrit.wikimedia.org/r/1049890 (https://phabricator.wikimedia.org/T368460) (owner: 10Aklapper) [19:23:45] (03CR) 10CDanis: [C:03+1] "lgtm!" [dns] - 10https://gerrit.wikimedia.org/r/1050426 (https://phabricator.wikimedia.org/T367517) (owner: 10JHathaway) [19:23:56] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1059.eqiad.wmnet with reason: host reimage [19:25:22] 06SRE, 06SRE-OnFire, 10Incident Tooling, 10Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377#9932315 (10MoritzMuehlenhoff) Anf FYI, https://gerrit.wikimedia.org/g/operations/software/bitu-ldap is a wrapper for simplifying LDAP operations... 
[19:27:20] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1059.eqiad.wmnet with reason: host reimage [19:33:51] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1064.eqiad.wmnet with OS bookworm [19:33:56] (03CR) 10Dzahn: [C:03+2] "timer and service have been created, but: ERROR 2005 (HY000): Unknown MySQL server host '-P' (-2)" [puppet] - 10https://gerrit.wikimedia.org/r/1049890 (https://phabricator.wikimedia.org/T368460) (owner: 10Aklapper) [19:34:37] (03PS5) 10Jdlrobson: Enable special pages in dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049975 (https://phabricator.wikimedia.org/T366384) [19:35:04] (03PS6) 10Jdlrobson: Enable special pages in dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049975 (https://phabricator.wikimedia.org/T366384) [19:35:08] (03PS7) 10Jdlrobson: Enable special pages in dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049975 (https://phabricator.wikimedia.org/T366384) [19:35:26] (03CR) 10Dzahn: [C:03+2] "typo in name of the included config file. fixing. 
phab_tech_news_weekly_stats.conf" [puppet] - 10https://gerrit.wikimedia.org/r/1049890 (https://phabricator.wikimedia.org/T368460) (owner: 10Aklapper) [19:37:34] (03PS1) 10Dzahn: phabricator: fix config file name for automated tech news mails [puppet] - 10https://gerrit.wikimedia.org/r/1050451 (https://phabricator.wikimedia.org/T368460) [19:38:34] (03CR) 10Dzahn: [C:03+2] phabricator: fix config file name for automated tech news mails [puppet] - 10https://gerrit.wikimedia.org/r/1050451 (https://phabricator.wikimedia.org/T368460) (owner: 10Dzahn) [19:39:25] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9932366 (10mforns) Thanks for the follow up @Scott_French! @SGupta-WMF, I trie... [19:39:27] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9932373 (10Marostegui) 1G is fine, thanks! [19:39:38] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T368656#9932377 (10Dzahn) →14Duplicate dup:03T368564 [19:39:58] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T368564#9932375 (10Dzahn) [19:40:00] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T368564#9932380 (10Dzahn) [19:41:46] (03CR) 10Ryan Kemper: [C:03+1] "This LGTM. Should just be a no-op. 
I'll wait for someone responsible for maps to give the final go-ahead though, but otherwise this is rea" [puppet] - 10https://gerrit.wikimedia.org/r/1042898 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [19:43:15] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9932401 (10mforns) Also @SGupta-WMF I reviewed the swagger spec, and I found a... [19:44:00] (03PS1) 10Ayounsi: Tox: add Python3.12 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 [19:44:01] (03PS1) 10Ayounsi: Spicerack: fix Netbox 4 breaking changes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) [19:44:21] 06SRE, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#9932404 (10Krd) With the curl command you are not logged in, are you? [19:45:41] 06SRE, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#9932418 (10Dzahn) That's correct, I am not logged in. [19:46:20] 06SRE, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#9932420 (10Krd) With the curl command you are not logged in, are you? [19:47:28] mw-page-content-change flink app is down in eqiad. not sure why. [19:47:36] https://phabricator.wikimedia.org/T368667 [19:47:48] not many folks online. i'm going to attempt to revive it... 
will log [19:48:01] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5022.eqsin.wmnet with OS bullseye [19:48:13] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9932421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5022.eqsin.wmnet with OS bulls... [19:49:31] ottomata LMK if you need help troubleshooting [19:49:41] inflatador: great thank you [19:49:41] i do [19:49:44] been a while for me [19:50:00] it looks like the jobmanager pod is still up [19:50:04] but no more task managers have been made [19:50:05] (03CR) 10CI reject: [V:04-1] Spicerack: fix Netbox 4 breaking changes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [19:50:11] at this moment I'd just like to get it restarted [19:50:16] ottomata ACK, I saw an earlier ticket where this problem happened? [19:50:19] i'm considering deleting the jobmanager pod. [19:50:19] in staging? [19:50:30] hm, maybe. we've seen lots of weird stuff in staging before [19:50:33] but never in eqiad [19:50:51] (03PS1) 10Ryan Kemper: wdqs graph split: new PTR records [dns] - 10https://gerrit.wikimedia.org/r/1050454 (https://phabricator.wikimedia.org/T364364) [19:51:02] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1059.eqiad.wmnet with OS bookworm [19:51:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:51:49] inflatador: would you recommend deleting the pod to see if flink-k8s-operator will just restart? 
[19:51:57] ottomata Y, can do that if you like [19:52:06] i'm in now i think i can [19:52:07] Also this was the other ticket I mentioned T367116 [19:52:08] T367116: mw-page-content-change-enrich flink app is missing in k8s staging - https://phabricator.wikimedia.org/T367116 [19:52:20] in this case the app is still there...just in failed state [19:53:28] !log deleted mw-page-content-change-enrich stuck jobmanager pod: kubectl -n mw-page-content-change-enrich delete pod flink-app-main-859d98c57b-zrgwk - T368667 [19:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:33] T368667: [Event Platform] mw-page-content-change-enrich down in eqiad 2024-06-27 - https://phabricator.wikimedia.org/T368667 [19:53:38] (03PS2) 10Ayounsi: Spicerack: fix Netbox 4 breaking changes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) [19:53:46] looks like it is restarting... [19:54:08] oo error ... [19:54:18] ottomata Y looking too [19:54:51] ah [19:54:51] Ignoring JobGraph submission 'mw-page-content-change-enrich' (47599a716e2be491c27d9849e4991e6e) because the job already reached a globally-terminal state (i.e. FAILED, CANCELED, FINISHED) in a previous execution. [19:55:00] (03CR) 10Dzahn: "for production they could be in hieradata/role/common/gitlab_runner.yaml (exists) or in hieradata/common/profile/gitlab/runner.yaml (doesn" [puppet] - 10https://gerrit.wikimedia.org/r/1050402 (https://phabricator.wikimedia.org/T367352) (owner: 10Ahmon Dancy) [19:55:02] i think we have to delete the deployment? 
[19:55:18] ottomata y, been awhile since once of them got goofy like that [19:55:30] Happy to do that, I think it's a rollout restart or some such [19:55:47] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5022.eqsin.wmnet [19:55:55] inflatador: please do [19:55:56] thanjk you [19:56:04] ottomata ACK, will get start [19:56:05] https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_restart [19:56:06] ? [19:56:08] oh no [19:56:11] this one [19:56:11] https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Undeploy/delete_a_release [19:56:23] (03CR) 10Dzahn: [C:03+2] "after https://gerrit.wikimedia.org/r/c/operations/puppet/+/1050451 I tested this and let it send the mail to myself and it looks fine to " [puppet] - 10https://gerrit.wikimedia.org/r/1049890 (https://phabricator.wikimedia.org/T368460) (owner: 10Aklapper) [19:58:59] (03PS3) 10Kosta Harlan: QuickSurveys: Add testing survey configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050441 (https://phabricator.wikimedia.org/T368459) [19:59:08] ottomata done, let's see if things improve [19:59:40] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:59:52] k [19:59:54] (03CR) 10CI reject: [V:04-1] Spicerack: fix Netbox 4 breaking changes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [20:00:03] ottomata nope! [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240627T2000) [20:00:05] kemayo and kostajh: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:16] o/ [20:00:33] hi [20:00:42] still trying to debug something with my patch, so please go ahead Kemayo [20:00:50] same error inflatador [20:00:51] yeah [20:01:13] ottomata sorry, things are moving too fast in this channel, let's talk in #search [20:01:46] k [20:02:25] Kemayo: do you need a deployer? [20:02:29] I do. [20:02:57] okay, I can deploy [20:03:04] jeena: Thanks! [20:03:39] !log bking@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [20:03:47] !log bking@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [20:04:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050432 (https://phabricator.wikimedia.org/T365974) (owner: 10DLynch) [20:05:02] (03PS1) 10Ahmon Dancy: gitlab::runner: Remove default values for profile::gitlab::runner::buildkitd_* lookups. 
[puppet] - 10https://gerrit.wikimedia.org/r/1050455 (https://phabricator.wikimedia.org/T367352) [20:05:45] (03Merged) 10jenkins-bot: Enable DiscussionTools permalinks on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050432 (https://phabricator.wikimedia.org/T365974) (owner: 10DLynch) [20:05:52] (03PS2) 10Ryan Kemper: wdqs graph split: new PTR records [dns] - 10https://gerrit.wikimedia.org/r/1050454 (https://phabricator.wikimedia.org/T364364) [20:06:02] !log jhuneidi@deploy1002 Started scap: Backport for [[gerrit:1050432|Enable DiscussionTools permalinks on enwiki (T365974)]] [20:06:07] T365974: Deploy talk page permalinks to en.wiki - https://phabricator.wikimedia.org/T365974 [20:08:23] !log jhuneidi@deploy1002 jhuneidi, kemayo: Backport for [[gerrit:1050432|Enable DiscussionTools permalinks on enwiki (T365974)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:08:43] (03CR) 10CI reject: [V:04-1] gitlab::runner: Remove default values for profile::gitlab::runner::buildkitd_* lookups. [puppet] - 10https://gerrit.wikimedia.org/r/1050455 (https://phabricator.wikimedia.org/T367352) (owner: 10Ahmon Dancy) [20:10:57] Kemayo: do you need to do any checks on the testservers? [20:11:26] jeena: I just checked, it looks good. [20:11:55] okay thanks! [20:11:59] !log jhuneidi@deploy1002 jhuneidi, kemayo: Continuing with sync [20:12:09] (03PS2) 10Ahmon Dancy: gitlab::runner: Move defaults for profile::gitlab::runner::buildkitd_* lookups [puppet] - 10https://gerrit.wikimedia.org/r/1050455 (https://phabricator.wikimedia.org/T367352) [20:13:40] jeena: I'm ready to deploy whenever that's done. 
But happy for you to sync the change too :) [20:14:06] kostajh: sure, I can do it after this one :) [20:14:38] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1050455 (https://phabricator.wikimedia.org/T367352) (owner: 10Ahmon Dancy) [20:15:57] thx [20:16:37] !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5023.eqsin.wmnet [20:16:46] (03CR) 10Ahmon Dancy: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1050455" [puppet] - 10https://gerrit.wikimedia.org/r/1050402 (https://phabricator.wikimedia.org/T367352) (owner: 10Ahmon Dancy) [20:16:57] (03CR) 10BCornwall: [C:03+2] cp5023: update hieradata for dual NVMe disks configuration [puppet] - 10https://gerrit.wikimedia.org/r/1049174 (https://phabricator.wikimedia.org/T365763) (owner: 10Ssingh) [20:17:11] !log jhuneidi@deploy1002 Finished scap: Backport for [[gerrit:1050432|Enable DiscussionTools permalinks on enwiki (T365974)]] (duration: 11m 09s) [20:17:27] T365974: Deploy talk page permalinks to en.wiki - https://phabricator.wikimedia.org/T365974 [20:17:33] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#9932489 (10Jclark-ctr) @BTullis can you update preseed.yam and site.pp file for these servers [20:17:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050441 (https://phabricator.wikimedia.org/T368459) (owner: 10Kosta Harlan) [20:17:43] (03CR) 10Ahmon Dancy: "PCC results: https://puppet-compiler.wmflabs.org/output/1050455/3848/" [puppet] - 10https://gerrit.wikimedia.org/r/1050455 (https://phabricator.wikimedia.org/T367352) (owner: 10Ahmon Dancy) [20:18:06] kostajh: looks like it needs a rebase [20:18:15] jeena: ok, just a sec [20:18:33] (03PS4) 10Kosta Harlan: QuickSurveys: Add testing survey configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050441 
(https://phabricator.wikimedia.org/T368459) [20:18:50] done [20:19:01] (03CR) 10TrainBranchBot: "Approved by jhuneidi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050441 (https://phabricator.wikimedia.org/T368459) (owner: 10Kosta Harlan) [20:19:10] thanks! [20:19:33] (03CR) 10Ottomata: [C:03+2] Bump page_change and page_content_change event schema versions to make performer optional [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047995 (https://phabricator.wikimedia.org/T367923) (owner: 10Ottomata) [20:19:44] (03Merged) 10jenkins-bot: QuickSurveys: Add testing survey configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050441 (https://phabricator.wikimedia.org/T368459) (owner: 10Kosta Harlan) [20:20:00] !log jhuneidi@deploy1002 Started scap: Backport for [[gerrit:1050441|QuickSurveys: Add testing survey configuration (T368459)]] [20:20:08] T368459: Test new QuickSurveys features on testwiki - https://phabricator.wikimedia.org/T368459 [20:20:32] (03Merged) 10jenkins-bot: Bump page_change and page_content_change event schema versions to make performer optional [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047995 (https://phabricator.wikimedia.org/T367923) (owner: 10Ottomata) [20:21:27] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [20:21:35] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [20:22:21] !log jhuneidi@deploy1002 kharlan, jhuneidi: Backport for [[gerrit:1050441|QuickSurveys: Add testing survey configuration (T368459)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:23:05] kostajh: ready for any checks you need to do [20:24:10] jeena: thx, looking [20:24:31] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [20:24:51] !log otto@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/mw-page-content-change-enrich: apply [20:25:52] jeena: bah, not working, but let me see if I can find some logs to work out why [20:25:59] ok [20:27:35] gah... QuickSurveys is not enabled on testwiki [20:27:48] * kostajh facepalms [20:27:52] ohh, shall I just go ahead and sync then? [20:28:12] is it ok to sync, and then I make a follow-up which enables on testwiki, and we sync that and verify then? [20:29:05] (03PS1) 10Kosta Harlan: testwiki: Enable QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050460 (https://phabricator.wikimedia.org/T368459) [20:29:26] ^ would enable QuickSurveys on testwiki [20:29:34] oh sure [20:29:46] !log jhuneidi@deploy1002 kharlan, jhuneidi: Continuing with sync [20:30:53] I'll go ahead and +2 your new change [20:31:06] (03CR) 10Jeena Huneidi: [C:03+2] testwiki: Enable QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050460 (https://phabricator.wikimedia.org/T368459) (owner: 10Kosta Harlan) [20:31:12] thanks [20:31:15] added to the calendar [20:31:48] (03Merged) 10jenkins-bot: testwiki: Enable QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050460 (https://phabricator.wikimedia.org/T368459) (owner: 10Kosta Harlan) [20:31:58] (03PS1) 10Ottomata: mw-page-content-change-enrich Bump image version to pick up latest schema repos [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050461 (https://phabricator.wikimedia.org/T367923) [20:34:46] !log jhuneidi@deploy1002 Finished scap: Backport for [[gerrit:1050441|QuickSurveys: Add testing survey configuration (T368459)]] (duration: 14m 45s) [20:34:52] T368459: Test new QuickSurveys features on testwiki - https://phabricator.wikimedia.org/T368459 [20:35:32] !log jhuneidi@deploy1002 Started scap: Backport for [[gerrit:1050460|testwiki: Enable QuickSurveys (T368459)]] [20:36:35] (03CR) 10Ottomata: [C:03+2] mw-page-content-change-enrich Bump image version to pick up latest schema repos [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1050461 (https://phabricator.wikimedia.org/T367923) (owner: 10Ottomata) [20:37:37] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [20:37:49] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [20:37:54] !log jhuneidi@deploy1002 kharlan, jhuneidi: Backport for [[gerrit:1050460|testwiki: Enable QuickSurveys (T368459)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:38:05] kostajh: ready for you on mwdebug [20:38:09] thanks, checking [20:39:49] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5023.eqsin.wmnet with OS bullseye [20:40:02] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9932574 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5023.eqsin.wmnet with OS b... 
[20:40:31] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [20:40:40] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [20:42:02] jeena: will need a few minutes to debug again, sorry [20:42:10] np [20:43:10] (03CR) 10Bking: [C:03+1] wdqs graph split: new PTR records [dns] - 10https://gerrit.wikimedia.org/r/1050454 (https://phabricator.wikimedia.org/T364364) (owner: 10Ryan Kemper) [20:44:45] jeena: yay, it works [20:44:51] wahoo [20:44:59] !log jhuneidi@deploy1002 kharlan, jhuneidi: Continuing with sync [20:47:26] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [20:49:56] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-conf1005 - vriley@cumin1002" [20:50:05] !log jhuneidi@deploy1002 Finished scap: Backport for [[gerrit:1050460|testwiki: Enable QuickSurveys (T368459)]] (duration: 14m 33s) [20:50:11] T368459: Test new QuickSurveys features on testwiki - https://phabricator.wikimedia.org/T368459 [20:50:26] all done [20:50:51] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-conf1005 - vriley@cumin1002" [20:50:51] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:51:35] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-conf1005 [20:53:02] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-conf1005 [21:03:20] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#9932633 (10VRiley-WMF) [21:04:15] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - 
https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:05:33] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9932643 (10Scott_French) Thanks for giving that a try, @mforns ! Looking at `... [21:22:28] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [21:22:30] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [21:25:44] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5023.eqsin.wmnet with reason: host reimage [21:26:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:26:59] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 156063568 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:28:37] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5023.eqsin.wmnet with reason: host reimage [21:29:00] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 35752 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:29:43] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.10 point update - https://phabricator.wikimedia.org/T368288#9932687 (10MoritzMuehlenhoff) [21:31:32] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [21:31:51] RESOLVED: 
SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:32:39] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [21:42:49] (03CR) 10Dzahn: [C:03+2] gitlab::runner: Move defaults for profile::gitlab::runner::buildkitd_* lookups [puppet] - 10https://gerrit.wikimedia.org/r/1050455 (https://phabricator.wikimedia.org/T367352) (owner: 10Ahmon Dancy) [21:50:35] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [21:50:37] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [21:50:46] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [21:50:47] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [21:53:03] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [21:53:25] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [21:54:54] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [21:54:55] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [21:55:02] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [21:55:20] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [21:55:36] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [21:55:37] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [21:58:07] !log bking@deploy1002 helmfile [dse-k8s-eqiad] 
START helmfile.d/dse-k8s-services/airflow: apply [21:58:08] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [22:00:36] (03CR) 10Dzahn: [C:03+2] "thanks! noop in prod and cloud project confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/1050455 (https://phabricator.wikimedia.org/T367352) (owner: 10Ahmon Dancy) [22:01:05] (03CR) 10Ahmon Dancy: "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1050455 (https://phabricator.wikimedia.org/T367352) (owner: 10Ahmon Dancy) [22:02:25] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5023.eqsin.wmnet with OS bullseye [22:02:33] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9932790 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5023.eqsin.wmnet with OS bulls... [22:04:03] (03CR) 10BCornwall: [C:03+2] cp5024: update hieradata for dual NVMe disks configuration [puppet] - 10https://gerrit.wikimedia.org/r/1049175 (https://phabricator.wikimedia.org/T365763) (owner: 10Ssingh) [22:04:41] 10ops-eqsin, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9932805 (10BCornwall) [22:05:15] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [22:05:25] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5023.eqsin.wmnet [22:09:55] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [22:09:57] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [22:13:29] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [22:13:35] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/airflow: apply [22:19:22] !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5024.eqsin.wmnet [22:20:20] (03PS1) 10Dzahn: doc: redirect doc.wikimedia.org/analytics-api [puppet] - 10https://gerrit.wikimedia.org/r/1050469 (https://phabricator.wikimedia.org/T368482) [22:21:47] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [22:24:07] (03PS2) 10Dzahn: doc: redirect doc.wikimedia.org/analytics-api [puppet] - 10https://gerrit.wikimedia.org/r/1050469 (https://phabricator.wikimedia.org/T368482) [22:25:33] (03PS3) 10Dzahn: doc: redirect doc.wikimedia.org/analytics-api [puppet] - 10https://gerrit.wikimedia.org/r/1050469 (https://phabricator.wikimedia.org/T368482) [22:29:24] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [22:31:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T367856)', diff saved to https://phabricator.wikimedia.org/P65539 and previous config saved to /var/cache/conftool/dbconfig/20240627-223142-marostegui.json [22:31:48] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [22:34:32] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [22:34:40] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [22:34:44] (03CR) 10Dzahn: [V:03+1 C:03+1] "I tested this on the inactive host doc2002 and using the httpbb test i'm adding here:" [puppet] - 10https://gerrit.wikimedia.org/r/1050469 (https://phabricator.wikimedia.org/T368482) (owner: 10Dzahn) [22:36:51] RECOVERY - MariaDB Replica Lag: s4 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 23.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:37:51] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [22:37:56] !log 
bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [22:39:20] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [22:41:53] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [22:41:59] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [22:42:26] (CR) BCornwall: acme-chief: Add new certificates and domains (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1047147 (owner: BCornwall) [22:42:57] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [22:43:24] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [22:43:53] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5024.eqsin.wmnet with OS bullseye [22:43:59] ops-eqsin, DC-Ops, Traffic: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9932922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5024.eqsin.wmnet with OS bullseye [22:46:40] PROBLEM - Disk space on backup2003 is CRITICAL: DISK CRITICAL - free space: /srv/bacula 6222500 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=backup2003&var-datasource=codfw+prometheus/ops [22:46:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P65540 and previous config saved to /var/cache/conftool/dbconfig/20240627-224649-marostegui.json [22:49:17] (PS1) BCornwall: hiera: Unify eqsin trafficserver storage elements [puppet] - https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T365763) [22:51:24] ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - db1181 -
https://phabricator.wikimedia.org/T368648#9932934 (Dzahn) [22:55:42] (PS2) BCornwall: hiera: Unify all trafficserver storage elements [puppet] - https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) [22:56:56] (PS3) BCornwall: hiera: Unify all trafficserver storage elements [puppet] - https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) [23:01:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P65541 and previous config saved to /var/cache/conftool/dbconfig/20240627-230156-marostegui.json [23:05:33] !log Running `foreachwikiindblist group0.dblist extensions/CheckUser/maintenance/deleteReadOldRowsInCuChanges.php` for T366781 [23:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:42] T366781: Run maintenance script to delete entries only for use when reading old on WMF wikis - https://phabricator.wikimedia.org/T366781 [23:17:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T367856)', diff saved to https://phabricator.wikimedia.org/P65542 and previous config saved to /var/cache/conftool/dbconfig/20240627-231703-marostegui.json [23:17:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2197.codfw.wmnet with reason: Maintenance [23:17:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2197.codfw.wmnet with reason: Maintenance [23:17:10] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [23:18:45] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [23:19:11] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [23:24:35] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[23:24:42] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [23:33:15] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5024.eqsin.wmnet with OS bullseye [23:33:22] ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9933091 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5024.eqsin.wmnet with OS bullseye execu... [23:33:29] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5024.eqsin.wmnet with OS bullseye [23:33:38] ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9933092 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5024.eqsin.wmnet with OS bullseye [23:38:36] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1050483 [23:38:37] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1050483 (owner: TrainBranchBot) [23:44:56] (PS1) Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1050484 [23:48:27] (CR) BCornwall: "Done" [puppet] - https://gerrit.wikimedia.org/r/1039849 (owner: Ncmonitor) [23:51:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:54:10] (CR) BCornwall: [C:+1] DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1050484
(owner: Ncmonitor) [23:55:09] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [23:55:15] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [23:59:40] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed