[00:02:40] PROBLEM - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [00:06:42] RECOVERY - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [00:08:08] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1073300 (owner: 10TrainBranchBot) [00:28:53] (03CR) 10Dzahn: [V:03+1 C:03+1] "Great work. I tested this and it does render a wiki table. 
result at https://wikitech.wikimedia.org/wiki/Phabricator/T373952" [puppet] - 10https://gerrit.wikimedia.org/r/1072535 (https://phabricator.wikimedia.org/T373952) (owner: 10Aklapper) [00:30:18] (03PS2) 10Dzahn: phabricator: turn weekly data for tech news into Wikitext [puppet] - 10https://gerrit.wikimedia.org/r/1072535 (https://phabricator.wikimedia.org/T373952) (owner: 10Aklapper) [00:30:20] (03CR) 10Dzahn: [C:03+2] phabricator: turn weekly data for tech news into Wikitext [puppet] - 10https://gerrit.wikimedia.org/r/1072535 (https://phabricator.wikimedia.org/T373952) (owner: 10Aklapper) [00:30:54] (03CR) 10Dzahn: [V:03+2 C:03+2] phabricator: turn weekly data for tech news into Wikitext [puppet] - 10https://gerrit.wikimedia.org/r/1072535 (https://phabricator.wikimedia.org/T373952) (owner: 10Aklapper) [00:36:52] (03CR) 10Dzahn: [V:03+1 C:03+2] gerrit: add gerrit::proxy profile to insetup::gerrit role [puppet] - 10https://gerrit.wikimedia.org/r/1072323 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [00:54:35] (03PS1) 10Dzahn: gerrit::proxy: files managed under /var/www/ require httpd [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) [00:54:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [00:54:59] (03CR) 10CI reject: [V:04-1] gerrit::proxy: files managed under /var/www/ require httpd [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [00:56:16] (03PS2) 10Dzahn: gerrit::proxy: files managed under /var/www/ require httpd [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) [01:08:06] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.43.0-wmf.23 [core] (wmf/1.43.0-wmf.23) - 
10https://gerrit.wikimedia.org/r/1073307 (https://phabricator.wikimedia.org/T373642) [01:08:08] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.43.0-wmf.23 [core] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073307 (https://phabricator.wikimedia.org/T373642) (owner: 10TrainBranchBot) [01:08:23] (03PS1) 10Dzahn: gerrit::proxy: fix link target for gerrit logo [puppet] - 10https://gerrit.wikimedia.org/r/1073308 (https://phabricator.wikimedia.org/T372804) [01:10:57] (03CR) 10Dzahn: [C:03+1] "Either way "ensure => link" combined with "source =>" seems wrong. Links need targets, files need sources." [puppet] - 10https://gerrit.wikimedia.org/r/1073308 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [01:13:46] (03CR) 10Dzahn: [C:03+1] "ah, it's also possible this was the older puppet syntax back in the days in 2016 and we only have the clearer syntax to manage symlinks si" [puppet] - 10https://gerrit.wikimedia.org/r/1073308 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [01:35:30] (03Merged) 10jenkins-bot: Branch commit for wmf/1.43.0-wmf.23 [core] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073307 (https://phabricator.wikimedia.org/T373642) (owner: 10TrainBranchBot) [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T0200) [02:39:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:59:13] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T0300) [03:01:30] (03PS1) 10TrainBranchBot: testwikis to 1.43.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073310 (https://phabricator.wikimedia.org/T373642) [03:01:31] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.43.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073310 (https://phabricator.wikimedia.org/T373642) (owner: 10TrainBranchBot) [03:02:23] (03Merged) 10jenkins-bot: testwikis to 1.43.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073310 (https://phabricator.wikimedia.org/T373642) (owner: 10TrainBranchBot) [03:02:44] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.23 refs T373642 [03:02:48] T373642: 1.43.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T373642 [03:19:03] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [03:29:03] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is OK: Files ownership is ok. 
https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [03:37:47] RECOVERY - dump of es6 in eqiad on backupmon1001 is OK: Last dump for es6 at eqiad (es1036) taken on 2024-09-17 00:00:07 (448 GiB, +4.9 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [03:43:53] RECOVERY - dump of es7 in eqiad on backupmon1001 is OK: Last dump for es7 at eqiad (es1040) taken on 2024-09-17 00:00:07 (448 GiB, +4.9 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [03:53:32] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.43.0-wmf.23 refs T373642 (duration: 50m 47s) [03:53:37] T373642: 1.43.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T373642 [04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T0400) [04:00:59] !log mwpresync@deploy1003 Pruned MediaWiki: 1.43.0-wmf.20 (duration: 00m 58s) [04:03:47] RECOVERY - dump of es6 in codfw on backupmon1001 is OK: Last dump for es6 at codfw (es2036) taken on 2024-09-17 00:00:00 (448 GiB, +4.9 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [04:53:51] RECOVERY - dump of es7 in codfw on backupmon1001 is OK: Last dump for es7 at codfw (es2040) taken on 2024-09-17 00:00:00 (448 GiB, +4.9 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [04:54:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [05:05:44] (03CR) 10Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (0339 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [05:58:49] 
PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp1101 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [05:59:49] RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp1101 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T0600) [06:00:05] marostegui, Amir1, and arnaudb: Time to snap out of that daydream and deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T0600). 
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:08:07] PROBLEM - MD RAID on puppetmaster1003 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 2, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:08:09] ACKNOWLEDGEMENT - MD RAID on puppetmaster1003 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 2, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T374901 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:08:12] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on puppetmaster1003 - https://phabricator.wikimedia.org/T374901 (10ops-monitoring-bot) 03NEW [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:35:45] (03CR) 10Arthur taylor: [C:03+1] "Looks good to me!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073247 (https://phabricator.wikimedia.org/T373088) (owner: 10Lucas Werkmeister (WMDE)) [07:00:04] Amir1 and Urbanecm: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T0700). nyaa~ [07:00:04] hashar: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:02:27] (03PS1) 10Muehlenhoff: Remove puppetmaster1003 from active Puppet 5 servers [puppet] - 10https://gerrit.wikimedia.org/r/1073321 (https://phabricator.wikimedia.org/T374901) [07:05:12] o/ [07:05:53] I will deploy the sole patch that has been scheduled for this window [07:06:03] in some minutes, I am late on my coffee and morning routine [07:08:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073232 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [07:09:55] gr [07:10:09] (03PS2) 10Hashar: logging: Default to log any error (on group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073232 (https://phabricator.wikimedia.org/T228838) [07:11:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'depool db2121 T374845', diff saved to https://phabricator.wikimedia.org/P69198 and previous config saved to /var/cache/conftool/dbconfig/20240917-071120-arnaudb.json [07:11:25] T374845: decommission db2121.codfw.wmnet - https://phabricator.wikimedia.org/T374845 [07:12:29] (03CR) 10TrainBranchBot: "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073232 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [07:19:27] (03Merged) 10jenkins-bot: logging: Default to log any error (on group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073232 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [07:20:08] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1073232|logging: Default to log any error (on group1) (T228838)]] [07:20:13] T228838: Consider enabling all MW log channels by default for WMF - https://phabricator.wikimedia.org/T228838 [07:20:13] PROBLEM - eventlogging Varnishkafka log producer on cp3072 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/varnishkafka -S 
/etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [07:21:13] RECOVERY - eventlogging Varnishkafka log producer on cp3072 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [07:26:12] (03PS1) 10Arnaudb: mariadb: remove db2121 [puppet] - 10https://gerrit.wikimedia.org/r/1073324 (https://phabricator.wikimedia.org/T374845) [07:26:16] (03CR) 10Arnaudb: [C:03+2] mariadb: remove db2121 [puppet] - 10https://gerrit.wikimedia.org/r/1073324 (https://phabricator.wikimedia.org/T374845) (owner: 10Arnaudb) [07:26:31] !log testing purged 0.23 in cp4038 - T334078 [07:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:36] T334078: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 [07:27:13] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1072547 (owner: 10Slyngshede) [07:27:35] (03CR) 10Slyngshede: [C:03+2] Menu: Add menu entry for managers to view pending permission requests. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1072547 (owner: 10Slyngshede) [07:28:10] 06SRE, 06Infrastructure-Foundations: Monitoring outgoing traffic for hosts with risky services - https://phabricator.wikimedia.org/T102104#10151656 (10MoritzMuehlenhoff) 05Open→03Declined [07:28:23] !log hashar@deploy1003 hashar: Backport for [[gerrit:1073232|logging: Default to log any error (on group1) (T228838)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:28:27] T228838: Consider enabling all MW log channels by default for WMF - https://phabricator.wikimedia.org/T228838 [07:29:08] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2121.codfw.wmnet [07:29:09] !log hashar@deploy1003 hashar: Continuing with sync [07:31:39] (03Merged) 10jenkins-bot: Menu: Add menu entry for managers to view pending permission requests. [software/bitu] - 10https://gerrit.wikimedia.org/r/1072547 (owner: 10Slyngshede) [07:33:36] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [07:35:44] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073232|logging: Default to log any error (on group1) (T228838)]] (duration: 15m 36s) [07:35:52] T228838: Consider enabling all MW log channels by default for WMF - https://phabricator.wikimedia.org/T228838 [07:37:35] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [07:38:17] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2121.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [07:41:09] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config 
diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [07:41:52] (03CR) 10Muehlenhoff: [C:03+1] "Nice test coverage! Couple of nits inline, but LGTM otherwise." [software/bitu] - 10https://gerrit.wikimedia.org/r/1072528 (owner: 10Slyngshede) [07:42:38] (03PS1) 10Brouberol: cloudnative-pg-cluster: set sane defaults values for PG clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073392 (https://phabricator.wikimedia.org/T372278) [07:42:50] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2121.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [07:42:50] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:42:51] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2121.codfw.wmnet [07:43:49] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10151687 (10MoritzMuehlenhoff) [07:48:29] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission db2121.codfw.wmnet - https://phabricator.wikimedia.org/T374845#10151688 (10ABran-WMF) a:05ABran-WMF→03None this host is ready to be handled [07:48:38] (03PS2) 10Slyngshede: Notify managers via email [software/bitu] - 10https://gerrit.wikimedia.org/r/1073238 [07:49:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2121 T374846', diff saved to https://phabricator.wikimedia.org/P69199 and previous config saved to /var/cache/conftool/dbconfig/20240917-074918-arnaudb.json [07:49:23] T374846: decommission db2122.codfw.wmnet - https://phabricator.wikimedia.org/T374846 [07:51:10] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [07:51:32] the dbctl config alert might flap a bit today [07:52:09] 
(03PS1) 10Arnaudb: mariadb: remove db2122 [puppet] - 10https://gerrit.wikimedia.org/r/1073394 (https://phabricator.wikimedia.org/T374846) [07:52:12] (03CR) 10Arnaudb: [C:03+2] mariadb: remove db2122 [puppet] - 10https://gerrit.wikimedia.org/r/1073394 (https://phabricator.wikimedia.org/T374846) (owner: 10Arnaudb) [07:52:34] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [07:54:11] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2122.codfw.wmnet [07:58:10] (03CR) 10Slyngshede: Permission validation: Handle validation for manager approvals better. (034 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/1072528 (owner: 10Slyngshede) [07:58:13] (03CR) 10Ebrahim: "@jrobson@wikimedia.org Do you have any concern about this? I think this will be good to have tomorrow. Thanks 😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072600 (owner: 10Ebrahim) [07:58:41] (03PS10) 10Slyngshede: Permission validation: Handle validation for manager approvals better. [software/bitu] - 10https://gerrit.wikimedia.org/r/1072528 [07:58:50] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [08:00:05] jnuche and dduvall: How many deployers does it take to do MediaWiki train - Utc-0+Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T0800). [08:01:22] morning, deploying the train in a few minutes [08:02:02] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2122.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [08:03:01] (03CR) 10Slyngshede: [C:03+2] Permission validation: Handle validation for manager approvals better. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1072528 (owner: 10Slyngshede) [08:04:19] (03CR) 10Volans: [C:03+1] "I'd say to move this discussion to a task, IIRC we discussed this in the past when the yaml file was introduced and in some cases we wante" [cookbooks] - 10https://gerrit.wikimedia.org/r/1072612 (https://phabricator.wikimedia.org/T374047) (owner: 10Scott French) [08:04:38] (03PS1) 10TrainBranchBot: group0 to 1.43.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073395 (https://phabricator.wikimedia.org/T373642) [08:04:39] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.43.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073395 (https://phabricator.wikimedia.org/T373642) (owner: 10TrainBranchBot) [08:04:40] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2122.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [08:04:41] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:04:41] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2122.codfw.wmnet [08:04:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2121 T374846', diff saved to https://phabricator.wikimedia.org/P69200 and previous config saved to /var/cache/conftool/dbconfig/20240917-080453-arnaudb.json [08:04:57] T374846: decommission db2122.codfw.wmnet - https://phabricator.wikimedia.org/T374846 [08:05:29] (03Merged) 10jenkins-bot: group0 to 1.43.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073395 (https://phabricator.wikimedia.org/T373642) (owner: 10TrainBranchBot) [08:05:45] (03Merged) 10jenkins-bot: Permission validation: Handle validation for manager approvals better. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1072528 (owner: 10Slyngshede) [08:10:37] (03CR) 10Jcrespo: "Ok with the general idea, but please coordinate with cloud, which is why this parameter was introduced in the first place." [puppet] - 10https://gerrit.wikimedia.org/r/1073292 (owner: 10JHathaway) [08:10:38] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, one typo inline" [puppet] - 10https://gerrit.wikimedia.org/r/1073233 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [08:11:51] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission db2122.codfw.wmnet - https://phabricator.wikimedia.org/T374846#10151740 (10ABran-WMF) this host is ready to be handled [08:12:10] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.43.0-wmf.23 refs T373642 [08:12:14] T373642: 1.43.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T373642 [08:13:10] jnuche: good morning! [08:13:41] jnuche: I have pushed to group1 a config change to have errors logged for all channels [08:13:52] so far, mostly from the `http` channel [08:14:14] and that started around 7:35. I will triage them :) [08:14:26] (03PS1) 10Arnaudb: mariadb: remove db2124 [puppet] - 10https://gerrit.wikimedia.org/r/1073397 (https://phabricator.wikimedia.org/T374847) [08:14:28] (03CR) 10Arnaudb: [C:03+2] mariadb: remove db2124 [puppet] - 10https://gerrit.wikimedia.org/r/1073397 (https://phabricator.wikimedia.org/T374847) (owner: 10Arnaudb) [08:14:51] hashar: morning, is that grafana you're talking about? 
[08:15:38] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1073291 (https://phabricator.wikimedia.org/T374047) (owner: 10Scott French) [08:16:18] (03CR) 10DCausse: "if willing to undeploy the service I think a safer approach is to:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072597 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [08:16:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 T374847', diff saved to https://phabricator.wikimedia.org/P69201 and previous config saved to /var/cache/conftool/dbconfig/20240917-081642-arnaudb.json [08:16:47] T374847: decommission db2124.codfw.wmnet - https://phabricator.wikimedia.org/T374847 [08:17:24] (03PS34) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) [08:18:00] (03CR) 10CI reject: [V:04-1] prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [08:18:35] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2124.codfw.wmnet [08:18:38] 06SRE, 06Infrastructure-Foundations, 07LDAP, 07Security: Have a check to prevent non-existent accounts from being added to LDAP groups - https://phabricator.wikimedia.org/T201779#10151797 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff These days we have Bitu running on idm.wikimedi... [08:19:24] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations: CLI tools for CAS administration - https://phabricator.wikimedia.org/T233940#10151801 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff The original script landed in the logout cookbook. 
[08:20:22] I see a spike in errors related to search coming from commons, but that started already last night [08:20:41] (03CR) 10David Caro: [C:03+2] typos: add colud to the list [puppet] - 10https://gerrit.wikimedia.org/r/1072188 (owner: 10David Caro) [08:20:58] hashar: don't see anything else right now, if there's something I should be monitoring during this train, please let me know [08:21:05] ^^ dcausse : could you have a look at error spike in search? [08:21:49] gehel: looking [08:21:55] thanks! [08:23:18] jnuche: is it "Search is currently too busy. Please try again later" that you're seeing? [08:23:31] jnuche: not much. There is some more mediawiki log messages with the `error` level but they don't show on the "mediawiki-NEW-errors" nor do they alert. Actual errors are in the 'error' channel (and the exception channel) [08:23:38] dcausse: yep [08:24:11] hashar: ack, thx [08:24:59] jnuche: ok, sadly these are "expected" so I think it's safe for you to ignore them [08:25:08] (03CR) 10David Caro: [C:03+1] "I think we can already manage them with tofu no?" [puppet] - 10https://gerrit.wikimedia.org/r/1034052 (owner: 10FNegri) [08:26:01] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [08:26:13] dcausse: alright, do you have an idea of how long will we see them? [08:27:25] jnuche: they generally don't last for long, it's traffic dependent, looking closer [08:27:31] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1072552 (owner: 10Slyngshede) [08:27:51] 👍 [08:27:59] (03CR) 10Slyngshede: [C:03+2] Allow users to see log entires made by managers. [software/bitu] - 10https://gerrit.wikimedia.org/r/1072552 (owner: 10Slyngshede) [08:30:54] (03Merged) 10jenkins-bot: Allow users to see log entires made by managers. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1072552 (owner: 10Slyngshede) [08:31:18] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2124.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [08:31:42] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2124.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [08:31:43] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:31:43] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2124.codfw.wmnet [08:33:01] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2124.codfw.wmnet - https://phabricator.wikimedia.org/T374847#10151843 (10ABran-WMF) this host is ready to be handled [08:36:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2125 T374848', diff saved to https://phabricator.wikimedia.org/P69202 and previous config saved to /var/cache/conftool/dbconfig/20240917-083652-arnaudb.json [08:36:57] T374848: decommission db2125.codfw.wmnet - https://phabricator.wikimedia.org/T374848 [08:37:32] (03PS2) 10JMeybohm: Don't restart(stop,start) ferm on puppet notify, use reload instead [puppet] - 10https://gerrit.wikimedia.org/r/1073233 (https://phabricator.wikimedia.org/T374366) [08:38:26] jnuche: I'll file a task to investigate more, the fact these are only commons is surprising, these errors should be totally unrelated with what you're deploying so it's safe to ignore them (sorry for the noise...) 
[08:38:33] (03PS1) 10Arnaudb: mariadb: remove db2125 [puppet] - 10https://gerrit.wikimedia.org/r/1073399 (https://phabricator.wikimedia.org/T374848) [08:38:35] (03CR) 10Arnaudb: [C:03+2] mariadb: remove db2125 [puppet] - 10https://gerrit.wikimedia.org/r/1073399 (https://phabricator.wikimedia.org/T374848) (owner: 10Arnaudb) [08:39:45] dcausse: sounds good, thx for looking into it [08:40:00] 10SRE-swift-storage, 06Data-Persistence, 07Wikimedia-production-error: Some POST of thumbnails to Swift time out - https://phabricator.wikimedia.org/T374911 (10hashar) 03NEW [08:40:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2125.codfw.wmnet [08:40:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T374848', diff saved to https://phabricator.wikimedia.org/P69203 and previous config saved to /var/cache/conftool/dbconfig/20240917-084036-arnaudb.json [08:45:02] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [08:46:40] (03CR) 10Elukey: [C:03+1] Remove puppetmaster1003 from active Puppet 5 servers [puppet] - 10https://gerrit.wikimedia.org/r/1073321 (https://phabricator.wikimedia.org/T374901) (owner: 10Muehlenhoff) [08:48:11] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2125.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [08:49:38] (03PS1) 10AikoChou: ml-services: increase cpu and memory for ref-quality isvc in exp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073400 [08:49:57] jouncebot: next [08:49:57] In 1 hour(s) and 10 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T1000) [08:50:15] !log elukey@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=helm-charts,name=eqiad [08:52:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: 
"Triggered by cookbooks.sre.dns.netbox: db2125.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [08:52:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:52:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2125.codfw.wmnet [08:52:42] (03PS35) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) [08:53:06] (03PS1) 10Elukey: Set puppet7 for chartmuseum1001 [puppet] - 10https://gerrit.wikimedia.org/r/1073401 (https://phabricator.wikimedia.org/T331969) [08:54:13] (03CR) 10Elukey: "Going to reimage the node :)" [puppet] - 10https://gerrit.wikimedia.org/r/1073401 (https://phabricator.wikimedia.org/T331969) (owner: 10Elukey) [08:54:16] jouncebot: nowandnext [08:54:16] For the next 1 hour(s) and 5 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T0800) [08:54:16] In 1 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T1000) [08:54:23] ok :) [08:54:43] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission db2125.codfw.wmnet - https://phabricator.wikimedia.org/T374848#10151920 (10ABran-WMF) this host is ready to be handled [08:54:45] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission db2125.codfw.wmnet - https://phabricator.wikimedia.org/T374848#10151925 (10ABran-WMF) a:05ABran-WMF→03None [08:54:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:54:59] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" 
[puppet] - 10https://gerrit.wikimedia.org/r/1073401 (https://phabricator.wikimedia.org/T331969) (owner: 10Elukey) [08:55:06] (03CR) 10Elukey: [C:03+2] Set puppet7 for chartmuseum1001 [puppet] - 10https://gerrit.wikimedia.org/r/1073401 (https://phabricator.wikimedia.org/T331969) (owner: 10Elukey) [08:55:40] !log elukey@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002" [08:55:41] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2001.codfw.wmnet with OS bookworm [08:55:51] !log elukey@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002" [08:55:51] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2001.codfw.wmnet with OS bookworm [08:56:21] 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, 07Wikimedia-production-error: Some POST of thumbnails to Swift time out - https://phabricator.wikimedia.org/T374911#10151933 (10MatthewVernon) I'm a bit confused by this report - your text talks about POST, but the logs relate to HEAD? Anyhow, this is mo... [08:56:52] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host chartmuseum1001.eqiad.wmnet with OS bookworm [08:56:58] 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, 07Wikimedia-production-error: Some POST of thumbnails to Swift time out - https://phabricator.wikimedia.org/T374911#10151940 (10MatthewVernon) Looking at the thumbor graphs it does look like //something// changed around 08:00 on 16 September. 
[08:57:02] 06SRE, 06serviceops, 13Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10151941 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host chartmuseum1001.eqiad.wmnet with OS bookworm [08:58:25] (03PS1) 10JMeybohm: kafka::broker: Populate cert SAN with hostname and IPs [puppet] - 10https://gerrit.wikimedia.org/r/1073402 (https://phabricator.wikimedia.org/T374729) [09:00:11] (03PS2) 10JMeybohm: kafka::broker: Populate cert SAN with hostname and IPs [puppet] - 10https://gerrit.wikimedia.org/r/1073402 (https://phabricator.wikimedia.org/T374729) [09:00:23] (03PS16) 10Hashar: logging: Default to log any error (all wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) [09:00:54] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1073402 (https://phabricator.wikimedia.org/T374729) (owner: 10JMeybohm) [09:01:03] (03CR) 10CI reject: [V:04-1] logging: Default to log any error (all wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [09:01:48] 10SRE-tools, 06cloud-services-team, 06Infrastructure-Foundations, 07IPv6: Some WMCS clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271139#10151973 (10Volans) This is the update list as of today: `clouddb2002-dev,cloudlb2004-dev,clouddb[1013-1020]`. I guess that the clouddb are... 
[09:02:37] (03PS2) 10Elukey: Swap poolcounter2003 with poolcounter2005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072206 (https://phabricator.wikimedia.org/T332015) [09:04:13] FIRING: JobUnavailable: Reduced availability for job chartmuseum in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:06:33] (03PS1) 10AikoChou: ml-services: deploy ref-quality to prod in new ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073404 (https://phabricator.wikimedia.org/T371902) [09:06:56] (03PS3) 10Slyngshede: Notify managers via email [software/bitu] - 10https://gerrit.wikimedia.org/r/1073238 [09:07:24] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1073233 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [09:07:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2127 T374849', diff saved to https://phabricator.wikimedia.org/P69204 and previous config saved to /var/cache/conftool/dbconfig/20240917-090733-arnaudb.json [09:07:39] T374849: decommission db2127.codfw.wmnet - https://phabricator.wikimedia.org/T374849 [09:07:40] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on chartmuseum1001.eqiad.wmnet with reason: host reimage [09:08:16] (03CR) 10JMeybohm: [C:03+2] Don't restart(stop,start) ferm on puppet notify, use reload instead (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1073233 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [09:09:08] (03CR) 10Elukey: [C:03+1] "LGTM, but please puppet-disable profile::kafka and roll it out carefully :)" [puppet] - 10https://gerrit.wikimedia.org/r/1073402 (https://phabricator.wikimedia.org/T374729) (owner: 10JMeybohm) [09:10:06] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on chartmuseum1001.eqiad.wmnet with 
reason: host reimage [09:10:32] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072206 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [09:11:43] (03CR) 10Elukey: "recheck" [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/1073183 (https://phabricator.wikimedia.org/T331969) (owner: 10Muehlenhoff) [09:13:02] (03CR) 10AikoChou: "load testing shows little difference between 2Gi and 4Gi memory config, so I'll use 2Gi in prod deployment." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073400 (owner: 10AikoChou) [09:13:47] (03CR) 10Muehlenhoff: "The CI check will fail anyway, given that it's configured to use buster and buster-backports is retired (so the Golang needed to build Cha" [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/1073183 (https://phabricator.wikimedia.org/T331969) (owner: 10Muehlenhoff) [09:14:46] (03PS17) 10Hashar: logging: Default to log any error (all wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) [09:15:16] (03PS1) 10Arnaudb: mariadb: remove db2127 [puppet] - 10https://gerrit.wikimedia.org/r/1073406 (https://phabricator.wikimedia.org/T374849) [09:15:17] (03CR) 10Arnaudb: [C:03+2] mariadb: remove db2127 [puppet] - 10https://gerrit.wikimedia.org/r/1073406 (https://phabricator.wikimedia.org/T374849) (owner: 10Arnaudb) [09:17:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2127 T374849', diff saved to https://phabricator.wikimedia.org/P69205 and previous config saved to /var/cache/conftool/dbconfig/20240917-091706-arnaudb.json [09:17:11] T374849: decommission db2127.codfw.wmnet - https://phabricator.wikimedia.org/T374849 [09:17:13] (03CR) 10Jcrespo: "I think 1 second would be enough. 
It is usually milliseconds, and while there is no upper limit to that (there can be a performance proble" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1073274 (owner: 10Volans) [09:17:34] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2127.codfw.wmnet [09:19:02] (03PS2) 10Volans: mysql_legacy: small fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1073274 [09:19:14] (03CR) 10Volans: "Ack, changed to 1s." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1073274 (owner: 10Volans) [09:19:22] (03PS1) 10Hashar: logging: rm per channel 'error' logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073408 (https://phabricator.wikimedia.org/T228838) [09:21:44] (03CR) 10Elukey: "Ah ok because I noticed the following and I was wondering if it was temporary:" [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/1073183 (https://phabricator.wikimedia.org/T331969) (owner: 10Muehlenhoff) [09:22:20] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [09:22:29] hashar: o/ (if you have time) - I'd need to deploy a mw-config change during the next MW Infra deploy window, what is the best scap command nowadays?
[09:22:35] change is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1072206 [09:24:13] RESOLVED: JobUnavailable: Reduced availability for job chartmuseum in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:25:26] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2127.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [09:26:31] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host chartmuseum1001.eqiad.wmnet with OS bookworm [09:26:42] 06SRE, 06serviceops, 13Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10152056 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host chartmuseum1001.eqiad.wmnet with OS bookworm completed: - chartmuseum1001 (**PASS**)... 
[09:27:58] (03CR) 10Muehlenhoff: [C:03+2] Remove puppetmaster1003 from active Puppet 5 servers [puppet] - 10https://gerrit.wikimedia.org/r/1073321 (https://phabricator.wikimedia.org/T374901) (owner: 10Muehlenhoff) [09:28:23] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2127.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [09:28:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:28:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2127.codfw.wmnet [09:30:20] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: decommission db2127.codfw.wmnet - https://phabricator.wikimedia.org/T374849#10152069 (10ABran-WMF) a:05ABran-WMF→03None this host is ready to be handled [09:31:43] !log installing python-jwcrypto security updates [09:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:43] !log elukey@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=eqiad [09:33:51] (03PS1) 10Elukey: sre.hosts.reimage: force TFTP for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1073409 [09:38:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2137 T374851', diff saved to https://phabricator.wikimedia.org/P69206 and previous config saved to /var/cache/conftool/dbconfig/20240917-093850-arnaudb.json [09:38:56] T374851: decommission db2137.codfw.wmnet - https://phabricator.wikimedia.org/T374851 [09:38:59] (03CR) 10JMeybohm: [C:03+1] "LGTM, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1073281 (owner: 10JHathaway) [09:41:25] (03PS1) 10Arnaudb: mariadb: remove db2137 [puppet] - 10https://gerrit.wikimedia.org/r/1073410 (https://phabricator.wikimedia.org/T374851) [09:41:26] (03CR) 10Arnaudb: [C:03+2] mariadb: remove db2137 [puppet] - 10https://gerrit.wikimedia.org/r/1073410 
(https://phabricator.wikimedia.org/T374851) (owner: 10Arnaudb) [09:42:09] (03CR) 10Volans: [C:04-1] "I think there is a typo, LGTM otherwise" [cookbooks] - 10https://gerrit.wikimedia.org/r/1073409 (owner: 10Elukey) [09:42:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 52.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:42:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2137 T374851', diff saved to https://phabricator.wikimedia.org/P69207 and previous config saved to /var/cache/conftool/dbconfig/20240917-094241-arnaudb.json [09:42:54] (03PS1) 10Filippo Giunchedi: corto: force directory removal [puppet] - 10https://gerrit.wikimedia.org/r/1073412 [09:43:08] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2137.codfw.wmnet [09:43:15] (03CR) 10CI reject: [V:04-1] corto: force directory removal [puppet] - 10https://gerrit.wikimedia.org/r/1073412 (owner: 10Filippo Giunchedi) [09:43:49] (03CR) 10Jcrespo: [C:03+1] mysql_legacy: small fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1073274 (owner: 10Volans) [09:44:05] (03CR) 10Volans: [C:03+2] mysql_legacy: small fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1073274 (owner: 10Volans) [09:44:14] (03PS2) 10Filippo Giunchedi: corto: force directory removal [puppet] - 10https://gerrit.wikimedia.org/r/1073412 [09:45:17] !log disabling puppet on all kafka brokers for rollout of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1073402 [09:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:28] (03CR) 10JMeybohm: [C:03+2] kafka::broker: Populate cert SAN with hostname and IPs [puppet] - 
10https://gerrit.wikimedia.org/r/1073402 (https://phabricator.wikimedia.org/T374729) (owner: 10JMeybohm) [09:46:38] (03PS2) 10Elukey: sre.hosts.reimage: force TFTP for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1073409 [09:46:43] (03CR) 10Elukey: sre.hosts.reimage: force TFTP for Supermicro (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1073409 (owner: 10Elukey) [09:47:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 52.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:47:48] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [09:48:47] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1073409 (owner: 10Elukey) [09:48:50] (03PS1) 10Muehlenhoff: Switch chartmuseum to Puppet 7 on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1073414 (https://phabricator.wikimedia.org/T349619) [09:49:43] (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1073404 (https://phabricator.wikimedia.org/T371902) (owner: 10AikoChou) [09:51:11] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2137.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [09:51:26] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2137.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [09:51:26] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:51:27] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2137.codfw.wmnet [09:51:37] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission db2137.codfw.wmnet - https://phabricator.wikimedia.org/T374851#10152217 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by arnaudb@cumin1002 for hosts: `db2137.codfw.wmnet` - db2137.codfw.wmnet (**PAS... 
[09:51:40] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on kafka-main2006.codfw.wmnet with reason: Rollout of 1073402 [09:51:55] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on kafka-main2006.codfw.wmnet with reason: Rollout of 1073402 [09:52:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2138 T374852', diff saved to https://phabricator.wikimedia.org/P69208 and previous config saved to /var/cache/conftool/dbconfig/20240917-095230-arnaudb.json [09:52:32] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission db2137.codfw.wmnet - https://phabricator.wikimedia.org/T374851#10152218 (10ABran-WMF) a:05ABran-WMF→03None this host is ready to be handled [09:52:35] T374852: decommission db2138.codfw.wmnet - https://phabricator.wikimedia.org/T374852 [09:54:40] (03Merged) 10jenkins-bot: mysql_legacy: small fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1073274 (owner: 10Volans) [09:54:56] (03PS1) 10Arnaudb: mariadb: remove db2138 [puppet] - 10https://gerrit.wikimedia.org/r/1073416 (https://phabricator.wikimedia.org/T374852) [09:54:57] (03CR) 10Arnaudb: [C:03+2] mariadb: remove db2138 [puppet] - 10https://gerrit.wikimedia.org/r/1073416 (https://phabricator.wikimedia.org/T374852) (owner: 10Arnaudb) [09:56:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2138 T374852', diff saved to https://phabricator.wikimedia.org/P69209 and previous config saved to /var/cache/conftool/dbconfig/20240917-095625-arnaudb.json [09:57:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2138.codfw.wmnet [09:57:22] !log jayme@cumin1002 START - Cookbook sre.hosts.remove-downtime for kafka-main2006.codfw.wmnet [09:57:23] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2006.codfw.wmnet [09:58:39] !log re-enable puppet on all kafka brokers [09:58:41] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:33] (03Abandoned) 10Muehlenhoff: Install a NOTICE file [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/1073183 (https://phabricator.wikimedia.org/T331969) (owner: 10Muehlenhoff) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T1000) [10:00:04] elukey: A patch you scheduled for MediaWiki infrastructure (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [10:00:31] (03PS1) 10Volans: CHANGELOG: add changelogs for release v8.13.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1073420 [10:01:47] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [10:05:51] (03CR) 10Elukey: [C:03+2] Swap poolcounter2003 with poolcounter2005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072206 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [10:05:54] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2138.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [10:06:12] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2138.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [10:06:12] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:06:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2138.codfw.wmnet [10:07:48] !log elukey@deploy1003 Started scap sync-world: Swap poolcounter2003 with poolcounter2005 [10:09:43] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission db2138.codfw.wmnet - 
https://phabricator.wikimedia.org/T374852#10152304 (10ABran-WMF) a:05ABran-WMF→03None this host is ready to be handled [10:11:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 22.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:11:32] (03CR) 10Klausman: [C:03+1] slo_template: update the SLO dates to current window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1073257 (owner: 10Ilias Sarantopoulos) [10:13:20] (03CR) 10Klausman: [C:03+2] slo_template: update the SLO dates to current window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1073257 (owner: 10Ilias Sarantopoulos) [10:13:22] (03CR) 10Klausman: [V:03+2 C:03+2] slo_template: update the SLO dates to current window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1073257 (owner: 10Ilias Sarantopoulos) [10:13:37] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v8.13.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1073420 (owner: 10Volans) [10:13:58] (03CR) 10Muehlenhoff: [C:03+2] Enable profile::auto_restarts::service for prometheus::pushgateway [puppet] - 10https://gerrit.wikimedia.org/r/1072733 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:16:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:16:19] (03PS2) 10Muehlenhoff: Add an explicit Hiera variable to determine the active swift ring server [puppet] - 
10https://gerrit.wikimedia.org/r/1072171 (https://phabricator.wikimedia.org/T366355) [10:17:20] (03CR) 10Kevin Bazira: [C:03+1] ml-services: increase cpu and memory for ref-quality isvc in exp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073400 (owner: 10AikoChou) [10:18:09] (03CR) 10Kevin Bazira: [C:03+1] ml-services: deploy ref-quality to prod in new ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073404 (https://phabricator.wikimedia.org/T371902) (owner: 10AikoChou) [10:22:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 20% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:23:00] (03CR) 10Klausman: [C:03+1] sre.hosts.reimage: force TFTP for Supermicro (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1073409 (owner: 10Elukey) [10:23:25] (03CR) 10Elukey: [C:03+2] sre.hosts.reimage: force TFTP for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1073409 (owner: 10Elukey) [10:25:27] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v8.13.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1073420 (owner: 10Volans) [10:26:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 20% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:29:48] (03PS1) 10Volans: Upstream release v8.13.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1073424 [10:30:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 2.5% 
idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:33:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 1m 13s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:35:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 22.92% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:37:11] (03PS3) 10Fabfur: hiera: enable haproxykafka on cp3066 for testing [puppet] - 10https://gerrit.wikimedia.org/r/1072714 (https://phabricator.wikimedia.org/T374473) [10:37:14] !log elukey@deploy1003 Finished scap sync-world: Swap poolcounter2003 with poolcounter2005 (duration: 30m 00s) [10:38:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 1m 13s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:38:27] !log elukey@deploy1003 Started scap sync-world: Backport for [[gerrit:1072206|Swap poolcounter2003 with poolcounter2005 (T332015)]] [10:38:31] T332015: Migrate poolcounter hosts to 
bookworm - https://phabricator.wikimedia.org/T332015 [10:40:16] (03CR) 10Vgutierrez: [C:03+1] hiera: enable haproxykafka on cp3066 for testing [puppet] - 10https://gerrit.wikimedia.org/r/1072714 (https://phabricator.wikimedia.org/T374473) (owner: 10Fabfur) [10:40:41] (03CR) 10Vgutierrez: hiera: enable haproxykafka on cp3066 for testing [puppet] - 10https://gerrit.wikimedia.org/r/1072714 (https://phabricator.wikimedia.org/T374473) (owner: 10Fabfur) [10:41:12] (03CR) 10Vgutierrez: hiera: enable haproxykafka on cp3066 for testing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072714 (https://phabricator.wikimedia.org/T374473) (owner: 10Fabfur) [10:43:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:44:19] (03PS4) 10Fabfur: hiera: enable haproxykafka on cp3066 for testing [puppet] - 10https://gerrit.wikimedia.org/r/1072714 (https://phabricator.wikimedia.org/T374473) [10:44:44] (03CR) 10Fabfur: hiera: enable haproxykafka on cp3066 for testing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072714 (https://phabricator.wikimedia.org/T374473) (owner: 10Fabfur) [10:45:06] !log elukey@deploy1003 elukey: Backport for [[gerrit:1072206|Swap poolcounter2003 with poolcounter2005 (T332015)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:45:10] T332015: Migrate poolcounter hosts to bookworm - https://phabricator.wikimedia.org/T332015 [10:45:18] (03PS2) 10Effie Mouzeli: ipoid: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071843 (https://phabricator.wikimedia.org/T356885) [10:45:19] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host deploy1003.eqiad.wmnet 
[10:45:22] !log elukey@deploy1003 elukey: Continuing with sync [10:46:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 32.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:46:20] effie, Amir1: puppet is disabled on mwdebug1001 with your name on it, should it be enabled? [10:46:44] (03CR) 10Volans: [C:03+2] Upstream release v8.13.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1073424 (owner: 10Volans) [10:47:06] (03CR) 10Vgutierrez: [C:03+1] hiera: enable haproxykafka on cp3066 for testing [puppet] - 10https://gerrit.wikimedia.org/r/1072714 (https://phabricator.wikimedia.org/T374473) (owner: 10Fabfur) [10:47:28] (03PS1) 10Muehlenhoff: Switch deploy1003 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1073425 (https://phabricator.wikimedia.org/T349619) [10:47:49] (03CR) 10Fabfur: [C:03+2] hiera: enable haproxykafka on cp3066 for testing [puppet] - 10https://gerrit.wikimedia.org/r/1072714 (https://phabricator.wikimedia.org/T374473) (owner: 10Fabfur) [10:48:05] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072171 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff) [10:48:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 22.92% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:48:21] (03PS2) 10Muehlenhoff: Switch deploy1003 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1073425 
(https://phabricator.wikimedia.org/T349619) [10:48:32] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:48:37] hnowlan: sigh, we forgot all about it after closing the relevant task [10:48:49] hnowlan: let me sort it, sorry [10:49:07] np, thanks [10:49:50] sorted [10:50:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:50:47] (03CR) 10AikoChou: [C:03+2] ml-services: deploy ref-quality to prod in new ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073404 (https://phabricator.wikimedia.org/T371902) (owner: 10AikoChou) [10:51:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 1m 53s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:51:39] (03Merged) 10jenkins-bot: ml-services: deploy ref-quality to prod in new ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073404 (https://phabricator.wikimedia.org/T371902) (owner: 10AikoChou) [10:51:46] !log elukey@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072206|Swap poolcounter2003 with poolcounter2005 (T332015)]] (duration: 13m 19s) [10:51:50] T332015: Migrate poolcounter hosts to bookworm - https://phabricator.wikimedia.org/T332015 [10:52:15] (03CR) 10Muehlenhoff: [C:03+2] Switch deploy1003 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1073425 
(https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:58:25] (03Merged) 10jenkins-bot: Upstream release v8.13.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1073424 (owner: 10Volans) [11:00:16] jouncebot: now [11:00:16] No deployments scheduled for the next 0 hour(s) and 59 minute(s) [11:00:30] I’ll roll out https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1073247 if that’s okay :) [11:01:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host deploy1003.eqiad.wmnet [11:01:23] (03PS2) 10Lucas Werkmeister (WMDE): termbox: update to 2024-09-09-102106-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073247 (https://phabricator.wikimedia.org/T373088) [11:01:29] (03PS2) 10Lucas Werkmeister (WMDE): termbox: remove redundant values-test.yaml lines [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073248 [11:01:54] (03PS1) 10Elukey: Swap poolcounter{2004,1004,1005} with the newer Bookworm-based hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073427 (https://phabricator.wikimedia.org/T332015) [11:04:46] moritzm: just to confirm, is it okay to deploy from deploy1003? [11:04:55] or are you still working on puppet things? [11:05:15] !log uploaded spicerack_8.13.1 to apt.wikimedia.org bullseye-wikimedia [11:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:46] 06SRE, 06serviceops, 13Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10152511 (10elukey) Last step https://gerrit.wikimedia.org/r/c/integration/config/+/1073426 [11:07:38] !log installed spicerack_8.13.1 on the cumin hosts [11:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:48] (03PS1) 10Seanleong-wmde: Revert^2 "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073428 [11:08:11] (03PS2) 10Seanleong-wmde: Revert^2 "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073428 [11:11:03] alright, I’ll go ahead with my deployment, shout if I should stop [11:11:12] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] termbox: update to 2024-09-09-102106-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073247 (https://phabricator.wikimedia.org/T373088) (owner: 10Lucas Werkmeister (WMDE)) [11:12:07] (03Merged) 10jenkins-bot: termbox: update to 2024-09-09-102106-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073247 (https://phabricator.wikimedia.org/T373088) (owner: 10Lucas Werkmeister (WMDE)) [11:12:53] hmph, there’s a diff in deployment-charts on deploy1003 [11:13:24] Lucas_WMDE: sorry, missed your ping. please go ahead, I'm done [11:13:31] alright, thanks :) [11:13:36] and congrats on the puppet 7 migration ^^ [11:13:56] there's still a few more roles to come :-) [11:14:14] hehe [11:14:22] (03PS1) 10Elukey: Set puppet config for registry2005 [puppet] - 10https://gerrit.wikimedia.org/r/1073429 (https://phabricator.wikimedia.org/T374928) [11:14:25] hnowlan: do you know about the uncommitted change in helmfile.d/services/shellbox-video/values.yaml by any chance? 
[11:14:36] (it doesn’t look like it affects me directly so I’ll just continue anyway) [11:14:53] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10152556 (10MoritzMuehlenhoff) [11:15:46] (03PS2) 10Elukey: Set puppet config for registry2005 [puppet] - 10https://gerrit.wikimedia.org/r/1073429 (https://phabricator.wikimedia.org/T374928) [11:15:56] !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] START helmfile.d/services/termbox: apply [11:16:17] hm, slightly bigger diff than expected [11:16:38] I guess the puppetca.crt.pem changed due to the Puppet 7 migration moritzm? okay to deploy? [11:16:56] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1073429 (https://phabricator.wikimedia.org/T374928) (owner: 10Elukey) [11:18:22] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1039.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:19:15] (03PS1) 10Dreamy Jazz: [WikiReplicas] Hide autoblock targets in the globalblocks table [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) [11:19:38] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1039.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:19:42] Lucas_WMDE: I'm surprised that this shows up in the diff? [11:19:50] it should be entirely local to the deployment server itself [11:20:01] should I put it in a paste or something? [11:20:14] it’s under “termbox, config-staging, ConfigMap (v1) has changed:” [11:20:28] (whereas the actual diff I want to deploy is apparently a separate thing, “termbox, termbox-staging, Deployment (apps) has changed:”) [11:21:34] good question, I'm not sure why it actually appears there and what it's being used within the deployment [11:21:57] can you copy it to a paste, so that the serviceops folks can have a look, please? 
[11:22:04] sure [11:23:27] well, I made it a private paste at https://phabricator.wikimedia.org/P69210 out of caution, but I’m not sure who I should subscribe to it for access [11:25:22] we can make it public, there should be nothing sensitive in it [11:25:42] Lucas_WMDE: yep, apologies. Cleaned up [11:25:57] hnowlan: np, thanks! [11:26:18] hm, not sure how to make the paste public [11:26:20] (03CR) 10Dreamy Jazz: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: 10Dreamy Jazz) [11:26:24] probably easiest to just make a new one ^^ [11:26:43] hnowlan: any idea why the puppet ca cert gets shown in the diff for ^ [11:27:03] public paste here: https://phabricator.wikimedia.org/P69211 [11:27:06] is it actually used for anything? most probably not given deploy used to be on Puppet 5 until earlier [11:29:26] hm, actually, puppetca.crt.pem rings a bell… we had some problem with that in T368523 I think (already resolved in the meantime) [11:29:27] T368523: Migrate wikibase-termbox to node20 - https://phabricator.wikimedia.org/T368523 [11:29:31] maybe that was the reason ^^ [11:30:08] (might not be worth investigating now though) [11:30:38] it's almost certainly used for ca validation but I'm not sure why it would have changed [11:30:43] when was termbox last deployed? [11:30:55] 18 july [11:30:57] https://sal.toolforge.org/production?p=0&q=termbox&d= [11:32:02] jouncebot: nowandnext [11:32:02] No deployments scheduled for the next 0 hour(s) and 27 minute(s) [11:32:02] In 0 hour(s) and 27 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T1200) [11:32:16] Dreamy_Jazz: I’m in the middle of a helmfile deployment fwiw [11:32:38] Sure. Wanted to get a link to the calendar, not planning on deploying. But thanks for letting me know. 
[11:32:44] ah, okay :) [11:32:44] puppet_ca_crt comes from $facts['puppet_config']['localcacert'] [11:33:29] so that updating on the deploy host has triggered this I guess? [11:33:42] I see a similar diff for wikifeeds which uses the same mechanism [11:34:37] should I just roll it out and see if it still works? [11:34:46] this is just the staging service anyway, it’s okay if it breaks for a bit [11:35:56] yeah give it a go [11:36:11] this file would have changed with puppetserver rollout I guess? [11:36:25] ok, going [11:36:32] either way everything should be using the wmf-certificates package these days [11:36:38] yes, it's now a cert issued by the Puppet 7 CA [11:36:40] !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] DONE helmfile.d/services/termbox: apply [11:36:45] * Lucas_WMDE tests [11:37:39] (03PS1) 10Btullis: Add a cephosd cluster and assign it to the appropriate hosts [puppet] - 10https://gerrit.wikimedia.org/r/1073434 (https://phabricator.wikimedia.org/T374932) [11:37:40] seems to work fine, let’s do eqiad+codfw then [11:37:46] !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] START helmfile.d/services/termbox: apply [11:38:26] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4002/co" [puppet] - 10https://gerrit.wikimedia.org/r/1073434 (https://phabricator.wikimedia.org/T374932) (owner: 10Btullis) [11:38:34] !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] DONE helmfile.d/services/termbox: apply [11:38:43] !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] START helmfile.d/services/termbox: apply [11:39:26] !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] DONE helmfile.d/services/termbox: apply [11:42:37] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "I’ll deploy this and check that `helmfile apply` shows no diff; if it does, I’ll revert." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1073248 (owner: 10Lucas Werkmeister (WMDE)) [11:43:52] (03Merged) 10jenkins-bot: termbox: remove redundant values-test.yaml lines [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073248 (owner: 10Lucas Werkmeister (WMDE)) [11:44:09] !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] START helmfile.d/services/termbox: apply [11:44:11] !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] DONE helmfile.d/services/termbox: apply [11:44:16] no diff, yay [11:44:20] !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] START helmfile.d/services/termbox: apply [11:44:22] !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] DONE helmfile.d/services/termbox: apply [11:44:25] !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] START helmfile.d/services/termbox: apply [11:44:29] !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] DONE helmfile.d/services/termbox: apply [11:44:32] 2× ditto \o/ [11:44:50] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "There was indeed no diff on `staging`, `eqiad` or `codfw`." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1073248 (owner: 10Lucas Werkmeister (WMDE)) [11:45:21] * Lucas_WMDE done deploying [11:48:32] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:50:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:53:20] (03CR) 10Effie Mouzeli: [C:03+2] app.job: update to job 3.0.0 (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072500 (owner: 10Effie Mouzeli) [11:54:26] (03Merged) 10jenkins-bot: app.job: update to job 3.0.0 (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072500 (owner: 10Effie Mouzeli) [11:54:33] (03Merged) 10jenkins-bot: app.job: update to job 3.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072502 (owner: 10Effie Mouzeli) [11:58:05] (03PS3) 10Effie Mouzeli: ipoid: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071843 (https://phabricator.wikimedia.org/T356885) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T1200) [12:02:22] !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
[12:09:30] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 214, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:10:49] (03CR) 10Muehlenhoff: [C:03+2] Add an explicit Hiera variable to determine the active swift ring server [puppet] - 10https://gerrit.wikimedia.org/r/1072171 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff) [12:11:30] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:15:20] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:20:00 on cr1-eqiad with reason: enable ixp port cr1-eqiad [12:15:36] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on cr1-eqiad with reason: enable ixp port cr1-eqiad [12:15:58] !log disable Equinix IXP peering on cr1-eqiad in advance of port move [12:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:06] (03PS1) 10Effie Mouzeli: ipoid: update to job 3.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073443 (https://phabricator.wikimedia.org/T356885) [12:22:59] (03PS1) 10AikoChou: ml-services: update resources config for ref-quality isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073444 [12:23:03] (03PS1) 10Brouberol: airflow: ensure each airflow release store logs to a unique s3 bucket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073445 (https://phabricator.wikimedia.org/T372281) [12:23:05] (03PS1) 10Brouberol: cloudnative-pg-cluster: ensure each cloudnativePG cluster is assigned a unique s3 bucket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073446 (https://phabricator.wikimedia.org/T374938) [12:23:23] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Move 
puppet-merge (bash script) to puppetserver1001 - https://phabricator.wikimedia.org/T374443#10152815 (10MoritzMuehlenhoff) Other patches merged for this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1072171 https://gerrit.wikimedia.org... [12:23:33] !log reconfigure cr1-eqiad xe-3/0/6 into LAG group ae6 for Equinix IXP peering T370696 [12:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:29] (03PS2) 10Brouberol: airflow: ensure each airflow release store logs to a unique s3 bucket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073445 (https://phabricator.wikimedia.org/T372281) [12:28:30] (03PS2) 10Brouberol: cloudnative-pg-cluster: ensure each cloudnativePG cluster is assigned a unique s3 bucket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073446 (https://phabricator.wikimedia.org/T374938) [12:29:33] (03PS3) 10Brouberol: cloudnative-pg-cluster: ensure each cloudnativePG cluster is assigned a unique s3 bucket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073446 (https://phabricator.wikimedia.org/T374938) [12:29:36] 06SRE, 06Infrastructure-Foundations: bmc-config (and thus ipmi_lan fact) returns 0.0.0.0 under certain conditions - https://phabricator.wikimedia.org/T321314#10152856 (10joanna_borun) If this is still an issue please reopen task. [12:29:38] 06SRE, 06Infrastructure-Foundations: bmc-config (and thus ipmi_lan fact) returns 0.0.0.0 under certain conditions - https://phabricator.wikimedia.org/T321314#10152857 (10joanna_borun) 05Open→03Invalid [12:30:37] (03PS4) 10Slyngshede: Notify managers via email when new permission requests are made. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1073238 [12:30:40] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:20:00 on cr1-eqiad with reason: enable ixp port cr1-eqiad [12:30:45] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on cr1-eqiad with reason: enable ixp port cr1-eqiad [12:34:56] (03PS1) 10Brouberol: airflow: allow the webserver and scheduler to be deployed or not [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073447 (https://phabricator.wikimedia.org/T374936) [12:35:16] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2165 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1073448 (https://phabricator.wikimedia.org/T374946) [12:35:50] (03CR) 10CI reject: [V:04-1] airflow: allow the webserver and scheduler to be deployed or not [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073447 (https://phabricator.wikimedia.org/T374936) (owner: 10Brouberol) [12:38:02] (03PS2) 10Brouberol: airflow: allow the webserver and scheduler to be deployed or not [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073447 (https://phabricator.wikimedia.org/T374936) [12:38:30] !log disable Equinix IXP BGP peers on cr2-eqiad before reconfiguring port as LAG T370696 [12:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:20] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 32 hosts with reason: Primary switchover s8 T374946 [12:39:24] T374946: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T374946 [12:39:38] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 478, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:40:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s8 T374946 [12:40:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2165 with weight 0 
T374946', diff saved to https://phabricator.wikimedia.org/P69212 and previous config saved to /var/cache/conftool/dbconfig/20240917-124022-arnaudb.json [12:44:53] (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db2165 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1073448 (https://phabricator.wikimedia.org/T374946) (owner: 10Gerrit maintenance bot) [12:46:06] !log Starting s8 codfw failover from db2161 to db2165 - T374946 [12:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:11] T374946: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T374946 [12:46:30] (03CR) 10Elukey: [C:03+2] Set puppet config for registry2005 [puppet] - 10https://gerrit.wikimedia.org/r/1073429 (https://phabricator.wikimedia.org/T374928) (owner: 10Elukey) [12:46:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2165 to s8 primary T374946', diff saved to https://phabricator.wikimedia.org/P69213 and previous config saved to /var/cache/conftool/dbconfig/20240917-124638-arnaudb.json [12:47:26] !log elukey@cumin1002 START - Cookbook sre.ganeti.makevm for new host registry2005.codfw.wmnet [12:47:28] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [12:49:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'codfw/db2161 weight', diff saved to https://phabricator.wikimedia.org/P69214 and previous config saved to /var/cache/conftool/dbconfig/20240917-124927-arnaudb.json [12:49:57] (03PS1) 10Elukey: README: add dots to trigger a change [labs/private] - 10https://gerrit.wikimedia.org/r/1073449 (https://phabricator.wikimedia.org/T374443) [12:50:21] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2161.codfw.wmnet with reason: Maintenance [12:50:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2161.codfw.wmnet with reason: Maintenance [12:50:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2161 (T367781)', diff saved to 
https://phabricator.wikimedia.org/P69215 and previous config saved to /var/cache/conftool/dbconfig/20240917-125032-arnaudb.json [12:50:40] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [12:50:42] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM registry2005.codfw.wmnet - elukey@cumin1002" [12:50:47] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM registry2005.codfw.wmnet - elukey@cumin1002" [12:50:47] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:50:47] !log elukey@cumin1002 START - Cookbook sre.dns.wipe-cache registry2005.codfw.wmnet on all recursors [12:50:50] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) registry2005.codfw.wmnet on all recursors [12:51:11] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move puppet-merge (bash script) to puppetserver1001 - https://phabricator.wikimedia.org/T374443#10152985 (10MoritzMuehlenhoff) > A ton of puppet code seems to assume the presence of puppet_ca_server, that currently points to... 
[12:51:17] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM registry2005.codfw.wmnet - elukey@cumin1002" [12:51:22] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM registry2005.codfw.wmnet - elukey@cumin1002" [12:51:56] (03CR) 10Elukey: [V:03+2 C:03+2] README: add dots to trigger a change [labs/private] - 10https://gerrit.wikimedia.org/r/1073449 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [12:52:35] (03CR) 10Klausman: [C:03+1] ml-services: increase cpu and memory for ref-quality isvc in exp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073400 (owner: 10AikoChou) [12:52:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T367781)', diff saved to https://phabricator.wikimedia.org/P69216 and previous config saved to /var/cache/conftool/dbconfig/20240917-125242-arnaudb.json [12:54:26] 06SRE, 06Infrastructure-Foundations, 10netops, 10probenet, 06Traffic: improve GeoDNS-to-edge mapping - https://phabricator.wikimedia.org/T316160#10152990 (10CDanis) We did a somewhat experimental version of this work as @JameelKaisar's intern project in {T332024} and friends. The infrastructure pieces h... 
[12:54:44] !log jayme@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-codfw [12:54:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [12:55:10] (03CR) 10Xcollazo: [C:03+1] Move the misc_crons dumper role from snapshot1017 to snapshot1016 [puppet] - 10https://gerrit.wikimedia.org/r/1073289 (https://phabricator.wikimedia.org/T366555) (owner: 10Btullis) [12:56:08] !log jayme@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-test-eqiad [12:56:32] 06SRE, 06Infrastructure-Foundations, 10netops, 10probenet, 06Traffic: improve GeoDNS-to-edge mapping - https://phabricator.wikimedia.org/T316160#10153011 (10CDanis) Oh, and one related thing, we should fix T347114 -- I think that's just a VCL change. [12:57:07] !log jayme@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-jumbo-eqiad [12:57:28] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host registry2005.codfw.wmnet with OS bookworm [12:57:56] !log jayme@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-codfw [12:58:58] (03Abandoned) 10Effie Mouzeli: ipoid: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071843 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli) [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T1300). [13:00:05] Daimona: A patch you scheduled for UTC afternoon backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:14] (03CR) 10Hashar: logging: Default to log any error (all wikis) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [13:00:15] o/ [13:00:48] (03CR) 10Hashar: [C:03+2] logging: Default to log any error (all wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [13:00:55] grr [13:00:57] wrong click [13:01:03] I wanted to schedule it for this window [13:01:07] and clicked +2 instead :/ [13:01:34] (03Merged) 10jenkins-bot: logging: Default to log any error (all wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [13:01:39] Lucas_WMDE: I'll self deploy and do Daimona deployment-prep patch after [13:02:02] ok! [13:02:16] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1018637|logging: Default to log any error (all wikis) (T228838)]] [13:02:20] T228838: Consider enabling all MW log channels by default for WMF - https://phabricator.wikimedia.org/T228838 [13:02:24] hashar: last I checked the tool doesn’t let you schedule changes for the running window anyway :/ [13:02:30] so you’d have to do it manually [13:02:30] :D [13:03:27] Hi there! 
[13:04:24] !log hashar@deploy1003 hashar: Backport for [[gerrit:1018637|logging: Default to log any error (all wikis) (T228838)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:04:49] !log hashar@deploy1003 hashar: Continuing with sync [13:05:39] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move puppet-merge (bash script) to puppetserver1001 - https://phabricator.wikimedia.org/T374443#10153037 (10elukey) On the infrastructure side we now have: * envoy+httpd on every `puppetserverXXXX` host, that are able to hos... [13:07:05] Daimona: for the patches solely affecting beta, I am quite happy to +2 them at any time :) [13:07:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P69217 and previous config saved to /var/cache/conftool/dbconfig/20240917-130750-arnaudb.json [13:08:03] I asked yesterday but saw some chaos, so I thought I'd schedule them like normal patches anyway ;) [13:08:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 3m 20s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:09:23] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1018637|logging: Default to log any error (all wikis) (T228838)]] (duration: 07m 06s) [13:09:26] T228838: Consider enabling all MW log channels by default for WMF - https://phabricator.wikimedia.org/T228838 [13:09:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073428 (owner: 
10Seanleong-wmde) [13:11:32] (03CR) 10Hashar: [C:03+2] beta: Enable CampaignEvents Community List [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073211 (https://phabricator.wikimedia.org/T374617) (owner: 10Daimona Eaytoy) [13:11:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:11:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073211 (https://phabricator.wikimedia.org/T374617) (owner: 10Daimona Eaytoy) [13:12:13] (03Merged) 10jenkins-bot: beta: Enable CampaignEvents Community List [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073211 (https://phabricator.wikimedia.org/T374617) (owner: 10Daimona Eaytoy) [13:12:22] (03PS1) 10Muehlenhoff: Point puppet_merge_server to puppetserver1001 [puppet] - 10https://gerrit.wikimedia.org/r/1073451 (https://phabricator.wikimedia.org/T374443) [13:13:07] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on registry2005.codfw.wmnet with reason: host reimage [13:13:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 3m 20s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:13:27] ah yes wikifunctions [13:14:02] !log jayme@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-codfw [13:14:10] Daimona: your 
config change is being pulled on beta by Jenkins: https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/513562/console [13:14:10] (03PS3) 10FNegri: wikireplica_dns: remove toolsdb and redis records [puppet] - 10https://gerrit.wikimedia.org/r/1034052 (https://phabricator.wikimedia.org/T374953) [13:14:15] which will then run scap to deploy it [13:14:34] (03CR) 10FNegri: "Yes, I will do that as part of T374953" [puppet] - 10https://gerrit.wikimedia.org/r/1034052 (https://phabricator.wikimedia.org/T374953) (owner: 10FNegri) [13:15:10] Looking [13:15:11] I am waiting for the spike of `objectcache` and `memcached` errors to complete [13:15:38] then will check whether enabling error logging on all wikis causes any issue [13:15:45] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move puppet-merge (bash script) to puppetserver1001 - https://phabricator.wikimedia.org/T374443#10153082 (10MoritzMuehlenhoff) This looks good to me. The specific migration would look like: - Merge https://gerrit.wikimedia.or... [13:16:10] (03PS2) 10JHathaway: mydumper: rename metaparam [puppet] - 10https://gerrit.wikimedia.org/r/1073292 [13:16:45] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on registry2005.codfw.wmnet with reason: host reimage [13:17:05] (03CR) 10JHathaway: "Sounds good, @dcaro@wikimedia.org does this look reasonable?" [puppet] - 10https://gerrit.wikimedia.org/r/1073292 (owner: 10JHathaway) [13:17:16] !log jayme@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-eqiad [13:18:25] (03CR) 10David Caro: "Sounds good to me :), do you have a passing pcc?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1073292 (owner: 10JHathaway) [13:18:34] logs look good [13:18:34] !log UTC afternoon backport window completed [13:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:26] beta LGTM (@HouseOfM) [13:19:50] (03PS1) 10BBlack: NetworkProbeLimit Cookie: avoid nop re-set-cookie [puppet] - 10https://gerrit.wikimedia.org/r/1073453 (https://phabricator.wikimedia.org/T347114) [13:20:06] Merci hashar! [13:20:11] !log jayme@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-codfw [13:20:29] Daimona: de rien! [13:20:53] !log jayme@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-test-eqiad [13:21:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:21:36] hi hashar, u do deploy for this window again? [13:21:46] Hamishcz: yeah i can [13:21:55] !log jayme@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-eqiad [13:22:21] !log stevemunene@cumin1002 START - Cookbook sre.hosts.rename from kafka-stretch1001 to an-worker1176 [13:22:23] actually I didnt add my patch on wikitech in advance, would u mind if I add one more patch for u? [13:22:38] just come back from my real life stuff [13:22:43] !log stevemunene@cumin1002 START - Cookbook sre.dns.netbox [13:22:47] Hamishcz: that is for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1072876 isn't it? [13:22:54] adding the ContactPage for zhwiki? 
[13:22:55] exactly [13:22:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P69218 and previous config saved to /var/cache/conftool/dbconfig/20240917-132257-arnaudb.json [13:23:36] (03CR) 10CDanis: [C:03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1073453 (https://phabricator.wikimedia.org/T347114) (owner: 10BBlack) [13:24:31] Hamishcz: please add it to the Deployments page and I am deploying it now [13:24:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072876 (https://phabricator.wikimedia.org/T359998) (owner: 10Hamish) [13:24:59] great thx [13:25:40] (03Merged) 10jenkins-bot: Configure ContactPage and IPBE contact form on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072876 (https://phabricator.wikimedia.org/T359998) (owner: 10Hamish) [13:25:59] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1072876|Configure ContactPage and IPBE contact form on zhwiki (T359998)]] [13:26:04] T359998: Configure ContactPage and IPBE contact form on zhwiki - https://phabricator.wikimedia.org/T359998 [13:26:28] it is interesting to see the zh.wikipedia.org banner points to https://www.facebook.com/zhwikipedia and has links to Discord, IRC, LINE, QQ, Telegram [13:26:29] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10153107 (10phaultfinder) [13:26:38] !log stevemunene@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kafka-stretch1001 to an-worker1176 - stevemunene@cumin1002" [13:27:57] lol we have multiple virtual communities [13:28:10] !log hashar@deploy1003 hashar, hamishz: Backport for [[gerrit:1072876|Configure ContactPage and IPBE contact form on zhwiki (T359998)]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [13:29:05] ty [13:29:23] Hamishcz: the patch is on the debug servers [13:29:39] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374642#10153135 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Closing this ticket reopening original looks like possibly drive rebuild was not finished by service owner [13:29:45] ack :) [13:29:51] doing test [13:30:03] I don't see it on https://zh.wikipedia.org/wiki/Special:Contact/ipbe :D [13:30:22] same to me [13:30:23] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kafka-stretch1001 to an-worker1176 - stevemunene@cumin1002" [13:30:23] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:30:24] !log stevemunene@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1176 [13:30:32] You have requested an invalid special page. [13:30:33] huhu [13:30:42] cause I guess the extension needs to be enabled on that wiki? [13:30:49] trying to find whats going on [13:30:56] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374540#10153139 (10Jclark-ctr) 05Resolved→03Open @andrea.denisse Where you able to resync drive using mdadm? dcops is unable to run mdadm commands [13:30:59] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1176 [13:31:14] hashar: is it okay if I run a purgeList maintenance script on mwmaint1002? or should I wait until you’re done? 
[13:31:18] !log hashar@deploy1003 hashar, hamishz: Continuing with sync [13:31:25] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374540#10153144 (10Jclark-ctr) Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] md0 : active raid10 sdb2[0] sde2[4] sdf2[7] sdd2[2] sda2[5] sdh2[1] sdg2[6] 7499796480... [13:31:26] Lucas_WMDE: please proceed! [13:31:33] thank you for the notice [13:31:36] hashar: That would be my guess, if it's not already [13:31:38] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kafka-stretch1001 to an-worker1176 [13:31:38] ah yes I forgot that [13:31:40] Hamishcz: I will send another patch to enable the extension [13:31:45] ok thanks! [13:31:55] !log lucaswerkmeister-wmde@mwmaint1002:~$ for domain in query{,-{main,scholarly}}; do for path in / /index.html /i18n/en.json /{default,custom}-config.json; do printf 'https://%s.wikidata.org%s\n' "$domain" "$path"; done; done | mwscript purgeList enwiki # try to refresh WDQS GUI cache, don’t know if it’ll work [13:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:29] (03PS1) 10Hashar: Enable ContactPage extension on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073454 (https://phabricator.wikimedia.org/T359998) [13:32:35] yes.. 
the wmgUseContactPage [13:32:36] Hamishcz: Reedy ^ :) [13:32:49] quick response lol [13:33:16] I should have seen it when reviewing hehe [13:33:21] I thought I have added....maybe my brain was broken [13:33:32] anyway it's my bad first [13:33:34] no worries [13:33:58] (03PS1) 10DCausse: Revert^2 "cirrus-streaming-updater: test resolve_canonical_bootstrap_servers_only" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073455 [13:34:14] (03PS2) 10DCausse: Revert^2 "cirrus-streaming-updater: test resolve_canonical_bootstrap_servers_only" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073455 [13:34:42] !log stevemunene@cumin1002 START - Cookbook sre.hosts.rename from kafka-stretch1002 to an-worker1177 [13:35:02] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1073292 (owner: 10JHathaway) [13:35:04] !log stevemunene@cumin1002 START - Cookbook sre.dns.netbox [13:35:06] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10153162 (10phaultfinder) [13:35:07] (03PS3) 10DCausse: cirrus-streaming-updater: test use_all_dns_ips [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073455 [13:35:08] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on puppetmaster1003 - https://phabricator.wikimedia.org/T374901#10153159 (10Jclark-ctr) ` Device: /dev/sda ID_SERIAL=SSDSC2KG240G7R_PHYM812600ZH240AGN ID_SERIAL_SHORT=PHYM812600ZH240AGN ID_PATH=pci-0000:00:11.5-ata-3 ID_PATH_TAG=pci-0000_00_11_5-ata-3 Device: /dev/... 
[13:35:19] 06SRE, 06Data-Engineering, 10Data-Services, 06Trust and Safety Product Team, and 3 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10153174 (10Dreamy_Jazz) [13:36:00] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072876|Configure ContactPage and IPBE contact form on zhwiki (T359998)]] (duration: 10m 00s) [13:36:03] T359998: Configure ContactPage and IPBE contact form on zhwiki - https://phabricator.wikimedia.org/T359998 [13:36:24] !log (for the record, refreshing the WDQS GUI cache five minutes ago seems to have worked well enough… that, or the cache just happened to expire around the same time ^^) [13:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:45] (figured it would be nice to have a follow-up for the “don’t know if it’ll work” in the SAL ^^) [13:36:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073454 (https://phabricator.wikimedia.org/T359998) (owner: 10Hashar) [13:37:29] (03Merged) 10jenkins-bot: Enable ContactPage extension on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073454 (https://phabricator.wikimedia.org/T359998) (owner: 10Hashar) [13:37:50] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1073454|Enable ContactPage extension on zhwiki (T359998)]] [13:38:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T367781)', diff saved to https://phabricator.wikimedia.org/P69219 and previous config saved to /var/cache/conftool/dbconfig/20240917-133805-arnaudb.json [13:38:09] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [13:38:32] !log stevemunene@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Renaming kafka-stretch1002 to an-worker1177 - stevemunene@cumin1002" [13:38:59] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kafka-stretch1002 to an-worker1177 - stevemunene@cumin1002" [13:38:59] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:39:00] !log stevemunene@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1177 [13:39:47] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1177 [13:39:59] !log hashar@deploy1003 hashar: Backport for [[gerrit:1073454|Enable ContactPage extension on zhwiki (T359998)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:40:26] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kafka-stretch1002 to an-worker1177 [13:40:44] it's live now [13:40:48] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: test use_all_dns_ips [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073455 (owner: 10DCausse) [13:40:49] Hamishcz: ahh [13:40:49] I've already seen it [13:40:55] there are some messages missing, but that is easy to solve [13:41:05] my guess is they are missing from the extension? [13:41:13] or can be added to the MediaWiki namespace [13:41:15] just MediaWiki:xxxxxd [13:41:19] great [13:41:22] !log hashar@deploy1003 hashar: Continuing with sync [13:41:23] (03CR) 10JHathaway: "now I do, \o/, https://puppet-compiler.wmflabs.org/output/1073292/1869/" [puppet] - 10https://gerrit.wikimedia.org/r/1073292 (owner: 10JHathaway) [13:41:30] I'll tell the community to solve that [13:41:34] and it can be synced now [13:41:42] ok.. 
quick response again [13:41:43] yeah our users are our best asset :] [13:41:46] (03Merged) 10jenkins-bot: cirrus-streaming-updater: test use_all_dns_ips [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073455 (owner: 10DCausse) [13:42:33] and if some extra fields need to be added, that should now be easy (just edit wmf-config/ZhWikiContactPages.php ) [13:42:47] !log jayme@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-eqiad [13:42:59] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [13:43:01] yes I got that [13:43:08] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:43:15] thank you veeeeery much [13:43:28] Hamishcz: thanks to you too! [13:43:42] and sorry we could not deploy it yesterday, everything was way too slow :) [13:45:21] (03PS1) 10DCausse: Revert "cirrus-streaming-updater: test use_all_dns_ips" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073456 [13:45:38] (03CR) 10DCausse: [C:03+2] Revert "cirrus-streaming-updater: test use_all_dns_ips" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073456 (owner: 10DCausse) [13:46:00] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073454|Enable ContactPage extension on zhwiki (T359998)]] (duration: 08m 09s) [13:46:04] its okay, I left earlier than scheduled close time yesterday [13:46:04] T359998: Configure ContactPage and IPBE contact form on zhwiki - https://phabricator.wikimedia.org/T359998 [13:46:13] but it goes perfect noww xddd [13:46:38] (03Merged) 10jenkins-bot: Revert "cirrus-streaming-updater: test use_all_dns_ips" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073456 (owner: 10DCausse) [13:46:51] !log jayme@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-eqiad [13:47:12] !log 
dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [13:48:13] !log UTC afternoon backport window completed (again!) [13:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:06] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on puppetmaster1003 - https://phabricator.wikimedia.org/T374901#10153201 (10Jclark-ctr) a:03Jclark-ctr checking hardware inventory confirmed failed sda is slot 0 @Muehlenhoff would you be good point of contact for rebuilding drive when drive is replaced? `... [13:51:27] !log copy jwt-authorizer from bullseye-wikimedia to bookworm-wikimedia [13:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:40] !log jayme@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-jumbo-eqiad [13:54:14] !log cmooney@cumin1002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2016.codfw.wmnet [13:57:23] !log cmooney@cumin1002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2016.codfw.wmnet [14:02:53] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on puppetmaster1003 - https://phabricator.wikimedia.org/T374901#10153252 (10Jclark-ctr) @VRiley-WMF looks like you replaced drive T373888. with @MoritzMuehlenhoff so idrac hardware log is still listing old s/n ` 2024-09-05 10:48:59 USR0032 The sessi... [14:04:17] (03PS2) 10Hashar: logging: rm per channel 'error' logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073408 (https://phabricator.wikimedia.org/T228838) [14:04:42] (03CR) 10Stevemunene: [V:03+2 C:03+2] Add new an worker keytabs [labs/private] - 10https://gerrit.wikimedia.org/r/1072655 (https://phabricator.wikimedia.org/T353788) (owner: 10Stevemunene) [14:09:37] (03PS1) 10Slyngshede: P:idp: On test host behind the load balancer, avoid exposing port 8080. 
[puppet] - 10https://gerrit.wikimedia.org/r/1073460 [14:10:58] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on puppetmaster1003 - https://phabricator.wikimedia.org/T374901#10153302 (10Jclark-ctr) i believe the UUID be different between drives i know partitions will sometimes be the same UUID I do not have access to blkid command was sgdisk -G completed on new drive?... [14:13:28] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4003/co" [puppet] - 10https://gerrit.wikimedia.org/r/1073460 (owner: 10Slyngshede) [14:15:14] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4004/console" [puppet] - 10https://gerrit.wikimedia.org/r/1073460 (owner: 10Slyngshede) [14:21:32] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on puppetmaster1003 - https://phabricator.wikimedia.org/T374901#10153417 (10Jclark-ctr) a:05Jclark-ctr→03VRiley-WMF [14:23:18] (03PS1) 10Brouberol: cloudnative-pg: grant the deploy user the ability to create manual backups [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073464 [14:23:21] (03PS4) 10Abijeet Patro: Enable message group subscription feature for Test Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072166 (https://phabricator.wikimedia.org/T372386) [14:23:28] (03PS5) 10Abijeet Patro: Enable message group subscription feature for Test Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072166 (https://phabricator.wikimedia.org/T372386) [14:27:24] (03PS2) 10Hashar: Stop flagging a conflict for same file modifications [mediawiki-config] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1073461 [14:29:31] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:30:00 on cloudsw1-c8-eqiad,cloudsw1-c8-eqiad IPv6,cloudsw1-c8-eqiad.mgmt with reason: Reboot cloudsw1-c8-eqiad and upgrade JunOS 
[14:29:38] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on cloudsw1-c8-eqiad,cloudsw1-c8-eqiad IPv6,cloudsw1-c8-eqiad.mgmt with reason: Reboot cloudsw1-c8-eqiad and upgrade JunOS [14:30:01] (03CR) 10Btullis: [C:03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking) [14:36:47] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10153515 (10phaultfinder) [14:37:30] (03CR) 10JHathaway: [C:03+2] k8s::kubelet: fix deprecation warning [puppet] - 10https://gerrit.wikimedia.org/r/1073281 (owner: 10JHathaway) [14:39:03] (03CR) 10Slyngshede: [V:03+1] "Open to better ways of doing this." [puppet] - 10https://gerrit.wikimedia.org/r/1073460 (owner: 10Slyngshede) [14:39:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:41:16] (03PS1) 10Elukey: docker_registy_ha: use python3-swiftclient on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1073465 (https://phabricator.wikimedia.org/T374928) [14:43:30] (03CR) 10CDanis: [C:03+1] docker_registy_ha: use python3-swiftclient on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1073465 (https://phabricator.wikimedia.org/T374928) (owner: 10Elukey) [14:43:41] (03CR) 10Elukey: [C:03+2] docker_registy_ha: use python3-swiftclient on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1073465 (https://phabricator.wikimedia.org/T374928) (owner: 10Elukey) [14:45:02] !log dancy@deploy1003 Started deploy [releng/phatality@84c7283]: (no justification provided) [14:45:20] !log dancy@deploy1003 Finished deploy [releng/phatality@84c7283]: (no justification provided) (duration: 00m 18s) [14:46:56] !log dancy@deploy1003 Started deploy 
[releng/phatality@84c7283]: T374880 [14:47:00] T374880: scap phatality deployment problem - https://phabricator.wikimedia.org/T374880 [14:47:05] !log dancy@deploy1003 Finished deploy [releng/phatality@84c7283]: T374880 (duration: 00m 09s) [14:47:25] (03CR) 10Muehlenhoff: "On poolcounter2003 there still 22 connections to the poolcounter service (ss | grep 7531), most them I could not map to a specific service" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073427 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [14:48:06] (03CR) 10Alexandros Kosiaris: [C:03+1] mw-debug: add initial "next" release (attempt 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072764 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [14:49:41] rzl, herron or other SRE: Can you reenable puppet on logstash1023.eqiad.wmnet ? [14:49:52] dancy: sure doing [14:50:06] ty [14:50:42] (03PS1) 10Muehlenhoff: Also move the apt::pin under the buster conditional [puppet] - 10https://gerrit.wikimedia.org/r/1073467 (https://phabricator.wikimedia.org/T374928) [14:51:08] (03CR) 10Hashar: [C:04-2] "I have raised the topic on our internal `ops` mailing list. Lets wait next week before enabling content merge." [mediawiki-config] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1073461 (owner: 10Hashar) [14:52:36] !log dancy@deploy1003 Started deploy [releng/phatality@84c7283]: T374880 [14:52:37] 10SRE-tools, 06Infrastructure-Foundations: Output test logs of production testing of the pre switchover tasks related to databases - https://phabricator.wikimedia.org/T374972 (10jcrespo) 03NEW [14:52:39] T374880: scap phatality deployment problem - https://phabricator.wikimedia.org/T374880 [14:52:42] !log dancy@deploy1003 Finished deploy [releng/phatality@84c7283]: T374880 (duration: 00m 06s) [14:52:50] herron: Lookin' good. 
[14:53:05] (03PS1) 10Dzahn: add project language 'rsk' (Ruthenian/Pannonian Rusyn) [dns] - 10https://gerrit.wikimedia.org/r/1073468 (https://phabricator.wikimedia.org/T374963) [14:54:14] (03PS2) 10Slyngshede: Audit log for permission requests validation. [software/bitu] - 10https://gerrit.wikimedia.org/r/1071849 [14:54:56] (03CR) 10Ssingh: [C:03+1] add project language 'rsk' (Ruthenian/Pannonian Rusyn) [dns] - 10https://gerrit.wikimedia.org/r/1073468 (https://phabricator.wikimedia.org/T374963) (owner: 10Dzahn) [14:56:34] (03PS1) 10Muehlenhoff: On Bookworm create the system user using systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/1073469 (https://phabricator.wikimedia.org/T374928) [14:56:40] jouncebot: nowandnext [14:56:40] No deployments scheduled for the next 0 hour(s) and 3 minute(s) [14:56:40] In 0 hour(s) and 3 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T1500) [14:57:02] Is anyone currently deploying? [14:57:11] I'd like to deploy a security patch if not [14:58:00] (03PS4) 10Gmodena: ds8-k8s-service: add values for dumps2 job. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787) [14:58:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649942 (https://phabricator.wikimedia.org/T270308) (owner: 10Ejegg) [15:00:05] eoghan, jelto, arnoldokoth, and mutante: SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T1500). Please do the needful. 
[15:00:27] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:00:35] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:01:42] 10SRE-tools, 06Infrastructure-Foundations: Output test logs of production testing of the pre switchover tasks related to databases - https://phabricator.wikimedia.org/T374972#10153634 (10ops-monitoring-bot) cookbooks.sre.switchdc.databases for the switch from eqiad to codfw started by jynus@cumin1002 [15:02:39] 10SRE-tools, 06Infrastructure-Foundations: Output test logs of production testing of the pre switchover tasks related to databases - https://phabricator.wikimedia.org/T374972#10153642 (10ops-monitoring-bot) cookbooks.sre.switchdc.databases for the switch from eqiad to codfw started by jynus@cumin1002 completed... [15:03:02] !log Starting security deploy [15:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:17] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on puppetmaster1003 - https://phabricator.wikimedia.org/T374901#10153649 (10MoritzMuehlenhoff) @Jclark-ctr Hi, indeed the drive was swapped last week. I took puppetmaster1003 out of active service this morning, if we have another disk we can swap in from a decom ho... 
[15:04:13] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:04:21] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:30:00 on cr1-eqiad with reason: reboot cloudsw1-c8-eqiad [15:04:37] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on cr1-eqiad with reason: reboot cloudsw1-c8-eqiad [15:05:01] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:30:00 on 24 hosts with reason: reboot cloudsw1-c8-eqiad [15:05:30] (03CR) 10AikoChou: [C:03+2] ml-services: update resources config for ref-quality isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073444 (owner: 10AikoChou) [15:05:36] (03CR) 10Lucas Werkmeister (WMDE): "gate-and-submit will run against the rebased version of the change, right?" 
[mediawiki-config] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1073461 (owner: 10Hashar) [15:05:42] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on 24 hosts with reason: reboot cloudsw1-c8-eqiad [15:06:30] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:06:31] (03Merged) 10jenkins-bot: ml-services: update resources config for ref-quality isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073444 (owner: 10AikoChou) [15:06:48] (03CR) 10AikoChou: [C:03+2] ml-services: increase cpu and memory for ref-quality isvc in exp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073400 (owner: 10AikoChou) [15:07:48] (03Merged) 10jenkins-bot: ml-services: increase cpu and memory for ref-quality isvc in exp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073400 (owner: 10AikoChou) [15:09:05] !log dreamyjazz Deployed security patch for T372998 [15:10:34] (03PS1) 10Elukey: profile::docker_registry_ha::registry: add sslcert::dhparam [puppet] - 10https://gerrit.wikimedia.org/r/1073472 (https://phabricator.wikimedia.org/T374928) [15:11:30] RESOLVED: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:12:20] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4005/co" [puppet] - 10https://gerrit.wikimedia.org/r/1073472 
(https://phabricator.wikimedia.org/T374928) (owner: 10Elukey) [15:15:02] dreamyjazz: Lemme know when you're done please. [15:15:41] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1073472 (https://phabricator.wikimedia.org/T374928) (owner: 10Elukey) [15:15:49] (03CR) 10Elukey: "I was wondering the same, but I didn't have time to check those conns. From the host-overview it seems that the network bw usage dropped a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073427 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [15:16:49] PROBLEM - Docker registry HTTPS interface certificate expiry on registry2005 is CRITICAL: connect to address 10.192.16.7 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Docker [15:17:25] PROBLEM - Docker registry HTTPS interface on registry2005 is CRITICAL: connect to address 10.192.16.7 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Docker [15:17:48] !log dreamyjazz Deployed security patch for T372998 [15:18:01] (03CR) 10Muehlenhoff: [C:03+1] "Nono, the change is fine. In fact I simply forgot to click +1 earlier. 
I mostly mentioned it as something to check before we decom the old" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073427 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [15:18:48] FIRING: PuppetFailure: Puppet has failed on registry2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:18:58] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on registry2005.codfw.wmnet with reason: WIP - working on puppet runs [15:19:12] !log installing postgresql-13 security updates [15:19:14] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on registry2005.codfw.wmnet with reason: WIP - working on puppet runs [15:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:45] !log dancy@deploy1003 Started deploy [releng/jenkins-deploy@d8093b9] (releasing): (no justification provided) [15:20:29] !log dancy@deploy1003 Finished deploy [releng/jenkins-deploy@d8093b9] (releasing): (no justification provided) (duration: 00m 43s) [15:21:01] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=(cp2039|cp2040).codfw.wmnet [reason: depool for T373103] [15:22:52] jouncebot: nowandnext [15:22:53] For the next 0 hour(s) and 37 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T1500) [15:22:53] In 0 hour(s) and 37 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T1600) [15:23:14] if the window is free (and dancy is done) I wouldn’t mind backporting some patches [15:23:41] OK.. 
Lemme upgrade scap and then I'll turn over to you [15:23:43] (but if the SRE office hours need the window that’s also okay ^^) [15:23:48] ok sure [15:23:51] * Lucas_WMDE peeks at stashbot [15:23:57] and wikibugs apparently [15:24:06] !log dancy@deploy1003 Installing scap version "4.102.1" for 211 hosts [15:24:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 1.073s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:25:07] Hi. I should be done. [15:25:11] Lucas_WMDE: Cloud network is borked. Switch reboot apparently affected more than expected. [15:25:18] dancy: [15:25:27] 👍🏾 [15:25:28] bd808: ack, good luck… [15:26:08] * bd808 is just a concerned bystander who is also skilled in #hugops [15:26:22] (Also the ping didn't come through BTW). The script didn't fully work as it attempted to write a file to a non-existing directory. I think I fixed the issue, but it should be resolved once a backport is performed. [15:28:17] !log dancy@deploy1003 Installation of scap version "4.102.1" completed for 211 hosts [15:28:35] Lucas_WMDE: all yours [15:28:40] thanks! 
[15:28:44] wb wikibugs \o/ [15:29:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 1.656s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:30:04] dancy: you might want to re-log that message now that stashbot is back btw [15:30:11] oh no [15:30:13] :/ [15:30:19] I scared it away [15:30:21] hehe [15:30:31] Lucas_WMDE: Could you check that the security patch for T372998 is applied when you perform the backports? [15:30:44] I can try [15:30:46] It should appear in the patches that are applied for both active wiki versions [15:30:50] I should see the patch file somewhere in the output, right? [15:30:50] yeah [15:30:54] Yes. [15:30:56] ^^ [15:31:11] (03CR) 10Scott French: [C:03+1] "LGTM, assuming the swap to poolcounter2005 seems to have gone smoothly :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073427 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [15:31:29] I think it's somewhere near the logo output for scap [15:31:34] that it gets printed [15:31:57] yeah, I have some recollection of seeing it [15:32:09] jouncebot: next [15:32:09] In 0 hour(s) and 27 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T1600) [15:32:24] (03PS1) 10Lucas Werkmeister (WMDE): Update termbox (mul support) [extensions/Wikibase] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073478 (https://phabricator.wikimedia.org/T373088) [15:32:29] It's just me for the puppet window AFAIK [15:32:37] (03PS1) 10Lucas Werkmeister (WMDE): Update termbox (mul support) [extensions/Wikibase] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1073479 (https://phabricator.wikimedia.org/T373088) [15:32:51] So happy to wait if 
the backports spill over into the puppet window [15:32:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/Wikibase] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073478 (https://phabricator.wikimedia.org/T373088) (owner: 10Lucas Werkmeister (WMDE)) [15:32:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/Wikibase] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1073479 (https://phabricator.wikimedia.org/T373088) (owner: 10Lucas Werkmeister (WMDE)) [15:32:59] this is gonna take some time to go through CI anyway [15:33:07] so if anyone else wants to deploy, let me know and I’ll Ctrl+C the scap [15:33:19] it’s not a problem if the backport doesn’t happen, it would just be nice to have [15:33:40] stashbot: status [15:33:40] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. [15:33:47] ^ seems to be healthier now btw (cc dancy) [15:33:51] !log dancy@deploy1003 Installation of scap version "4.102.1" completed for 211 hosts [15:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:55] (03CR) 10Alexandros Kosiaris: [C:03+2] mediawiki-image-download: Drop to 5% [puppet] - 10https://gerrit.wikimedia.org/r/1070550 (https://phabricator.wikimedia.org/T366778) (owner: 10Alexandros Kosiaris) [15:33:56] yay [15:33:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [15:34:23] hm, two RemoteDisconnected in scap while waiting for gerrit so far [15:35:19] I've seen them often while waiting for backports with long gate-and-submit-wmf times [15:39:44] It would be a good day once we have parallel PHPUnit integration tests :D [15:40:19] (03CR) 10JMeybohm: [C:03+1] profile::docker_registry_ha::registry: add sslcert::dhparam [puppet] - 10https://gerrit.wikimedia.org/r/1073472 (https://phabricator.wikimedia.org/T374928) (owner: 10Elukey) [15:40:22] we 
have them in Wikibase, actually [15:40:24] <_Gerges> Is anyone willing to merge a patch throttle? The event will start in half an hour? [15:40:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [15:40:47] you can see that some of the quibble-vendor-mysql-*-noselenium builds in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1073245 took less than 10 minutes [15:41:42] :D [15:43:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'depool db2213 db2214 es2023 pc2016 db2209 - T373103', diff saved to https://phabricator.wikimedia.org/P69221 and previous config saved to /var/cache/conftool/dbconfig/20240917-154355-arnaudb.json [15:43:59] T373103: Migrate servers in codfw racks D3 & D4 from asw to lsw - https://phabricator.wikimedia.org/T373103 [15:44:04] !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [15:44:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:45:00 on 6 hosts with reason: network maintenance T373103 [15:44:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:45:00 on 6 hosts with reason: network maintenance T373103 [15:44:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073475 (https://phabricator.wikimedia.org/T373468) (owner: 10GergesShamon) [15:45:12] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks D3 & D4 from asw to lsw - https://phabricator.wikimedia.org/T373103#10153864 (10ABran-WMF) d/p hosts are depooled [15:45:51] Lucas_WMDE: How do you feel about deploying https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1073475 first? 
[15:46:27] although based on the date range in the change, it doesn't appear to be urgent. [15:47:04] <_Gerges> There is an urgent event in a quarter of an hour. [15:47:07] yeah, I don’t know if this is the change _Gerges mentioned earlier… [15:47:19] !log aikochou@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [15:47:57] <_Gerges> This is not a patch, I did not publish it. There is a quarter of an hour left until the event begins. [15:48:09] (03CR) 10Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [15:48:19] _Gerges: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1073475/1/wmf-config/throttle.php mentions 2024-09-24, so it doesn't like it would help you today. [15:48:27] *doesn't seem like [15:49:07] _Gerges: is there at least a Phabricator task? [15:49:28] https://phabricator.wikimedia.org/T373468 [15:49:29] <_Gerges> I haven't published the patch that will start in a quarter of an hour, I want to make sure that someone will work on the deployed patch [15:49:36] ah, gotcha [15:49:40] Yes ,we can deploy it. [15:50:00] !log aikochou@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . [15:50:05] <_Gerges> Ok I'll post a patch to this now. [15:50:11] Looks like an extra step is required to handle a last-minute throttling exception: [15:50:22] https://www.irccloud.com/pastebin/YB6zrl6j/ [15:50:45] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2008.codfw.wmnet [15:52:14] Will that config deployment be done for the puppet window? [15:52:24] *by the puppet window? 
[15:52:39] (03PS1) 10GergesShamon: Lift IP cap on this dates 17/09 for edit-a-thon for eswiki, commons and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073483 [15:53:31] ugh, at least one of my gate-and-submit builds already failed [15:53:37] :( [15:53:46] I had that happen today too. There's some flakiness going on. [15:54:13] (03CR) 10CI reject: [V:04-1] Update termbox (mul support) [extensions/Wikibase] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1073479 (https://phabricator.wikimedia.org/T373088) (owner: 10Lucas Werkmeister (WMDE)) [15:54:32] _Gerges: please create a Phabricator task too [15:54:44] <_Gerges> This patch, if anyone can merge it. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1073483 [15:54:51] I don’t want to deploy a throttle exception without a task [15:55:22] (03CR) 10CI reject: [V:04-1] Update termbox (mul support) [extensions/Wikibase] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073478 (https://phabricator.wikimedia.org/T373088) (owner: 10Lucas Werkmeister (WMDE)) [15:55:22] (03PS2) 10GergesShamon: Lift IP cap on this dates 17/09 for edit-a-thon for eswiki, commons and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073483 (https://phabricator.wikimedia.org/T373468) [15:55:33] <_Gerges> T373468 [15:55:33] T373468: Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T373468 [15:55:37] _Gerges: Does your patch need to be based on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1073475/1 [15:56:16] (03CR) 10Lucas Werkmeister (WMDE): Lift IP cap on this dates 17/09 for edit-a-thon for eswiki, commons and wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073483 (https://phabricator.wikimedia.org/T373468) (owner: 10GergesShamon) [15:56:30] <_Gerges> You can marge the two together. 
[15:56:44] jouncebot: nowandnext [15:56:44] For the next 0 hour(s) and 3 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T1500) [15:56:44] In 0 hour(s) and 3 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T1600) [15:57:08] Is someone going to be around for the puppet request window? [15:57:35] my scap backport died, btw, so I’m not blocking the Puppet window [15:57:39] (I’m also not a Puppet deployer) [15:57:45] Sure. [15:59:33] PROBLEM - Host cloudsw1-c8-eqiad.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:59:59] RECOVERY - Host cloudsw1-c8-eqiad.mgmt is UP: PING WARNING - Packet loss = 50%, RTA = 30.73 ms [16:00:01] PROBLEM - Host cloudsw1-c8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:00:05] jhathaway and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T1600) [16:00:05] Dreamy_Jazz: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:09] \o [16:01:06] Dreamy_Jazz: hi! 
I have a meeting conflict today, I'm hoping jhathaway is around, but I can multitask if necessary :) [16:01:33] I'm in a meeting as well, :(, but also happy to try to multitask with my bird brain [16:01:41] RECOVERY - Host cloudsw1-c8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 35.24 ms [16:04:24] (03CR) 10Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [16:04:32] (03CR) 10GergesShamon: Lift IP cap on this dates 17/09 for edit-a-thon for eswiki, commons and wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073483 (https://phabricator.wikimedia.org/T373468) (owner: 10GergesShamon) [16:04:52] Dreamy_Jazz: okay! I can merge this for the puppet window, but I'm not super familiar with the subject matter -- and it's complex enough I'd like it to have a +1 from someone who is [16:05:00] FIRING: Emergency syslog message: Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [16:05:19] Okay. I am not sure who is familiar with this to be honest. [16:05:19] Dreamy_Jazz: can someone from your team have a look, or do you need help finding a reviewer? [16:05:45] I don't think anyone on my team is familiar with the wiki replica views [16:06:27] !log cmooney@cumin1002 START - Cookbook sre.hosts.remove-downtime for cloudsw1-c8-eqiad,cloudsw1-c8-eqiad IPv6,cloudsw1-c8-eqiad.mgmt,cr1-eqiad [16:06:30] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cloudsw1-c8-eqiad,cloudsw1-c8-eqiad IPv6,cloudsw1-c8-eqiad.mgmt,cr1-eqiad [16:06:30] hm, okay [16:06:33] Also I don't think anyone is free to review it at the moment [16:07:23] So I do need help finding a review [16:07:50] got it -- how urgent is this? 
we might not be able to find a reviewer in this window but we can try and track somebody down [16:08:10] Not super urgent, but it is blocking [16:08:30] just from looking at the file history I would maybe try zabe or taavi [16:08:45] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db2209.codfw.wmnet with reason: move to new switch [16:08:50] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db2209.codfw.wmnet with reason: move to new switch [16:08:55] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Move db2209 uplink from asw-c5-codfw to lsw1-c5-codfw - https://phabricator.wikimedia.org/T374523#10154026 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3098e21e-4c2c-426d-9ca2-5661232be6df) set by cmooney@cumin1002 for 0:30:00 on 1 h... [16:09:03] (03CR) 10Lucas Werkmeister (WMDE): Lift IP cap on this dates 17/09 for edit-a-thon for eswiki, commons and wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073483 (https://phabricator.wikimedia.org/T373468) (owner: 10GergesShamon) [16:09:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067390 (https://phabricator.wikimedia.org/T359957) (owner: 10Gergő Tisza) [16:09:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071712 (owner: 10Bartosz Dziewoński) [16:09:53] (03CR) 10TChin: ds8-k8s-service: add values for dumps2 job. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena) [16:09:53] Amir (ladsgroup) gave a +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1016066 [16:10:00] RESOLVED: Emergency syslog message: Device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [16:10:02] Which was modifying that file [16:10:17] yeah, he's out at a conference this week though [16:10:24] Ah. I see. [16:13:16] Dreamy_Jazz: hrm, try popping into #wikimedia-cloud and see if anyone in there is comfortable reviewing [16:13:17] Asking one of the other engineers on my team to take a look, but they are in a meeting [16:13:24] rzl: Dreamy_Jazz: 301 to dhinus [16:13:41] taavi: thanks! [16:14:35] (03PS2) 10GergesShamon: Lift IP cap on this dates 24/09 for edit-a-thon for eswiki, commons and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073475 (https://phabricator.wikimedia.org/T373468) [16:14:48] (03Abandoned) 10GergesShamon: Lift IP cap on this dates 17/09 for edit-a-thon for eswiki, commons and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073483 (https://phabricator.wikimedia.org/T373468) (owner: 10GergesShamon) [16:14:57] (03CR) 10BBlack: [C:03+1] sre.dns.admin: add guardrails for depool of sites/resources [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 (owner: 10Ssingh) [16:15:16] (03PS3) 10GergesShamon: Lift IP cap on this dates 17/09,24/09 for edit-a-thon for eswiki, commons and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073475 (https://phabricator.wikimedia.org/T373468) [16:15:40] (03CR) 10CI reject: [V:04-1] Lift IP cap on this dates 17/09,24/09 for edit-a-thon for eswiki, commons and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073475 (https://phabricator.wikimedia.org/T373468) (owner: 10GergesShamon) [16:17:18] They are busy 
in that meeting, so can't provide review. [16:18:29] (03CR) 10Hashar: "recheck 00:00:10.714 fatal: unable to look up contint2002.wikimedia.org (port 9418) (Temporary failure in name resolution)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073475 (https://phabricator.wikimedia.org/T373468) (owner: 10GergesShamon) [16:19:24] _Gerges: I will deploy your config change [16:19:43] Dreamy_Jazz: sorry for the speed bump :/ I won't be able to merge it until there's a second pair of eyes from someone familiar with the content, but if you get that review from someone who can't merge it, ping me any time -- don't worry about keeping it within the window [16:20:00] Fair enough. Thanks. [16:20:09] rzl: note that patch has a bit complicated deployment process [16:20:10] <_Gerges> There is an error in the patch, I don't know what it is [16:21:02] taavi: ah, should I let WMCS SRE take care of it then? [16:21:26] _Gerges: yeah that is unrelated [16:21:29] I am dpeloying it [16:21:37] jouncebot: now [16:21:37] For the next 0 hour(s) and 38 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T1600) [16:21:38] <_Gerges> @hashar: Please start quickly [16:22:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073475 (https://phabricator.wikimedia.org/T373468) (owner: 10GergesShamon) [16:22:57] _Gerges: and if further changes are needed ping people here :) [16:23:03] or evenutally in #wikimedia-releng [16:23:07] (03Merged) 10jenkins-bot: Lift IP cap on this dates 17/09,24/09 for edit-a-thon for eswiki, commons and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073475 (https://phabricator.wikimedia.org/T373468) (owner: 10GergesShamon) [16:23:28] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1073475|Lift IP cap on this dates 17/09,24/09 for edit-a-thon for eswiki, commons and wikidata (T373468)]] 
[16:23:32] T373468: Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T373468 [16:23:43] those changes are easy to review/deploy [16:24:52] (03PS1) 10Lucas Werkmeister (WMDE): Check that throttling exceptions use valid public IP addresses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) [16:24:54] rzl: probably best if you do [16:25:04] taavi: good tip, thanks [16:25:27] https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Runbooks/Deploy_wiki_replicas_view_change is the post-merge followup I guess [16:25:43] (03CR) 10CI reject: [V:04-1] Check that throttling exceptions use valid public IP addresses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) (owner: 10Lucas Werkmeister (WMDE)) [16:26:10] !log hashar@deploy1003 hashar, gergesshamon: Backport for [[gerrit:1073475|Lift IP cap on this dates 17/09,24/09 for edit-a-thon for eswiki, commons and wikidata (T373468)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:26:18] (03CR) 10Lucas Werkmeister (WMDE): "DAMMIT, `IPUtils` isn’t available during the test. 
I was hoping it would be 😔" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) (owner: 10Lucas Werkmeister (WMDE)) [16:26:19] !log hashar@deploy1003 hashar, gergesshamon: Continuing with sync [16:26:30] Dreamy_Jazz: okay, I'm changing my advice :) do try dhinus, and otherwise try in #wikimedia-cloud, and see if they can take care of you -- if you end up at a dead end, come find me and I'll see if I can help [16:27:36] (03PS2) 10Lucas Werkmeister (WMDE): Check that throttling exceptions use valid public IP addresses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) [16:28:16] (03CR) 10CI reject: [V:04-1] Check that throttling exceptions use valid public IP addresses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) (owner: 10Lucas Werkmeister (WMDE)) [16:28:21] (03CR) 10Hashar: "Add it to `composer.json`?! It is only used for the CI jobs and does not affect production whatsoever." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) (owner: 10Lucas Werkmeister (WMDE)) [16:29:24] (03PS3) 10Lucas Werkmeister (WMDE): Check that throttling exceptions use valid public IP addresses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) [16:30:53] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073475|Lift IP cap on this dates 17/09,24/09 for edit-a-thon for eswiki, commons and wikidata (T373468)]] (duration: 07m 25s) [16:30:57] T373468: Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T373468 [16:31:34] (03CR) 10Lucas Werkmeister (WMDE): "installing it via composer works, yay" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) (owner: 10Lucas Werkmeister (WMDE)) [16:32:36] Lucas_WMDE: https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-test/2926/console [16:32:36] nice [16:32:46] (03CR) 10Lucas Werkmeister (WMDE): [C:04-1] "Something isn’t right here yet – I don’t know why only one of the tests in I0c02d67a3f is failing. 
I’m not sure if I wrote the IP addresse" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) (owner: 10Lucas Werkmeister (WMDE)) [16:33:02] hashar: it’s supposed to be more than one error though ^^ [16:33:11] but I think I’m done for today and will look at that tomorrow [16:33:19] cause the assertTrue / assertFalse abort the test immediately [16:33:22] so nothing else is processed [16:33:38] but it’s a parametrized test [16:33:46] the other data sets should still run and report errors, shouldn’t they [16:34:10] as they did in patchset 3: https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-test/2925/console [16:34:51] testThrottlingExceptionsIPsValidAndNotPrivate with data set #6 [16:35:02] that is set number SIX [16:35:14] so yeah the other ran/passed [16:37:01] (03PS2) 10Jgreen: Add payments-a-codfw.wikimedia.org 208.80.152.227 A/PTR records. [dns] - 10https://gerrit.wikimedia.org/r/1073213 (https://phabricator.wikimedia.org/T373942) [16:38:10] Lucas_WMDE: yeah that will remain a mystery for tonight, I gotta prepare dinner [16:38:16] but maybe I can dig into it tomorrow [16:38:16] :) [16:38:23] I don't know what is happening [16:40:50] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Move db2209 uplink from asw-c5-codfw to lsw1-c5-codfw - https://phabricator.wikimedia.org/T374523#10154180 (10cmooney) 05Open→03Resolved a:03cmooney Move completed today without issue. [16:40:58] (03CR) 10Jgreen: [C:03+2] Add payments-a-codfw.wikimedia.org 208.80.152.227 A/PTR records. 
[dns] - 10https://gerrit.wikimedia.org/r/1073213 (https://phabricator.wikimedia.org/T373942) (owner: 10Jgreen) [16:41:52] Lucas_WMDE: hint: composer run -- phpunit --testdox --filter testThrottlingExceptionsIPsValidAndNotPrivate tests/ThrottleTest.php [16:42:01] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2022.codfw.wmnet [16:42:12] that shows the 8 tests, 7 of them are passing [16:42:34] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2022.codfw.wmnet [16:42:44] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2046.codfw.wmnet [16:42:55] anyway see you tomorrow [16:43:20] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2046.codfw.wmnet [16:43:31] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2047.codfw.wmnet [16:43:34] (03CR) 10Jcrespo: "small bug" [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [16:44:07] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2047.codfw.wmnet [16:44:17] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2056.codfw.wmnet [16:44:54] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2056.codfw.wmnet [16:45:04] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2366.codfw.wmnet [16:45:37] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2366.codfw.wmnet [16:45:47] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2367.codfw.wmnet [16:45:55] !log swfrench@cumin2002 conftool action : set/pooled=no; selector: 
dc=codfw,name=mw2278.codfw.wmnet [16:46:19] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2367.codfw.wmnet [16:46:29] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2368.codfw.wmnet [16:46:58] !log swfrench@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=mw2279.codfw.wmnet [16:47:02] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2368.codfw.wmnet [16:47:12] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2369.codfw.wmnet [16:47:55] (03PS10) 10Volans: sre.switchdc.databases: new cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) [16:48:11] (03CR) 10Volans: "Addressed comment, also added a log message for the 10s sleep" [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [16:48:41] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 24 hosts with reason: reboot cloudsw1-c8-eqiad [16:48:42] rzl: Dreamy_Jazz: catching up with backscroll, I can probably help but not until tomorrow [16:48:58] is the patch this one? 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/1073430 [16:49:22] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 24 hosts with reason: reboot cloudsw1-c8-eqiad [16:50:22] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2369.codfw.wmnet [16:50:32] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2370.codfw.wmnet [16:51:05] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2370.codfw.wmnet [16:51:15] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2371.codfw.wmnet [16:51:48] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2371.codfw.wmnet [16:51:58] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2372.codfw.wmnet [16:52:31] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2372.codfw.wmnet [16:52:41] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2373.codfw.wmnet [16:53:17] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2373.codfw.wmnet [16:53:25] (03PS5) 10Gmodena: ds8-k8s-service: add values for dumps2 job. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787) [16:53:28] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2374.codfw.wmnet [16:54:04] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2374.codfw.wmnet [16:54:15] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2375.codfw.wmnet [16:54:47] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2375.codfw.wmnet [16:54:54] (03CR) 10Gmodena: ds8-k8s-service: add values for dumps2 job. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena) [16:54:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:54:58] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2376.codfw.wmnet [16:55:06] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10154232 (10phaultfinder) [16:55:12] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops, 06Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10154233 (10RobH) I'm now looking into these. Overall just these specific servers report heat issues while they are weighted the same as other cp hosts within the same fle... [16:55:12] (03CR) 10Gmodena: ds8-k8s-service: add values for dumps2 job. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena) [16:57:08] (03CR) 10Bking: [C:03+2] statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking) [16:57:14] (03PS11) 10Volans: sre.switchdc.databases: new cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) [16:58:07] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2376.codfw.wmnet [16:58:27] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986 (10RobH) 03NEW [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T1700) [17:01:31] 06SRE, 10Cloud-VPS, 06Data-Engineering, 10Data-Services, and 4 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10154270 (10fnegri) [17:02:00] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544#10154255 (10cmooney) 05Open→03Resolved Upgrade was successful today on cloudsw1-c8-codfw, the last of these... [17:02:15] 06SRE, 06Data-Engineering, 10Data-Services, 06Trust and Safety Product Team, and 3 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10154274 (10fnegri) [17:02:37] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks D3 & D4 from asw to lsw - https://phabricator.wikimedia.org/T373103#10154276 (10cmooney) All hosts moved successfully and responding to ping again. 
[17:05:31] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2022.codfw.wmnet [17:05:34] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2022.codfw.wmnet [17:05:44] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2046.codfw.wmnet [17:05:46] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2046.codfw.wmnet [17:05:57] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2047.codfw.wmnet [17:05:59] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2047.codfw.wmnet [17:06:03] !log swfrench@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=mw2278.codfw.wmnet [17:06:09] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2056.codfw.wmnet [17:06:11] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2056.codfw.wmnet [17:06:13] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2008.codfw.wmnet [17:06:22] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2366.codfw.wmnet [17:06:24] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2366.codfw.wmnet [17:06:35] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2367.codfw.wmnet [17:06:37] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2367.codfw.wmnet [17:06:38] !log swfrench@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=mw2279.codfw.wmnet [17:06:47] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2368.codfw.wmnet [17:06:49] !log swfrench@cumin2002 END (PASS) - Cookbook 
sre.k8s.pool-depool-node (exit_code=0) pool for host mw2368.codfw.wmnet [17:07:00] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2369.codfw.wmnet [17:07:02] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2369.codfw.wmnet [17:07:12] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2370.codfw.wmnet [17:07:15] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2370.codfw.wmnet [17:07:25] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2371.codfw.wmnet [17:07:27] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2371.codfw.wmnet [17:07:38] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2372.codfw.wmnet [17:07:40] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2372.codfw.wmnet [17:07:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 25%: T373103', diff saved to https://phabricator.wikimedia.org/P69222 and previous config saved to /var/cache/conftool/dbconfig/20240917-170745-arnaudb.json [17:07:49] T373103: Migrate servers in codfw racks D3 & D4 from asw to lsw - https://phabricator.wikimedia.org/T373103 [17:07:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 25%: T373103', diff saved to https://phabricator.wikimedia.org/P69223 and previous config saved to /var/cache/conftool/dbconfig/20240917-170749-arnaudb.json [17:07:50] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2373.codfw.wmnet [17:07:53] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2373.codfw.wmnet [17:07:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2023 (re)pooling @ 25%: T373103', diff saved to 
https://phabricator.wikimedia.org/P69224 and previous config saved to /var/cache/conftool/dbconfig/20240917-170755-arnaudb.json [17:08:03] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2374.codfw.wmnet [17:08:05] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2374.codfw.wmnet [17:08:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 25%: T373103', diff saved to https://phabricator.wikimedia.org/P69225 and previous config saved to /var/cache/conftool/dbconfig/20240917-170805-arnaudb.json [17:08:15] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2375.codfw.wmnet [17:08:18] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2375.codfw.wmnet [17:08:28] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2376.codfw.wmnet [17:08:30] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2376.codfw.wmnet [17:09:43] (03PS1) 10AOkoth: wmnet: change ticket to vrts1003 [dns] - 10https://gerrit.wikimedia.org/r/1073490 (https://phabricator.wikimedia.org/T373420) [17:10:05] (03CR) 10CI reject: [V:04-1] sre.switchdc.databases: new cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [17:10:33] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=(cp2039|cp2040).codfw.wmnet [reason: [maint done] depool for T373103] [17:10:35] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks D3 & D4 from asw to lsw - https://phabricator.wikimedia.org/T373103#10154320 (10ABran-WMF) hosts are repooling [17:13:39] (03PS12) 10Volans: sre.switchdc.databases: new cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) [17:15:21] 
jouncebot: now [17:15:21] For the next 0 hour(s) and 44 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T1700) [17:16:25] looks like an empty window. dancy: decent time to deploy the new scap release? [17:16:50] Yep! [17:17:04] okie dokie [17:18:20] !log dduvall@deploy1003 Installing scap version "4.103.0" for 211 hosts [17:19:27] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10154357 (10RobH) [17:20:31] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops, 06Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10154362 (10RobH) 05Open→03Stalled Stalling parent task while working on fixing the esams hosts (esams is easier to get parts in and out than magru, so esams is better... [17:21:36] (03CR) 10David Caro: [C:03+1] mydumper: rename metaparam [puppet] - 10https://gerrit.wikimedia.org/r/1073292 (owner: 10JHathaway) [17:22:30] !log dduvall@deploy1003 Installation of scap version "4.103.0" completed for 211 hosts [17:22:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 50%: T373103', diff saved to https://phabricator.wikimedia.org/P69226 and previous config saved to /var/cache/conftool/dbconfig/20240917-172250-arnaudb.json [17:22:55] T373103: Migrate servers in codfw racks D3 & D4 from asw to lsw - https://phabricator.wikimedia.org/T373103 [17:22:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 50%: T373103', diff saved to https://phabricator.wikimedia.org/P69227 and previous config saved to /var/cache/conftool/dbconfig/20240917-172255-arnaudb.json [17:23:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2023 (re)pooling @ 50%: T373103', diff saved to https://phabricator.wikimedia.org/P69228 and previous config saved to /var/cache/conftool/dbconfig/20240917-172300-arnaudb.json [17:23:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 50%: 
T373103', diff saved to https://phabricator.wikimedia.org/P69229 and previous config saved to /var/cache/conftool/dbconfig/20240917-172310-arnaudb.json [17:25:23] (03CR) 10TChin: ds8-k8s-service: add values for dumps2 job. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena) [17:26:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: (2) new singlemode fiber patches from dmarc to routers for IX ports - https://phabricator.wikimedia.org/T373376#10154382 (10cmooney) 05Open→03Resolved Thankfully got a call from a really good Equinix engineer today who was ab... [17:28:55] (03CR) 10Elukey: [V:03+1 C:03+2] profile::docker_registry_ha::registry: add sslcert::dhparam [puppet] - 10https://gerrit.wikimedia.org/r/1073472 (https://phabricator.wikimedia.org/T374928) (owner: 10Elukey) [17:32:54] RECOVERY - Docker registry HTTPS interface certificate expiry on registry2005 is OK: OK - Certificate docker-registry.discovery.wmnet will expire on Tue 15 Oct 2024 01:20:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Docker [17:33:26] RECOVERY - Docker registry HTTPS interface on registry2005 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.084 second response time https://wikitech.wikimedia.org/wiki/Docker [17:33:40] new vm --^ [17:33:44] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1073412 (owner: 10Filippo Giunchedi) [17:33:56] RECOVERY - MD RAID on aqs1014 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:37:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 75%: T373103', diff saved to https://phabricator.wikimedia.org/P69230 and previous config saved to /var/cache/conftool/dbconfig/20240917-173756-arnaudb.json [17:38:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 75%: T373103', diff saved to https://phabricator.wikimedia.org/P69231 and previous config saved to /var/cache/conftool/dbconfig/20240917-173801-arnaudb.json [17:38:02] T373103: Migrate servers in codfw racks D3 & D4 from asw to lsw - https://phabricator.wikimedia.org/T373103 [17:38:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2023 (re)pooling @ 75%: T373103', diff saved to https://phabricator.wikimedia.org/P69232 and previous config saved to /var/cache/conftool/dbconfig/20240917-173806-arnaudb.json [17:38:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 75%: T373103', diff saved to https://phabricator.wikimedia.org/P69233 and previous config saved to /var/cache/conftool/dbconfig/20240917-173816-arnaudb.json [17:39:02] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host registry2005.codfw.wmnet with OS bookworm [17:39:02] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host registry2005.codfw.wmnet [17:40:27] (03CR) 10Ssingh: [C:03+1] wmnet: change ticket to vrts1003 [dns] - 10https://gerrit.wikimedia.org/r/1073490 (https://phabricator.wikimedia.org/T373420) (owner: 10AOkoth) [17:41:05] (03CR) 10Elukey: [C:04-1] "Thanks all for the reviews! 
I just noticed in the task's description:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073427 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [17:43:22] (03CR) 10Dzahn: [C:03+2] add project language 'rsk' (Ruthenian/Pannonian Rusyn) [dns] - 10https://gerrit.wikimedia.org/r/1073468 (https://phabricator.wikimedia.org/T374963) (owner: 10Dzahn) [17:43:26] (03PS2) 10Dzahn: add project language 'rsk' (Ruthenian/Pannonian Rusyn) [dns] - 10https://gerrit.wikimedia.org/r/1073468 (https://phabricator.wikimedia.org/T374963) [17:43:26] !log dduvall@deploy1003 Started deploy [releng/jenkins-deploy@d93f2c7] (releasing): Deploying https://gitlab.wikimedia.org/repos/releng/jenkins-deploy/-/merge_requests/83 [17:43:44] (03CR) 10Scott French: "Ah, I suspect that may answer my question re: safety of transient excursions over concurrency limits. Thanks for flagging!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073427 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [17:44:09] !log dduvall@deploy1003 Finished deploy [releng/jenkins-deploy@d93f2c7] (releasing): Deploying https://gitlab.wikimedia.org/repos/releng/jenkins-deploy/-/merge_requests/83 (duration: 01m 13s) [17:44:34] (03CR) 10Elukey: [C:04-1] "I assume that Alex meant "don't reimage more than one node at the same time", so a deployment should be good since pods are re-created inc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073427 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [17:46:32] (03CR) 10Dzahn: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1073468 (https://phabricator.wikimedia.org/T374963) (owner: 10Dzahn) [17:46:45] (03CR) 10JHathaway: [C:03+2] mydumper: rename metaparam [puppet] - 10https://gerrit.wikimedia.org/r/1073292 (owner: 10JHathaway) [17:48:37] (03CR) 10JHathaway: [C:03+2] mcrouter: remove puppet alert [puppet] - 10https://gerrit.wikimedia.org/r/1073261 (owner: 10JHathaway) [17:53:02] !log arnaudb@cumin1002 dbctl commit 
(dc=all): 'db2213 (re)pooling @ 100%: T373103', diff saved to https://phabricator.wikimedia.org/P69234 and previous config saved to /var/cache/conftool/dbconfig/20240917-175302-arnaudb.json [17:53:06] T373103: Migrate servers in codfw racks D3 & D4 from asw to lsw - https://phabricator.wikimedia.org/T373103 [17:53:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 100%: T373103', diff saved to https://phabricator.wikimedia.org/P69235 and previous config saved to /var/cache/conftool/dbconfig/20240917-175306-arnaudb.json [17:53:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2023 (re)pooling @ 100%: T373103', diff saved to https://phabricator.wikimedia.org/P69236 and previous config saved to /var/cache/conftool/dbconfig/20240917-175311-arnaudb.json [17:53:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 100%: T373103', diff saved to https://phabricator.wikimedia.org/P69237 and previous config saved to /var/cache/conftool/dbconfig/20240917-175321-arnaudb.json [17:54:06] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T370754, transfer categories jnl) xfer categories from wdqs2023.codfw.wmnet -> wdqs2024.codfw.wmnet w/ force delete existing files, repooling both afterwards [17:54:10] T370754: Import WDQS subgraphs to production nodes - https://phabricator.wikimedia.org/T370754 [17:55:20] (03PS1) 10Ssingh: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [17:55:20] (03CR) 10Ssingh: "Nice first attempt! 
Comments in-line:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [17:56:41] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on gerrit2003.wikimedia.org with reason: in setup [17:56:55] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on gerrit2003.wikimedia.org with reason: in setup [17:59:22] (03Abandoned) 10Ryan Kemper: wdqs: store graph type in data_loaded file [cookbooks] - 10https://gerrit.wikimedia.org/r/947930 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [18:00:04] jnuche and dduvall: MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T1800). Please do the needful. [18:00:21] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:00:23] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:01:46] ryankemper: inflatador: wdqs-scholarly? 
[18:02:44] was just about to ask if anyone is turning anything up or down ^ [18:02:58] sukhe: ah I see what happened here...ran a data transfer cookbook and neglected to consider it was going to depool the only 2 hosts [18:03:01] the WDQS data transfer is running [18:03:13] It's pooling right now so it should resolve, will force a recheck [18:03:18] (03CR) 10Hashar: "Works on" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) (owner: 10Lucas Werkmeister (WMDE)) [18:03:44] (03PS4) 10Hashar: Check that throttling exceptions use valid public IP addresses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) (owner: 10Lucas Werkmeister (WMDE)) [18:04:01] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:04:03] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:04:24] (03CR) 10CI reject: [V:04-1] Check that throttling exceptions use valid public IP addresses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) (owner: 10Lucas Werkmeister (WMDE)) [18:04:26] alright, back to normal [18:04:45] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T370754, transfer categories jnl) xfer categories from wdqs2023.codfw.wmnet -> wdqs2024.codfw.wmnet w/ force delete existing files, repooling both afterwards [18:04:49] T370754: Import WDQS subgraphs to production nodes - https://phabricator.wikimedia.org/T370754 [18:04:54] in future I will need to choose an eqiad host as source and codfw as dest or vice versa to work around this [18:05:11] ryankemper: thanks! 
[18:05:43] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T370754, transfer categories jnl) xfer categories from wdqs2023.codfw.wmnet -> wdqs1023.eqiad.wmnet w/ force delete existing files, repooling both afterwards [18:05:54] alright, this next one should not trigger an alert [18:06:26] (03PS5) 10Hashar: Check that throttling exceptions use valid public IP addresses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) (owner: 10Lucas Werkmeister (WMDE)) [18:06:49] Lucas_WMDE: found it https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1073487/2..4/tests/ThrottleTest.php [18:07:19] (03CR) 10Hashar: [C:03+1] Check that throttling exceptions use valid public IP addresses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) (owner: 10Lucas Werkmeister (WMDE)) [18:11:23] (03PS2) 10Elukey: Swap poolcounter2004 with poolcounter2006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073427 (https://phabricator.wikimedia.org/T332015) [18:11:23] (03PS1) 10Elukey: Swap poolcounter1004 with poolcounter1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073502 (https://phabricator.wikimedia.org/T332015) [18:11:24] (03PS1) 10Elukey: Swap poolcounter1005 with poolcounter1007 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073503 (https://phabricator.wikimedia.org/T332015) [18:12:13] (03CR) 10Elukey: "Broken down the patch into three steps, let's play it safe :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073427 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [18:14:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures 
[18:16:24] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T370754, transfer categories jnl) xfer categories from wdqs2023.codfw.wmnet -> wdqs1023.eqiad.wmnet w/ force delete existing files, repooling both afterwards [18:16:28] T370754: Import WDQS subgraphs to production nodes - https://phabricator.wikimedia.org/T370754 [18:16:57] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T370754, transfer categories jnl) xfer categories from wdqs2023.codfw.wmnet -> wdqs1024.eqiad.wmnet w/ force delete existing files, repooling both afterwards [18:19:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:27:52] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T370754, transfer categories jnl) xfer categories from wdqs2023.codfw.wmnet -> wdqs1024.eqiad.wmnet w/ force delete existing files, repooling both afterwards [18:27:56] T370754: Import WDQS subgraphs to production nodes - https://phabricator.wikimedia.org/T370754 [18:29:29] (03PS1) 10Dzahn: peopleweb: make home dir sizes in warning message human readable [puppet] - 10https://gerrit.wikimedia.org/r/1073508 (https://phabricator.wikimedia.org/T343364) [18:34:37] 06SRE, 10WMF-General-or-Unknown: Some sites try and fail to serve favicon.ico - https://phabricator.wikimedia.org/T374997 (10matmarex) 03NEW [18:43:55] FIRING: [3x] SystemdUnitFailed: wdqs-categories.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:44:24] (03PS1) 10Ryan Kemper: wdqs categories: remove redundant ping check 
[puppet] - 10https://gerrit.wikimedia.org/r/1073510 (https://phabricator.wikimedia.org/T374916) [18:45:24] (03PS2) 10Bking: wdqs categories: remove redundant ping check [puppet] - 10https://gerrit.wikimedia.org/r/1073510 (https://phabricator.wikimedia.org/T374916) (owner: 10Ryan Kemper) [18:46:17] PROBLEM - Hadoop NodeManager on an-worker1088 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:48:08] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1073510 (https://phabricator.wikimedia.org/T374916) (owner: 10Ryan Kemper) [18:51:20] 10ops-eqiad, 06DC-Ops: Repurposing 2x Decommissioned Servers for Phasing Out Puppet 5 - https://phabricator.wikimedia.org/T375000 (10wiki_willy) 03NEW [18:53:23] PROBLEM - Hadoop DataNode on an-worker1099 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [18:53:23] PROBLEM - Hadoop NodeManager on an-worker1099 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:56:23] RECOVERY - Hadoop DataNode on an-worker1099 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [18:56:23] RECOVERY - Hadoop NodeManager on an-worker1099 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:08:47] 
PROBLEM - Hadoop DataNode on an-worker1098 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [19:09:23] PROBLEM - Hadoop NodeManager on an-worker1098 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:10:32] (03PS3) 10Bartosz Dziewoński: logging: rm per channel 'error' logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073408 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [19:11:06] (03CR) 10Bartosz Dziewoński: [C:03+1] "Removed them from the labs config as well. This looks good to go." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073408 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [19:13:17] RECOVERY - Hadoop NodeManager on an-worker1088 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:15:30] (03CR) 10Dzahn: [C:03+2] peopleweb: make home dir sizes in warning message human readable [puppet] - 10https://gerrit.wikimedia.org/r/1073508 (https://phabricator.wikimedia.org/T343364) (owner: 10Dzahn) [19:19:30] (03CR) 10Dzahn: "I would expect that the role is applied before the active_host is changed. This way everything would have to go perfect on the first run." 
[puppet] - 10https://gerrit.wikimedia.org/r/1073283 (https://phabricator.wikimedia.org/T373420) (owner: 10AOkoth) [19:20:23] RECOVERY - Hadoop NodeManager on an-worker1098 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:20:47] RECOVERY - Hadoop DataNode on an-worker1098 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [19:25:10] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10154851 (10phaultfinder) [19:26:46] (03PS1) 10Scott French: sre.discovery.datacenter: restrict checks to active authdns hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1073524 (https://phabricator.wikimedia.org/T374047) [19:29:13] PROBLEM - SSH on an-worker1078 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:30:12] (03PS1) 10Dzahn: peopleweb: raise home dir size limit, make recipient configurable [puppet] - 10https://gerrit.wikimedia.org/r/1073526 (https://phabricator.wikimedia.org/T343364) [19:30:21] PROBLEM - Hadoop NodeManager on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:32:03] RECOVERY - SSH on an-worker1078 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:34:03] (03CR) 10Scott French: [C:03+1] "Swapping one host at a time is at least empirically safe, since you just did it for 2005 without issue. 
SGTM :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073427 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [19:36:19] RECOVERY - Hadoop NodeManager on an-worker1080 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:36:52] (03PS1) 10Ryan Kemper: wdqs categories: ship lastUpdated metric [puppet] - 10https://gerrit.wikimedia.org/r/1073529 (https://phabricator.wikimedia.org/T374916) [19:38:34] (03PS1) 10JHathaway: tftpboot: squash puppetserver log warning. [puppet] - 10https://gerrit.wikimedia.org/r/1073531 (https://phabricator.wikimedia.org/T374885) [19:39:02] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1073531 (https://phabricator.wikimedia.org/T374885) (owner: 10JHathaway) [19:39:06] (03CR) 10CI reject: [V:04-1] tftpboot: squash puppetserver log warning. [puppet] - 10https://gerrit.wikimedia.org/r/1073531 (https://phabricator.wikimedia.org/T374885) (owner: 10JHathaway) [19:42:43] (03PS1) 10JHathaway: tftpboot: purge old files [puppet] - 10https://gerrit.wikimedia.org/r/1073532 (https://phabricator.wikimedia.org/T374885) [19:42:52] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1073532 (https://phabricator.wikimedia.org/T374885) (owner: 10JHathaway) [19:43:10] (03CR) 10CI reject: [V:04-1] tftpboot: purge old files [puppet] - 10https://gerrit.wikimedia.org/r/1073532 (https://phabricator.wikimedia.org/T374885) (owner: 10JHathaway) [19:44:36] (03PS2) 10JHathaway: tftpboot: purge old files [puppet] - 10https://gerrit.wikimedia.org/r/1073532 (https://phabricator.wikimedia.org/T374885) [19:45:14] (03PS2) 10JHathaway: tftpboot: squash puppetserver log warning. 
[puppet] - 10https://gerrit.wikimedia.org/r/1073531 (https://phabricator.wikimedia.org/T374885) [19:45:48] (03CR) 10CI reject: [V:04-1] tftpboot: squash puppetserver log warning. [puppet] - 10https://gerrit.wikimedia.org/r/1073531 (https://phabricator.wikimedia.org/T374885) (owner: 10JHathaway) [19:46:12] (03PS1) 10Ryan Kemper: wdqs max lag: specify specific port [alerts] - 10https://gerrit.wikimedia.org/r/1073533 (https://phabricator.wikimedia.org/T374916) [19:46:42] (03CR) 10Bking: [C:03+1] wdqs categories: remove redundant ping check [puppet] - 10https://gerrit.wikimedia.org/r/1073510 (https://phabricator.wikimedia.org/T374916) (owner: 10Ryan Kemper) [19:47:39] (03PS2) 10Ryan Kemper: wdqs categories: ship lastUpdated metric [puppet] - 10https://gerrit.wikimedia.org/r/1073529 (https://phabricator.wikimedia.org/T374916) [19:48:50] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1073532 (https://phabricator.wikimedia.org/T374885) (owner: 10JHathaway) [19:49:10] !log pyrra upgraded to 0.7.7-1 [19:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:01] (03PS1) 10Ryan Kemper: wdqs max lag: break up extremely long line [alerts] - 10https://gerrit.wikimedia.org/r/1073534 [19:52:00] 06SRE, 10iPoid-Service: Increase in connection timeouts on ipoid-production - https://phabricator.wikimedia.org/T375006 (10Dreamy_Jazz) 03NEW [19:57:55] (03CR) 10Ryan Kemper: [C:03+2] wdqs categories: remove redundant ping check [puppet] - 10https://gerrit.wikimedia.org/r/1073510 (https://phabricator.wikimedia.org/T374916) (owner: 10Ryan Kemper) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240917T2000). nyaa~ [20:00:04] ejegg, _Gerges, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:01:19] hi [20:02:27] hello, I'm here :) [20:07:17] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:07:21] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:07:26] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:09:51] (03PS1) 10C. Scott Ananian: Deploy Parsoid Read Views to fa/nl/pl/pt/uk wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073541 (https://phabricator.wikimedia.org/T374372) [20:11:14] do we have any deployers around for this window? [20:12:34] I'd run it [20:13:23] stupid false conflicts [20:13:29] (03PS14) 10Ejegg: Assign the API portal to the Wikimedia group for CentralNotice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649942 (https://phabricator.wikimedia.org/T270308) [20:13:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649942 (https://phabricator.wikimedia.org/T270308) (owner: 10Ejegg) [20:14:39] (03Merged) 10jenkins-bot: Assign the API portal to the Wikimedia group for CentralNotice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649942 (https://phabricator.wikimedia.org/T270308) (owner: 10Ejegg) [20:14:49] I already deployed the throttling change https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1073475 [20:14:54] at some point earlier today [20:14:59] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:649942|Assign the API portal to the Wikimedia group for CentralNotice (T270308)]] [20:15:03] T270308: API portal showing Wikipedia CentralNotice banners - https://phabricator.wikimedia.org/T270308 [20:15:04] ejegg: is there something to test? 
[20:15:43] hmm, not until there's a banner live on the main wikis [20:15:47] let me see [20:16:12] I get a purple banner for Wiki Loves Sport 2024 [20:16:19] so at least there is some kind of banner showing up :] [20:16:40] but your change is not on the test servers yet [20:17:05] oho, so I still don't see anything on api portal [20:17:09] !log hashar@deploy1003 ejegg, hashar: Backport for [[gerrit:649942|Assign the API portal to the Wikimedia group for CentralNotice (T270308)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:17:12] hmm, nor the WLS banner [20:17:20] guess the project banner is just logged-in users [20:17:31] logging back in [20:17:42] it is on the debug servers now [20:17:45] ok now I see a different project banner on enwiki [20:18:08] and the same banner on the api portal when logged in [20:18:38] can you remind me how to find the test version of api.wikimedia.org ? [20:19:20] (03CR) 10Subramanya Sastry: [C:03+1] Deploy Parsoid Read Views to fa/nl/pl/pt/uk wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073541 (https://phabricator.wikimedia.org/T374372) (owner: 10C. Scott Ananian) [20:19:54] oh right, it's the request header [20:19:54] ejegg: you need the X-Debug extension [20:19:57] yeah [20:20:13] then if there is not much to test, I will just continue [20:21:41] ok, with the header requesting k8s-mwdebug and a hard reload, I see no more banner on api. [20:21:49] that's all we needed! [20:22:12] oh hi hashar :D thanks for deploying the logging changes earlier [20:22:32] !log hashar@deploy1003 ejegg, hashar: Continuing with sync [20:22:36] ejegg: success! [20:22:40] great! [20:22:48] thanks hashar [20:23:14] MatmaRex: yeah it went fine! There are some additional errors in the `http` logging bucket which come from connection timeouts when interacting with the Swift frontend (POST or GET) [20:23:55] my guess is the jobs are hammering Swift too much, but who knows really. 
I filed a task [20:24:32] (03PS2) 10C. Scott Ananian: Deploy Parsoid Read Views to fa/nl/pl/pt/uk wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073541 (https://phabricator.wikimedia.org/T374372) [20:25:21] ejegg: you are welcome! Thank you for babysitting all those banners! [20:25:42] (03CR) 10Hashar: "Oops. Thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073408 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [20:25:48] (03PS4) 10Bartosz Dziewoński: logging: rm per channel 'error' logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073408 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [20:26:31] MatmaRex: if you can extend, I'd deploy that cleanup change as well [20:26:35] for the logging series [20:26:43] hashar: sure [20:26:45] and after that I guess I will close the task [20:26:52] yep [20:27:06] the fun thing is that the messages marked as "error" are not part of the log triage we are doing [20:27:10] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:649942|Assign the API portal to the Wikimedia group for CentralNotice (T270308)]] (duration: 12m 11s) [20:27:15] T270308: API portal showing Wikipedia CentralNotice banners - https://phabricator.wikimedia.org/T270308 [20:27:18] we only triage messages from the exception and error channels [20:28:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067390 (https://phabricator.wikimedia.org/T359957) (owner: 10Gergő Tisza) [20:28:50] that 3rd party cookie thing, I am tempted to write back to Google telling them: "we can't do the change, it costs too much" [20:28:54] :D [20:29:04] (03PS4) 10Gergő Tisza: Revert "Enter deprecation trial for third-party cookie blocking" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067390 (https://phabricator.wikimedia.org/T359957) [20:29:38] that patch just gets rid of some googley junk 
[20:29:42] (03CR) 10TrainBranchBot: "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067390 (https://phabricator.wikimedia.org/T359957) (owner: 10Gergő Tisza) [20:29:43] after they deprecated their deprecation [20:29:55] so we hard deprecate them? [20:30:17] heh [20:30:27] (03Merged) 10jenkins-bot: Revert "Enter deprecation trial for third-party cookie blocking" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067390 (https://phabricator.wikimedia.org/T359957) (owner: 10Gergő Tisza) [20:30:28] (03CR) 10Bking: [C:03+1] wdqs max lag: specify specific port [alerts] - 10https://gerrit.wikimedia.org/r/1073533 (https://phabricator.wikimedia.org/T374916) (owner: 10Ryan Kemper) [20:30:29] it looks like third party cookies are here to stay, at least in chrome [20:30:41] because it would cost their ad business too much to remove them [20:30:44] (03CR) 10Bking: [V:03+1] wdqs max lag: specify specific port [alerts] - 10https://gerrit.wikimedia.org/r/1073533 (https://phabricator.wikimedia.org/T374916) (owner: 10Ryan Kemper) [20:30:45] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1067390|Revert "Enter deprecation trial for third-party cookie blocking" (T359957)]] [20:30:49] T359957: Enroll in Chrome third-party cookies deprecation trial - https://phabricator.wikimedia.org/T359957 [20:30:53] (03CR) 10Bking: [C:03+1] wdqs max lag: specify specific port [alerts] - 10https://gerrit.wikimedia.org/r/1073533 (https://phabricator.wikimedia.org/T374916) (owner: 10Ryan Kemper) [20:32:10] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: security release 20240917 [20:32:46] !log hashar@deploy1003 tgr, hashar: Backport for [[gerrit:1067390|Revert "Enter deprecation trial for third-party cookie blocking" (T359957)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:32:50] !log hashar@deploy1003 tgr, hashar: Continuing 
with sync [20:33:03] (03PS4) 10Bartosz Dziewoński: Improve $wgFooterIcons override, simplify $wmgWikimediaIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071712 [20:33:56] MatmaRex: for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1071712 [20:34:00] (03CR) 10Scott French: "Thank you both for the discussion. Opened https://phabricator.wikimedia.org/T375014 summarizing what we've talked about here." [cookbooks] - 10https://gerrit.wikimedia.org/r/1072612 (https://phabricator.wikimedia.org/T374047) (owner: 10Scott French) [20:34:11] shouldn't that pass through the web team / T256190 ? [20:34:11] T256190: Update footer image links on all MediaWiki skins to be legible and accessible - https://phabricator.wikimedia.org/T256190 [20:34:20] 10SRE-tools, 10Spicerack: Support listing pooled / active authdns hosts (rather than all) - https://phabricator.wikimedia.org/T375014 (10Scott_French) 03NEW [20:35:54] (03PS1) 10BryanDavis: toolhub: Bump container to 2024-09-17-200155-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073548 (https://phabricator.wikimedia.org/T373976) [20:35:55] hashar: i don't think so, i'm not changing the actual images. i had no idea they were interested in that anyway [20:36:34] :) [20:36:40] hashar: i can ping some more people on the patch if you're unsure though [20:36:44] dzahn@cumin2002 dzahn: The backup on gitlab1004 is complete, ready to proceed with upgrade. 
[20:36:55] na it is ok [20:37:00] i thought it was mostly amir making these changes, and he gave it a +1 [20:37:22] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1067390|Revert "Enter deprecation trial for third-party cookie blocking" (T359957)]] (duration: 06m 37s) [20:37:26] T359957: Enroll in Chrome third-party cookies deprecation trial - https://phabricator.wikimedia.org/T359957 [20:37:33] we kind of lost the +1 votes though [20:37:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071712 (owner: 10Bartosz Dziewoński) [20:38:24] (03Merged) 10jenkins-bot: Improve $wgFooterIcons override, simplify $wmgWikimediaIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071712 (owner: 10Bartosz Dziewoński) [20:38:43] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1071712|Improve $wgFooterIcons override, simplify $wmgWikimediaIcon]] [20:39:34] (03CR) 10BryanDavis: [C:03+2] toolhub: Bump container to 2024-09-17-200155-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073548 (https://phabricator.wikimedia.org/T373976) (owner: 10BryanDavis) [20:40:38] (03Merged) 10jenkins-bot: toolhub: Bump container to 2024-09-17-200155-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073548 (https://phabricator.wikimedia.org/T373976) (owner: 10BryanDavis) [20:40:52] !log hashar@deploy1003 matmarex, hashar: Backport for [[gerrit:1071712|Improve $wgFooterIcons override, simplify $wmgWikimediaIcon]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:41:27] MatmaRex: I can still see icons at the bottom of https://www.mediawiki.org/wiki/MediaWiki :) [20:41:29] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/toolhub: apply [20:42:03] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/toolhub: apply [20:42:15] !log hashar@deploy1003 matmarex, hashar: 
Continuing with sync [20:42:29] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/toolhub: apply [20:42:44] !log dzahn@cumin2002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: security release 20240917 [20:42:45] (03PS5) 10Bartosz Dziewoński: logging: rm per channel 'error' logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073408 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [20:42:47] pff [20:43:10] hashar: yep, and it's using the correct markup now [20:43:24] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [20:44:07] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/toolhub: apply [20:45:01] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [20:46:48] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1071712|Improve $wgFooterIcons override, simplify $wmgWikimediaIcon]] (duration: 08m 05s) [20:47:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073408 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [20:48:02] (03Merged) 10jenkins-bot: logging: rm per channel 'error' logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073408 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [20:48:23] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1073408|logging: rm per channel 'error' logging (T228838)]] [20:48:29] T228838: Consider enabling all MW log channels by default for WMF - https://phabricator.wikimedia.org/T228838 [20:48:37] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release 20240917 [20:50:28] !log hashar@deploy1003 hashar: Backport for [[gerrit:1073408|logging: rm per channel 'error' logging (T228838)]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [20:52:15] !log hashar@deploy1003 hashar: Continuing with sync [20:53:12] MatmaRex: I think that concludes the five years adventure to have error level logs logged by default `\o/` [20:53:28] dzahn@cumin2002 dzahn: The backup on gitlab1003 is complete, ready to proceed with upgrade. [20:54:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [20:55:07] hashar: :D [20:55:57] last time I gave it a try I was hitting a wall with the different layers of configuration [20:56:05] and the mixed up / confusion in settings/values [20:56:20] anyway [20:56:46] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073408|logging: rm per channel 'error' logging (T228838)]] (duration: 08m 22s) [20:56:49] T228838: Consider enabling all MW log channels by default for WMF - https://phabricator.wikimedia.org/T228838 [20:57:34] I am happy to finally mark it resolved! [21:01:28] (03PS2) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [21:07:28] !log dzahn@cumin2002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: security release 20240917 [21:10:54] Hey all - I’d like to try to deploy a sec config patch for https://phabricator.wikimedia.org/T374438 now. Any objections? 
[21:12:49] (03PS2) 10Bking: wdqs max lag: target specific port [alerts] - 10https://gerrit.wikimedia.org/r/1073533 (https://phabricator.wikimedia.org/T374916) (owner: 10Ryan Kemper) [21:13:32] (03CR) 10CI reject: [V:04-1] sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [21:18:06] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: security release 20240917 [21:18:47] gitlab production needs an upgrade. expect a short downtime soonish. [21:21:44] !log UTC late backport window is completed [21:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:08] !log Deployed mitigation for T374438 [21:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:01] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - No response from remote host 195.200.68.129 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:27:22] PROBLEM - Host cr1-magru is DOWN: PING CRITICAL - Packet loss = 100% [21:27:23] PROBLEM - Host cr2-magru is DOWN: PING CRITICAL - Packet loss = 100% [21:27:38] RECOVERY - Host cr1-magru is UP: PING OK - Packet loss = 0%, RTA = 137.12 ms [21:27:47] !incidents [21:27:49] 5177 (UNACKED) Host cr1-magru - PING - Packet loss = 100% [21:27:49] 5178 (UNACKED) Host cr2-magru - PING - Packet loss = 100% [21:27:50] RECOVERY - Host cr2-magru is UP: PING OK - Packet loss = 0%, RTA = 138.24 ms [21:28:16] !incidents [21:28:17] 5177 (ACKED) Host cr1-magru - PING - Packet loss = 100% [21:28:17] 5178 (ACKED) Host cr2-magru - PING - Packet loss = 100% [21:30:07] there was a brief monitoring(?) 
flap last week in magru, this may be the same [21:30:48] T374401 is what I had in mind [21:30:48] T374401: Transient DOWN alert on cr2-magru - https://phabricator.wikimedia.org/T374401 [21:30:52] routers are up [21:30:55] yeah [21:31:04] not cool that it happened again :( [21:31:12] glad you knew about the previous one [21:31:14] let's check to make sure if it was an actual flap otherwise we should depool magru [21:31:25] BGP status is just WARN now [21:31:29] not CRIT [21:31:44] let me hop over the librenms and check session uptime [21:31:48] both hosts are up per Icinga UI [21:31:54] just waiting for the bot to say so [21:31:58] 9:31PM up 138 days, 12:16, 1 users, load averages: 0.32, 0.30, 0.26 [21:32:10] mutante: note that last time also, we never got a recovery on Splunk [21:32:17] see the task above that swfrench-wmf opened [21:32:24] so we may never see the recovery there [21:32:35] aha, interesting [21:32:39] thanks [21:32:58] well, we have 0 hosts down on the web UI. good [21:33:02] host seems to have 138 days uptime [21:33:08] BGP sessions length in days [21:33:14] so yeah, monitoring flap for sure [21:33:15] seeing the same, yeah [21:33:52] the most weird thing is that the two times we have seen this [21:33:56] both have been cr*-magru [21:34:50] do you happen to know what the deal is with the mgmt network for accessing these? 
e.g., I wonder if _it_ might have a problem [21:35:25] asw1-b3-magru.mgmt and asw1-b4-magru.mgmt [21:35:27] both look fine though [21:35:42] I mean to the extent I can see [21:35:44] by the way, tomorrow we are switching to the new monitoring server [21:37:18] sukhe: got it, thanks for looking [21:37:30] let me leave a comment on that ticket [21:37:41] mutante: thanks, was going to update [21:37:43] please do [21:38:15] thanks, mutante [21:38:40] swfrench-wmf: [21:38:40] Sep 17 21:27:39 alert2002 icinga[2078676]: SERVICE ALERT: asw1-b3-magru.mgmt;BGP status;CRITICAL;HARD;2;BGP CRITICAL - No response from remote host "10.140.128.4" [21:39:02] ahhhh [21:39:19] Sep 17 21:26:36 alert2002 icinga[2078676]: SERVICE ALERT: asw1-b3-magru.mgmt;BGP status;CRITICAL;SOFT;1;BGP CRITICAL - No response from remote host "10.140.128.4" [21:39:22] Sep 17 21:26:37 alert2002 icinga[2078676]: SERVICE ALERT: asw1-b4-magru.mgmt;BGP status;CRITICAL;SOFT;1;BGP CRITICAL - No response from remote host "10.140.128.5" [21:39:25] Sep 17 21:26:47 alert2002 icinga[2078676]: HOST ALERT: asw1-b3-magru.mgmt;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100% [21:39:28] Sep 17 21:26:49 alert2002 icinga[2078676]: HOST ALERT: asw1-b4-magru.mgmt;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100% [21:39:31] I should stop spamming here but yeah [21:39:50] should we call this resolved now? [21:40:05] not the ticket, the incident in VO [21:40:11] mutante: yeah you should mark it because it won't fix itself [21:40:17] which is another issue that olly is looking into [21:40:21] I just saw your comment about that.. 
[21:40:29] which, again, is weird because it should resolve itself because it clearly did resolve in Icinga [21:41:06] !incidents [21:41:07] 5178 (RESOLVED) Host cr2-magru - PING - Packet loss = 100% [21:41:07] 5177 (RESOLVED) Host cr1-magru - PING - Packet loss = 100% [21:41:16] I have to leave now but please note (since it's new), if it happens again: cumin host, sudo cookbook sre.dns.admin depool magru [21:41:19] thanks both [21:41:25] agreed, that should happen by itself [21:41:32] thanks sukhe :) [21:41:36] (in case they are actually down which I doubt!) [21:41:39] thank you both as well! [21:41:40] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on puppetmaster1003 - https://phabricator.wikimedia.org/T374901#10155444 (10Jclark-ctr) @MoritzMuehlenhoff So i did notice that idrac held onto previous drive serial number in inventory. We have had a number of issues with mdadm raids recently failing right af... [21:43:20] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10Release-Engineering-Team (Seen): Support running puppet Beaker on CI - https://phabricator.wikimedia.org/T253635#10155446 (10hashar) Acceptance tests run by CI would be quite nice to have, specially for non SRE since that helps build confidence a given pa... 
[21:49:00] (03PS9) 10Bking: flink-app: customize calico label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) [21:51:14] (03CR) 10Bking: [C:03+1] cloudnative-pg-cluster: set sane defaults values for PG clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073392 (https://phabricator.wikimedia.org/T372278) (owner: 10Brouberol) [21:54:33] (03CR) 10Bking: [C:03+1] airflow: ensure each airflow release store logs to a unique s3 bucket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073445 (https://phabricator.wikimedia.org/T372281) (owner: 10Brouberol) [21:56:48] (03PS1) 10Ebernhardson: Add a private variant of the cirrus update stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073565 (https://phabricator.wikimedia.org/T374335) [21:59:23] PROBLEM - Uncommitted DNS changes in Netbox on netbox1003 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:59:36] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on puppetmaster1003 - https://phabricator.wikimedia.org/T374901#10155465 (10Jclark-ctr) @MoritzMuehlenhoff Sda has been replaced Again New Disk ` Device: /dev/sda ID_SERIAL=SSDSC2KG240G7R_PHYM812600BL240AGN ID_SERIAL_SHORT=PHYM812600BL240AGN ID_PATH=pci-000... 
[22:00:00] (03PS1) 10Ebernhardson: [WIP] cirrus: Read from public and private streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073566 (https://phabricator.wikimedia.org/T374335) [22:00:18] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on puppetmaster1003 - https://phabricator.wikimedia.org/T374901#10155467 (10Jclark-ctr) Hardware inventory in idrac still list old even after idrac reboot SerialNumber PHYM812500AP240AGN [22:10:22] (03PS3) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [22:13:23] (03CR) 10Bking: cloudnative-pg-cluster: ensure each cloudnativePG cluster is assigned a unique s3 bucket (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073446 (https://phabricator.wikimedia.org/T374938) (owner: 10Brouberol) [22:18:48] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10155494 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Rebalanced pdu [22:23:53] (03CR) 10CI reject: [V:04-1] sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [22:43:55] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs2023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:50:03] 10ops-eqiad, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037 (10phaultfinder) 03NEW [22:51:39] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install civi2002, frpig2002, frpm2002 - https://phabricator.wikimedia.org/T369937#10155528 (10Dwisehaupt) 05Open→03Resolved All three hosts are built. Any follow on configuration or adjustments will be in new tasks. 
[23:12:38] (03PS2) 10Dzahn: peopleweb: raise home dir size limit, make recipient configurable [puppet] - 10https://gerrit.wikimedia.org/r/1073526 (https://phabricator.wikimedia.org/T343364) [23:16:57] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1073526/4008/people1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1073526 (https://phabricator.wikimedia.org/T343364) (owner: 10Dzahn) [23:17:38] (03CR) 10Dzahn: [C:03+2] peopleweb: raise home dir size limit, make recipient configurable [puppet] - 10https://gerrit.wikimedia.org/r/1073526 (https://phabricator.wikimedia.org/T343364) (owner: 10Dzahn) [23:38:20] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1073570 [23:38:20] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1073570 (owner: 10TrainBranchBot) [23:56:33] dzahn@cumin2002 dzahn: The backup on gitlab2002 is complete, ready to proceed with upgrade. [23:56:33] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab2002.wikimedia.org with reason: security release 20240917