[00:03:11] PROBLEM - SSH on puppetserver1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:03:21] PROBLEM - Hadoop NodeManager on an-worker1130 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:06:03] RECOVERY - SSH on puppetserver1003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:09:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [00:10:24] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071722 (owner: 10TrainBranchBot) [00:10:35] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:14:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [00:15:21] RECOVERY - Hadoop NodeManager on an-worker1130 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process 
[00:22:17] RECOVERY - Hadoop NodeManager on an-worker1136 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:27:19] 06SRE, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Grant Access to logstash for Jeremyb - https://phabricator.wikimedia.org/T374406 (10jeremyb) 03NEW [00:28:29] (03PS12) 10Ejegg: Assign the API portal to the Wikimedia group for CentralNotice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649942 (https://phabricator.wikimedia.org/T270308) [00:32:03] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 217, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:32:29] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:51:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [01:06:53] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new firewalls - https://phabricator.wikimedia.org/T374176#10132300 (10Papaul) [01:08:08] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.43.0-wmf.22 [core] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1071725 (https://phabricator.wikimedia.org/T373641) [01:08:10] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.43.0-wmf.22 [core] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1071725 (https://phabricator.wikimedia.org/T373641) (owner: 
10TrainBranchBot) [01:08:25] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new firewalls - https://phabricator.wikimedia.org/T374176#10132304 (10Papaul) Cluster creation complete ` root@pfw1-codfw# run show chassis cluster status Cluster ID: 1 Node Priority Status... [01:08:27] 10ops-codfw, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T374407 (10phaultfinder) 03NEW [01:11:20] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new firewalls - https://phabricator.wikimedia.org/T374176#10132313 (10Papaul) [01:31:46] (03Merged) 10jenkins-bot: Branch commit for wmf/1.43.0-wmf.22 [core] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1071725 (https://phabricator.wikimedia.org/T373641) (owner: 10TrainBranchBot) [01:33:44] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [01:38:44] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [01:48:50] (03PS1) 10Bartosz Dziewoński: Remove unused settings removed in T339959 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071728 [01:51:15] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw 
in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [01:52:35] PROBLEM - Hadoop NodeManager on an-worker1170 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [01:59:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T0200) [02:04:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [02:13:35] RECOVERY - Hadoop NodeManager on an-worker1170 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:14:39] PROBLEM - MD RAID on wikikube-worker2092 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [02:14:40] ACKNOWLEDGEMENT - MD RAID on wikikube-worker2092 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T374409 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [02:14:52] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409 (10ops-monitoring-bot) 03NEW [02:20:57] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [02:21:14] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [02:23:09] (03PS1) 10Jcrespo: backup: Increase the number of max volumes on the productionEqiad pool [puppet] - 10https://gerrit.wikimedia.org/r/1071730 (https://phabricator.wikimedia.org/T374410) [02:25:00] (03PS2) 10Krinkle: errorpage: Include request ID early in HTML source [puppet] - 10https://gerrit.wikimedia.org/r/1071715 (https://phabricator.wikimedia.org/T291192) (owner: 10Lucas Werkmeister) [02:26:04] (03CR) 10Krinkle: [C:03+1] errorpage: Include request ID early in HTML source [puppet] - 10https://gerrit.wikimedia.org/r/1071715 (https://phabricator.wikimedia.org/T291192) (owner: 10Lucas Werkmeister) [02:26:19] (03CR) 10Krinkle: [C:03+1] errorpage: Remove 
redundant 'unknown' $reqId fallback [puppet] - 10https://gerrit.wikimedia.org/r/1071714 (owner: 10Lucas Werkmeister) [02:27:21] (03CR) 10Jcrespo: [C:03+2] backup: Increase the number of max volumes on the productionEqiad pool [puppet] - 10https://gerrit.wikimedia.org/r/1071730 (https://phabricator.wikimedia.org/T374410) (owner: 10Jcrespo) [02:36:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:38:26] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T374407#10132428 (10phaultfinder) [02:48:54] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T374407#10132450 (10phaultfinder) [02:55:58] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission frdb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371629#10132451 (10Dwisehaupt) [02:58:40] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission payments2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371631#10132458 (10Dwisehaupt) [02:59:09] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission payments2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371630#10132459 (10Dwisehaupt) [02:59:44] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, 
skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T0300) [03:00:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [03:00:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:01:28] (03PS1) 10TrainBranchBot: testwikis to 1.43.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071733 (https://phabricator.wikimedia.org/T373641) [03:01:30] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.43.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071733 (https://phabricator.wikimedia.org/T373641) (owner: 10TrainBranchBot) [03:02:11] (03Merged) 10jenkins-bot: testwikis to 1.43.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071733 (https://phabricator.wikimedia.org/T373641) (owner: 10TrainBranchBot) [03:02:32] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.22 refs T373641 [03:02:35] T373641: 1.43.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T373641 [03:05:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [03:19:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: 
cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [03:20:13] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [03:30:13] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [03:45:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 1.268s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:45:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:47:39] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.43.0-wmf.22 refs T373641 (duration: 45m 06s) [03:47:52] T373641: 1.43.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T373641 [03:50:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 1.268s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:50:46] (03CR) 10Dzahn: [C:03+1] backup: Increase the number of max volumes on the productionEqiad pool [puppet] - 10https://gerrit.wikimedia.org/r/1071730 (https://phabricator.wikimedia.org/T374410) (owner: 10Jcrespo) [03:51:59] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers wikikube-worker2102.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2017.codfw.wmnet, kubernetes2024.codfw.wmnet, wikikube-worker2036.codfw.wmnet, parse2009.codfw.wmnet, wikikube-worker2084.codfw.wmnet, mw2443.codfw.wmnet, mw2337.codfw.wmnet, wikikube-worker2043.codfw.wmnet, wikikube-worker2041.codfw.wmnet, [03:51:59] e-worker2002.codfw.wmnet, mw2313.codfw.wmnet, kubernetes2042.codfw.wmnet, mw2444.codfw.wmnet, wikikube-worker2075.codfw.wmnet, wikikube-worker2018.codfw.wmnet, wikikube-worker2048.codfw.wmnet, kubernetes2044.codfw.wmnet, wikikube-worker2073.codfw.wmnet, wikikube-worker2106.codfw.wmnet, mw2301.codfw.wmnet, kubernetes2040.codfw.wmnet, mw2417.codfw.wmnet, mw2372.codfw.wmnet, parse2008.codfw.wmnet, mw2376.codfw.wmnet, mw2426.codfw.wmnet, wiki [03:51:59] ker2066.codfw.wmnet, wikikube-worker2003.codfw.wmnet, mw2447.codfw.wmnet, wikikube-worker2088.codfw.wmnet, wikikube-worker2004.codfw.wmnet, mw2414.codfw.wmnet, mw2450.codfw.wmnet, wikik https://wikitech.wikimedia.org/wiki/PyBal [03:52:56] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [03:52:59] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:57:56] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T0400) [04:01:00] !log mwpresync@deploy1003 Pruned MediaWiki: 1.43.0-wmf.19 (duration: 00m 58s) [04:12:52] (03PS1) 10KartikMistry: Update cxserver to 2024-08-28-053620-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071740 [04:18:09] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:30:44] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [04:40:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - 
https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [04:49:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [04:51:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [05:04:44] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [05:07:03] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers wikikube-worker2086.codfw.wmnet, mw2375.codfw.wmnet, mw2338.codfw.wmnet, parse2009.codfw.wmnet, parse2003.codfw.wmnet, wikikube-worker2083.codfw.wmnet, wikikube-worker2071.codfw.wmnet, parse2004.codfw.wmnet, kubernetes2050.codfw.wmnet, kubernetes2056.codfw.wmnet, wikikube-worker2043.codfw.wmnet, wikikube-worker2065.codf [05:07:03] mw2313.codfw.wmnet, mw2302.codfw.wmnet, wikikube-worker2089.codfw.wmnet, kubernetes2016.codfw.wmnet, 
mw2353.codfw.wmnet, mw2394.codfw.wmnet, mw2314.codfw.wmnet, kubernetes2042.codfw.wmnet, wikikube-worker2098.codfw.wmnet, wikikube-worker2105.codfw.wmnet, wikikube-worker2014.codfw.wmnet, wikikube-worker2101.codfw.wmnet, mw2444.codfw.wmnet, wikikube-worker2075.codfw.wmnet, wikikube-worker2048.codfw.wmnet, kubernetes2051.codfw.wmnet, wikiku [05:07:03] r2106.codfw.wmnet, mw2336.codfw.wmnet, mw2416.codfw.wmnet, mw2372.codfw.wmnet, parse2014.codfw.wmnet, mw2395.codfw.wmnet, mw2426.codfw.wmnet, wikikube-worker2066.codfw.wmnet, mw2442.cod https://wikitech.wikimedia.org/wiki/PyBal [05:07:03] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers mw2396.codfw.wmnet, parse2001.codfw.wmnet, wikikube-worker2086.codfw.wmnet, wikikube-worker2063.codfw.wmnet, kubernetes2048.codfw.wmnet, wikikube-worker2040.codfw.wmnet, wikikube-worker2044.codfw.wmnet, wikikube-worker2010.codfw.wmnet, parse2020.codfw.wmnet, mw2419.codfw.wmnet, wikikube-worker2097.codfw.wmnet, wikikube- [05:07:03] 65.codfw.wmnet, kubernetes2039.codfw.wmnet, wikikube-worker2105.codfw.wmnet, mw2304.codfw.wmnet, wikikube-worker2075.codfw.wmnet, wikikube-worker2048.codfw.wmnet, wikikube-worker2028.codfw.wmnet, parse2014.codfw.wmnet, parse2008.codfw.wmnet, mw2376.codfw.wmnet, wikikube-worker2024.codfw.wmnet, mw2426.codfw.wmnet, wikikube-worker2066.codfw.wmnet, wikikube-worker2031.codfw.wmnet, wikikube-worker2003.codfw.wmnet, wikikube-worker2088.codfw.wm [05:07:03] ikube-worker2037.codfw.wmnet, mw2374.codfw.wmnet, mw2373.codfw.wmnet, wikikube-worker2008.codfw.wmnet, mw2305.codfw.wmnet, kubernetes2045.codfw.wmnet, mw2350.codfw.wmnet, wikikube-worke https://wikitech.wikimedia.org/wiki/PyBal [05:08:01] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:08:03] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [05:14:01] (03Abandoned) 10Pppery: WIP: Add wmf-config changes for mos: interwiki hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051814 (https://phabricator.wikimedia.org/T363538) (owner: 10Pppery) [05:22:19] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - No response from remote host 195.200.68.129 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:23:11] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:23:17] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:25:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [05:33:00] quick cxserver update.. 
[05:33:36] (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2024-08-28-053620-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071740 (owner: 10KartikMistry) [05:35:01] (03Merged) 10jenkins-bot: Update cxserver to 2024-08-28-053620-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071740 (owner: 10KartikMistry) [05:36:43] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply [05:37:03] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply [05:40:44] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [05:46:54] !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply [05:47:28] !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [05:58:54] (03CR) 10Arnaudb: "Accidental pool is a risk in that case indeed, but in the case of T373579 and productionizing servers, I try to do it in a "2 step" motion" [puppet] - 10https://gerrit.wikimedia.org/r/1071639 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:07:25] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:07:25] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:10:13] (03CR) 10Muehlenhoff: [C:03+2] Add Cumin alias for ncmonitor [puppet] - 10https://gerrit.wikimedia.org/r/1071606 (owner: 10Muehlenhoff) [06:10:30] !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply [06:11:06] !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [06:11:24] (03CR) 10Arnaudb: [C:03+2] mariadb: wipe pc1017 pc2017 [puppet] - 10https://gerrit.wikimedia.org/r/1071623 (https://phabricator.wikimedia.org/T374355) (owner: 10Arnaudb) [06:11:29] !log Updated cxserver to 2024-08-28-053620-production [06:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:40] (03CR) 10Muehlenhoff: [C:03+2] mx: Enable profile::auto_restarts::service for rspamd [puppet] - 10https://gerrit.wikimedia.org/r/1071564 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [06:12:26] moritzm: we're multimerging [06:12:31] should I go with your patch? [06:12:35] please do [06:12:40] ack :) [06:13:38] {{done}} [06:14:05] (03CR) 10Muehlenhoff: "I've updated the commit message to clarify that this is about /etc/network/interfaces, not Ganeti itself." 
[puppet] - 10https://gerrit.wikimedia.org/r/1071199 (owner: 10Muehlenhoff) [06:15:43] (03PS2) 10Muehlenhoff: ganeti: Install bridge-utils on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1071199 [06:15:48] thanks [06:16:27] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host pc1017.eqiad.wmnet with OS bookworm [06:17:27] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:17:29] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:18:36] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host pc2017.codfw.wmnet with OS bookworm [06:18:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [06:20:57] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [06:27:03] (03CR) 10Muehlenhoff: [C:03+2] ganeti: Install bridge-utils on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1071199 (owner: 10Muehlenhoff) [06:31:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1017.eqiad.wmnet with reason: host reimage [06:34:16] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on pc1017.eqiad.wmnet with reason: host reimage [06:37:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:37:27] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:37:57] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2017.codfw.wmnet with reason: host reimage [06:41:06] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2017.codfw.wmnet with reason: host reimage [06:48:30] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow [06:48:40] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow (duration: 00m 10s) [06:49:54] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc1017.eqiad.wmnet with OS bookworm [06:55:46] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:55:58] (03CR) 10Elukey: [C:03+1] Rebuild against latest package versions in bookworm: (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1071638 (owner: 10Muehlenhoff) [06:56:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 10 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060433 (https://phabricator.wikimedia.org/T371401) (owner: 10DCausse) 
[06:57:29] (03CR) 10Elukey: "It is way easier to just do it now, trust me it should really be easy to do. If you look for SREBatchRunnerBase in the cookbooks repo you'" [cookbooks] - 10https://gerrit.wikimedia.org/r/1063167 (https://phabricator.wikimedia.org/T363665) (owner: 10Arnaudb) [06:57:30] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2017.codfw.wmnet with OS bookworm [06:59:30] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:00:05] Amir1 and Urbanecm: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T0700). [07:00:05] dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:18] o/ [07:00:22] I can deploy [07:01:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060433 (https://phabricator.wikimedia.org/T371401) (owner: 10DCausse) [07:01:45] (03Merged) 10jenkins-bot: search: use the stem field when searching mul labels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060433 (https://phabricator.wikimedia.org/T371401) (owner: 10DCausse) [07:03:15] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1060433|search: use the stem field when searching mul labels (T371401)]] [07:03:18] T371401: Adapt search ranking for mul language code - https://phabricator.wikimedia.org/T371401 [07:07:08] (03CR) 10Muehlenhoff: [C:03+2] Don't uninstall libnet-dns-perl when moving from ferm to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1070273 (https://phabricator.wikimedia.org/T373637) (owner: 10Muehlenhoff) [07:07:30] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:07:46] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:09:52] (03PS1) 10Kosta Harlan: ipoid: Set activeDeadlineSeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071752 (https://phabricator.wikimedia.org/T374414) [07:10:40] !log dcausse@deploy1003 dcausse: Backport for [[gerrit:1060433|search: use the stem field when searching mul labels (T371401)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:10:46] T371401: Adapt search ranking for mul language code - https://phabricator.wikimedia.org/T371401 [07:13:28] (03CR) 10Muehlenhoff: "I've just merged 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/1070273" [puppet] - 10https://gerrit.wikimedia.org/r/1055495 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [07:15:53] !log dcausse@deploy1003 dcausse: Continuing with sync [07:18:44] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [07:19:42] (03CR) 10Muehlenhoff: [C:03+2] Rebuild against latest package versions in bookworm: [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1071638 (owner: 10Muehlenhoff) [07:20:38] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1060433|search: use the stem field when searching mul labels (T371401)]] (duration: 17m 22s) [07:20:41] T371401: Adapt search ranking for mul language code - https://phabricator.wikimedia.org/T371401 [07:22:58] deploy done [07:29:11] (03PS2) 10Arnaudb: mariadb: pc1017 pc2017 back to normal [puppet] - 10https://gerrit.wikimedia.org/r/1071750 (https://phabricator.wikimedia.org/T374355) [07:29:11] (03CR) 10Arnaudb: "the db999 thingy did not wiped the /srv partition, which was initially empty (I created a dummy test dir to check). 
I wanted to see if I r" [puppet] - 10https://gerrit.wikimedia.org/r/1071750 (https://phabricator.wikimedia.org/T374355) (owner: 10Arnaudb) [07:31:14] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [07:37:25] FIRING: SystemdUnitFailed: user@0.service on ml-staging-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:37:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:37:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:40:06] (03CR) 10JMeybohm: [C:03+2] renumber-node: Allow the cookbook to run for kubestage nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1071071 (owner: 10JMeybohm) [07:45:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:46:14] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - 
https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [07:51:31] (03PS1) 10Elukey: blubber: force rebuild to pick up git upgrades [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1071802 (https://phabricator.wikimedia.org/T373976) [07:51:53] (03Merged) 10jenkins-bot: renumber-node: Allow the cookbook to run for kubestage nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1071071 (owner: 10JMeybohm) [07:53:31] (03PS1) 10Muehlenhoff: thumbor: Bump image to latest package versions in bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071803 [07:54:50] (03PS1) 10DCausse: cirrus-streaming-updater: increase s3.socket-timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071805 [07:55:27] !log evacuating leadership for all partitions assigned to broker id 2002 on kafka-main-codfw - T363210 [07:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:30] T363210: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210 [07:56:50] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:57:34] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:58:14] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - 
https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [08:00:39] (03CR) 10Klausman: [C:03+1] knative: change images ownership to ml [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1071630 (https://phabricator.wikimedia.org/T374233) (owner: 10Ilias Sarantopoulos) [08:02:10] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1071802 (https://phabricator.wikimedia.org/T373976) (owner: 10Elukey) [08:03:08] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: increase s3.socket-timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071805 (owner: 10DCausse) [08:03:24] (03CR) 10Cathal Mooney: Use global unicast to peer from cephosd but allow LL for BFD in (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071677 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [08:04:05] (03CR) 10Cathal Mooney: [C:03+2] Use global unicast to peer from cephosd but allow LL for BFD in [puppet] - 10https://gerrit.wikimedia.org/r/1071677 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [08:04:26] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops-radar, 13Patch-For-Review: Race condition on puppetdb in sre.hosts.rename cookbook - https://phabricator.wikimedia.org/T374351#10132710 (10Volans) While the above is totally true the probability that a rename+reimage happens exactly at the time of the... 
[08:04:27] (03Merged) 10jenkins-bot: cirrus-streaming-updater: increase s3.socket-timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071805 (owner: 10DCausse) [08:05:01] (03PS2) 10Brouberol: airflow-test-k8s: integrate directly with the datahub REST API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071675 (https://phabricator.wikimedia.org/T374384) [08:07:03] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:08:21] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:08:38] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10132713 (10ABran-WMF) hello @wiki_willy, this host has been depooled for a few days now, is there anything that can be done on our side to help diagnose the host outside of pasting the e... [08:09:12] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1246.eqiad.wmnet with reason: https://phabricator.wikimedia.org/T374215 → server depooled has hardware issues [08:09:15] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1246.eqiad.wmnet with reason: https://phabricator.wikimedia.org/T374215 → server depooled has hardware issues [08:09:16] RECOVERY - BFD status on lsw1-e1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:10:10] (03CR) 10JMeybohm: [C:04-1] "There is no diff in CI meaning that we the parameter is probably not passed down into the template generating the cronjobs. 
I can double c" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071752 (https://phabricator.wikimedia.org/T374414) (owner: 10Kosta Harlan) [08:11:02] RECOVERY - BFD status on lsw1-f2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:11:11] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1071553 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [08:12:50] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:13:12] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:13:17] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops-radar, 13Patch-For-Review: Race condition on puppetdb in sre.hosts.rename cookbook - https://phabricator.wikimedia.org/T374351#10132717 (10MoritzMuehlenhoff) >>! In T374351#10132710, @Volans wrote: > Your problem is not a Puppet run and disabling pupp... 
[08:13:32] (03CR) 10Klausman: [C:03+2] knative: change images ownership to ml [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1071630 (https://phabricator.wikimedia.org/T374233) (owner: 10Ilias Sarantopoulos) [08:13:50] (03CR) 10Klausman: [V:03+2 C:03+2] knative: change images ownership to ml [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1071630 (https://phabricator.wikimedia.org/T374233) (owner: 10Ilias Sarantopoulos) [08:15:07] (03CR) 10Arnaudb: mariadb: pc1017 pc2017 back to normal (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1071750 (https://phabricator.wikimedia.org/T374355) (owner: 10Arnaudb) [08:15:36] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544#10132731 (10dcaro) [08:16:05] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10132734 (10MoritzMuehlenhoff) [08:16:37] (03CR) 10Elukey: [V:03+1 C:04-1] "Self -1, I think that an ad-hoc httpd profile that exposes the SHA1 files is enough, and probably way cleaner." 
[puppet] - 10https://gerrit.wikimedia.org/r/1071620 (https://phabricator.wikimedia.org/T366355) (owner: 10Elukey) [08:18:14] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [08:19:15] (03CR) 10Volans: spicerack: allow running by non-ops (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1067301 (owner: 10David Caro) [08:21:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: asw2-d2-eqid <-> asw2-d4-eqiad vcp link flapping - https://phabricator.wikimedia.org/T374272#10132749 (10cmooney) 05Open→03Resolved a:03cmooney Still all looking good, there have been no logs or cases the interface reported d... 
[08:23:03] (03CR) 10Jcrespo: "The patch look ok, similar config to the other parsercaches but I am unsure this can be merged as is, as I belive shard is a compulsory pa" [puppet] - 10https://gerrit.wikimedia.org/r/1071750 (https://phabricator.wikimedia.org/T374355) (owner: 10Arnaudb) [08:23:35] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kafka-main[2002,2007].codfw.wmnet with reason: Hardware refresh [08:23:50] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kafka-main[2002,2007].codfw.wmnet with reason: Hardware refresh [08:23:59] (03PS1) 10Brouberol: airflow-test-k8s: enable datahub_gms_prod conneciton [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071807 (https://phabricator.wikimedia.org/T374384) [08:24:38] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544#10132754 (10dcaro) @cmooney @VRiley-WMF Hi! I'm almost done draining the rack, we can try to find a slot starting n... 
[08:25:04] (03CR) 10Volans: "post merge comment" [cookbooks] - 10https://gerrit.wikimedia.org/r/1070903 (owner: 10Clément Goubert) [08:26:32] I am going to restart the CI Jenkins [08:26:47] (03CR) 10David Caro: [V:03+1] spicerack: allow running by non-ops (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1067301 (owner: 10David Caro) [08:27:53] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544#10132768 (10dcaro) [08:30:44] (03CR) 10Btullis: [C:03+1] datahub-gms: create a Service to allow inter-kube communication [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071673 (https://phabricator.wikimedia.org/T374384) (owner: 10Brouberol) [08:31:38] (03PS1) 10Cathal Mooney: Correct typo in neighbor IPv6 address for cephosd1004 [puppet] - 10https://gerrit.wikimedia.org/r/1071808 (https://phabricator.wikimedia.org/T374379) [08:34:27] (03CR) 10JMeybohm: [C:03+2] kafka-main: Replace kafka-main2002 with kafka-main2007 [puppet] - 10https://gerrit.wikimedia.org/r/1071610 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [08:34:50] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, please also consider folding this change into the general change such as https://gerrit.wikimedia.org/r/c/operations/puppet/+/106482" [puppet] - 10https://gerrit.wikimedia.org/r/1071700 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [08:34:55] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, please also consider folding this change into the general change such as https://gerrit.wikimedia.org/r/c/operations/puppet/+/106482" [puppet] - 10https://gerrit.wikimedia.org/r/1071701 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [08:35:57] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T233089) (owner: 10Cwhite) [08:36:18] (03CR) 10Filippo Giunchedi: [C:03+1] puppet8: account for unknown probe types [puppet] - 10https://gerrit.wikimedia.org/r/1071031 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [08:36:58] (03PS1) 10Volans: test-cookbook: read spicerack config with sudo [puppet] - 10https://gerrit.wikimedia.org/r/1071810 [08:37:16] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers mw2424.codfw.wmnet, wikikube-worker2021.codfw.wmnet, parse2001.codfw.wmnet, parse2017.codfw.wmnet, kubernetes2056.codfw.wmnet, parse2006.codfw.wmnet, wikikube-worker2036.codfw.wmnet, kubernetes2014.codfw.wmnet, wikikube-worker2076.codfw.wmnet, mw2315.codfw.wmnet, mw2351.codfw.wmnet, wikikube-worker2022.codfw.wmnet, wiki [08:37:16] ker2052.codfw.wmnet, wikikube-worker2043.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikikube-worker2041.codfw.wmnet, mw2313.codfw.wmnet, wikikube-worker2055.codfw.wmnet, wikikube-worker2089.codfw.wmnet, wikikube-worker2062.codfw.wmnet, kubernetes2016.codfw.wmnet, parse2012.codfw.wmnet, mw2353.codfw.wmnet, mw2449.codfw.wmnet, mw2413.codfw.wmnet, mw2356.codfw.wmnet, mw2314.codfw.wmnet, wikikube-worker2098.codfw.wmnet, wikikube-worker2105 [08:37:16] mnet, kubernetes2013.codfw.wmnet, mw2304.codfw.wmnet, wikikube-worker2101.codfw.wmnet, wikikube-worker2075.codfw.wmnet, wikikube-worker2018.codfw.wmnet, wikikube-worker2048.codfw.wmnet, https://wikitech.wikimedia.org/wiki/PyBal [08:37:18] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers kubernetes2046.codfw.wmnet, mw2396.codfw.wmnet, wikikube-worker2017.codfw.wmnet, wikikube-worker2026.codfw.wmnet, kubernetes2024.codfw.wmnet, mw2338.codfw.wmnet, mw2447.codfw.wmnet, kubernetes2014.codfw.wmnet, parse2003.codfw.wmnet, parse2018.codfw.wmnet, 
kubernetes2050.codfw.wmnet, mw2351.codfw.wmnet, mw2427.codfw.wmne [08:37:18] ube-worker2027.codfw.wmnet, wikikube-worker2030.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikikube-worker2002.codfw.wmnet, mw2302.codfw.wmnet, parse2012.codfw.wmnet, mw2440.codfw.wmnet, kubernetes2042.codfw.wmnet, wikikube-worker2101.codfw.wmnet, wikikube-worker2018.codfw.wmnet, wikikube-worker2087.codfw.wmnet, wikikube-worker2073.codfw.wmnet, wikikube-worker2106.codfw.wmnet, mw2336.codfw.wmnet, mw2416.codfw.wmnet, mw2372.codfw.wmnet, [08:37:18] 08.codfw.wmnet, mw2426.codfw.wmnet, wikikube-worker2031.codfw.wmnet, wikikube-worker2003.codfw.wmnet, wikikube-worker2004.codfw.wmnet, mw2414.codfw.wmnet, mw2335.codfw.wmnet, wikikube-w https://wikitech.wikimedia.org/wiki/PyBal [08:37:37] (03CR) 10Volans: spicerack: allow running by non-ops (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1067301 (owner: 10David Caro) [08:38:16] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:38:16] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:39:06] !log installing Java security updates on puppetservers [08:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:20] (03CR) 10Brouberol: [C:03+2] datahub-gms: create a Service to allow inter-kube communication [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071673 (https://phabricator.wikimedia.org/T374384) (owner: 10Brouberol) [08:39:33] (03CR) 10CI reject: [V:04-1] test-cookbook: read spicerack config with sudo [puppet] - 10https://gerrit.wikimedia.org/r/1071810 (owner: 10Volans) [08:39:40] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: BFD won't esablish between QFX in VRF and host from IPv6 link-local - https://phabricator.wikimedia.org/T374379#10132799 (10cmooney) Ok patch has been merged and things are ok for 
now. Hosts are configured to peer with the switch unicast I... [08:41:36] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [08:42:29] (03CR) 10Brouberol: "@btullis@wikimedia.org What do you reckon? Should we use datahub-next for the airflow test instance?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071675 (https://phabricator.wikimedia.org/T374384) (owner: 10Brouberol) [08:42:49] (03PS2) 10Volans: test-cookbook: read spicerack config with sudo [puppet] - 10https://gerrit.wikimedia.org/r/1071810 [08:43:56] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1071810 (owner: 10Volans) [08:44:28] !log jayme@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-codfw [08:45:19] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1068898 (https://phabricator.wikimedia.org/T330273) (owner: 10Scott French) [08:45:53] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [08:46:44] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [08:46:47] (03CR) 10jenkins-bot: test-cookbook: read spicerack config with sudo [puppet] - 10https://gerrit.wikimedia.org/r/1071810 (owner: 10Volans) [08:46:48] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [08:47:21] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [08:51:05] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [08:51:37] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: BFD won't esablish between QFX in VRF and host from IPv6 link-local - 
https://phabricator.wikimedia.org/T374379#10132830 (10cmooney) I'll leave this open for now, we will need to make a call on how to proceed here in general, there are two... [08:51:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:56:29] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/1071588 (https://phabricator.wikimedia.org/T374351) (owner: 10Clément Goubert) [08:56:53] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2205 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1071811 (https://phabricator.wikimedia.org/T374421) [08:57:19] (03PS3) 10Brouberol: airflow-test-k8s: integrate directly with the datahub REST API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071675 (https://phabricator.wikimedia.org/T374384) [08:57:19] (03PS2) 10Brouberol: airflow-test-k8s: enable datahub_gms_prod conneciton [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071807 (https://phabricator.wikimedia.org/T374384) [08:57:19] (03PS1) 10Brouberol: airflow-test-k8s: upgrade to an airflow version without datahub telemetry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071812 (https://phabricator.wikimedia.org/T374384) [08:58:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s3 T374421 [08:58:19] T374421: Switchover s3 master (db2209 -> db2205) - https://phabricator.wikimedia.org/T374421 [08:58:38] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s3 T374421 [08:58:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2205 with weight 0 T374421', diff saved to https://phabricator.wikimedia.org/P68761 and previous config 
saved to /var/cache/conftool/dbconfig/20240910-085854-arnaudb.json [08:58:58] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [08:59:05] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [08:59:06] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [08:59:30] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [08:59:32] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [08:59:44] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [08:59:45] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [08:59:50] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on cloudvirt2004-dev - https://phabricator.wikimedia.org/T374422 (10ops-monitoring-bot) 03NEW [09:00:19] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:00:21] !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [09:00:35] !log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [09:00:36] !log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [09:01:12] !log jayme@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [09:01:14] !log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [09:01:22] 06SRE, 06Editing-team, 06Growth-Team, 10MediaWiki-Debug-Logger, and 4 others: Flow internal error on frwiki not in logstash - https://phabricator.wikimedia.org/T371586#10132848 (10Michael) This seems to be the only place in non-maintenance code where this exception is caught: `lang=php,name=includes/Actio... [09:01:50] !log jayme@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [09:01:52] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:02:04] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[09:02:05] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [09:02:15] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:02:43] !log Restarting CI Jenkins [09:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:08] (03PS1) 10Muehlenhoff: nftables-compat-check: Don't flag dscp_default as needing conversion [puppet] - 10https://gerrit.wikimedia.org/r/1071814 [09:03:29] !log jayme@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-codfw [09:04:07] (03PS2) 10Elukey: sre.hosts.provision: improve Supermicro's bios settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1071553 (https://phabricator.wikimedia.org/T365372) [09:06:00] (03Abandoned) 10Muehlenhoff: Fix up Phabricator firewall services, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/1071147 (https://phabricator.wikimedia.org/T370677) (owner: 10Muehlenhoff) [09:06:01] (03PS1) 10Filippo Giunchedi: hieradata: switch prometheus-https service to production [puppet] - 10https://gerrit.wikimedia.org/r/1071815 (https://phabricator.wikimedia.org/T326657) [09:07:20] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers mw2424.codfw.wmnet, wikikube-worker2021.codfw.wmnet, parse2001.codfw.wmnet, parse2006.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2017.codfw.wmnet, mw2375.codfw.wmnet, kubernetes2024.codfw.wmnet, wikikube-worker2084.codfw.wmnet, wikikube-worker2077.codfw.wmnet, kubernetes2059.codfw.wmnet, wikikube-worke [09:07:20] dfw.wmnet, parse2018.codfw.wmnet, wikikube-worker2083.codfw.wmnet, wikikube-worker2071.codfw.wmnet, kubernetes2050.codfw.wmnet, wikikube-worker2010.codfw.wmnet, wikikube-worker2022.codfw.wmnet, mw2427.codfw.wmnet, mw2440.codfw.wmnet, wikikube-worker2082.codfw.wmnet, wikikube-worker2030.codfw.wmnet, wikikube-worker2052.codfw.wmnet, 
wikikube-worker2097.codfw.wmnet, wikikube-worker2060.codfw.wmnet, wikikube-worker2023.codfw.wmnet, mw2359.cod [09:07:20] , wikikube-worker2002.codfw.wmnet, mw2313.codfw.wmnet, wikikube-worker2090.codfw.wmnet, kubernetes2013.codfw.wmnet, wikikube-worker2089.codfw.wmnet, kubernetes2039.codfw.wmnet, wikikube https://wikitech.wikimedia.org/wiki/PyBal [09:07:20] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers mw2424.codfw.wmnet, kubernetes2046.codfw.wmnet, wikikube-worker2079.codfw.wmnet, mw2396.codfw.wmnet, parse2001.codfw.wmnet, parse2017.codfw.wmnet, kubernetes2056.codfw.wmnet, wikikube-worker2017.codfw.wmnet, mw2375.codfw.wmnet, mw2447.codfw.wmnet, mw2368.codfw.wmnet, kubernetes2048.codfw.wmnet, parse2003.codfw.wmnet, wi [09:07:20] orker2076.codfw.wmnet, parse2018.codfw.wmnet, wikikube-worker2083.codfw.wmnet, parse2004.codfw.wmnet, kubernetes2050.codfw.wmnet, wikikube-worker2022.codfw.wmnet, mw2427.codfw.wmnet, wikikube-worker2027.codfw.wmnet, wikikube-worker2030.codfw.wmnet, wikikube-worker2052.codfw.wmnet, wikikube-worker2043.codfw.wmnet, kubernetes2006.codfw.wmnet, wikikube-worker2023.codfw.wmnet, wikikube-worker2002.codfw.wmnet, mw2313.codfw.wmnet, mw2302.codfw. 
[09:07:20] ikikube-worker2055.codfw.wmnet, parse2013.codfw.wmnet, kubernetes2039.codfw.wmnet, kubernetes2016.codfw.wmnet, parse2012.codfw.wmnet, wikikube-worker2045.codfw.wmnet, mw2397.codfw.wmnet https://wikitech.wikimedia.org/wiki/PyBal [09:08:20] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:09:20] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:09:48] (03PS1) 10Filippo Giunchedi: trafficserver: use prometheus svc records for eqiad/codfw [puppet] - 10https://gerrit.wikimedia.org/r/1071816 (https://phabricator.wikimedia.org/T326657) [09:10:07] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:10:28] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:10:31] (03PS1) 10Cathal Mooney: Trust DSCP markings from VMs on routed ganeti hypervisors [puppet] - 10https://gerrit.wikimedia.org/r/1071817 (https://phabricator.wikimedia.org/T374392) [09:10:46] (03CR) 10Hashar: [C:03+1] contint: switch java_home from jdk-11 to jdk-17 [puppet] - 10https://gerrit.wikimedia.org/r/1069327 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [09:11:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T374421 → replag not catching up on exec, ^C to debug', diff saved to https://phabricator.wikimedia.org/P68763 and previous config saved to /var/cache/conftool/dbconfig/20240910-091114-arnaudb.json [09:11:18] T374421: Switchover s3 master (db2209 -> db2205) - https://phabricator.wikimedia.org/T374421 [09:13:39] (03CR) 10Elukey: sre.hosts.provision: improve Supermicro's bios settings (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1071553 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) 
[09:16:11] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Routed Ganeti: Add support for VM QoS marking - https://phabricator.wikimedia.org/T374392#10132903 (10cmooney) It seems the routed ganeti hosts actually use nftables instead. This is nice as it does allow us to match on the incoming interf... [09:16:40] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10132904 (10elukey) @Jhancock.wm @Papaul Hi! If you have time I have another strange thing to figure out. I tried to set `LegacySerialRedirectionPort` = `SOL` to wiki... [09:16:47] (03CR) 10Elukey: sre.hosts.provision: improve Supermicro's bios settings (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1071553 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [09:17:17] (03PS2) 10Cathal Mooney: Trust DSCP markings from VMs on routed ganeti hypervisors [puppet] - 10https://gerrit.wikimedia.org/r/1071817 (https://phabricator.wikimedia.org/T374392) [09:17:39] (03Abandoned) 10Elukey: role::puppetserver: add profile::configmaster [puppet] - 10https://gerrit.wikimedia.org/r/1071620 (https://phabricator.wikimedia.org/T366355) (owner: 10Elukey) [09:17:44] (03Abandoned) 10Elukey: role::puppetmaster::frontend: add magru to the config-master aliases [puppet] - 10https://gerrit.wikimedia.org/r/1071637 (owner: 10Elukey) [09:19:32] (03PS1) 10JMeybohm: Revert "Configure prometheus metrics on the cephosd cluster" [puppet] - 10https://gerrit.wikimedia.org/r/1071819 (https://phabricator.wikimedia.org/T369583) [09:19:55] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Routed Ganeti: Add support for VM QoS marking - https://phabricator.wikimedia.org/T374392#10132909 (10MoritzMuehlenhoff) >>! In T374392#10132903, @cmooney wrote: > It seems the routed ganeti hosts actually use nftables instead. This is nic... 
[09:20:47] (03PS2) 10JMeybohm: Revert "Configure prometheus metrics on the cephosd cluster" [puppet] - 10https://gerrit.wikimedia.org/r/1071819 (https://phabricator.wikimedia.org/T369583) [09:20:50] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071819 (https://phabricator.wikimedia.org/T369583) (owner: 10JMeybohm) [09:21:57] (03CR) 10Btullis: [C:03+1] Revert "Configure prometheus metrics on the cephosd cluster" [puppet] - 10https://gerrit.wikimedia.org/r/1071819 (https://phabricator.wikimedia.org/T369583) (owner: 10JMeybohm) [09:23:03] (03CR) 10Btullis: Revert "Configure prometheus metrics on the cephosd cluster" [puppet] - 10https://gerrit.wikimedia.org/r/1071819 (https://phabricator.wikimedia.org/T369583) (owner: 10JMeybohm) [09:23:21] (03CR) 10CI reject: [V:04-1] Revert "Configure prometheus metrics on the cephosd cluster" [puppet] - 10https://gerrit.wikimedia.org/r/1071819 (https://phabricator.wikimedia.org/T369583) (owner: 10JMeybohm) [09:26:29] (03PS3) 10JMeybohm: Revert "Configure prometheus metrics on the cephosd cluster" [puppet] - 10https://gerrit.wikimedia.org/r/1071819 (https://phabricator.wikimedia.org/T369583) [09:26:45] (03CR) 10Filippo Giunchedi: "Thank you for the patch, given the error I think we can roll forward by changing the class_config title. 
I'll send a PS to this patch" [puppet] - 10https://gerrit.wikimedia.org/r/1071819 (https://phabricator.wikimedia.org/T369583) (owner: 10JMeybohm) [09:26:45] (03CR) 10Cyndywikime: [C:03+1] EventStreamConfig and stream registration for homepage modules analytics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062416 (https://phabricator.wikimedia.org/T370907) (owner: 10Sergio Gimeno) [09:27:08] (03CR) 10Btullis: [C:03+1] Revert "Configure prometheus metrics on the cephosd cluster" [puppet] - 10https://gerrit.wikimedia.org/r/1071819 (https://phabricator.wikimedia.org/T369583) (owner: 10JMeybohm) [09:28:04] jayme: see my comment re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1071819 I think we can roll forward instead and fix the issue, updating the patch [09:28:16] godog: fine by me [09:28:57] wonder why PCC fails in a case there HEAD does not compile but the current change does, though [09:29:59] good question [09:30:02] (03PS4) 10Filippo Giunchedi: prometheus: fix "Configure prometheus metrics on the cephosd cluster" [puppet] - 10https://gerrit.wikimedia.org/r/1071819 (https://phabricator.wikimedia.org/T369583) (owner: 10JMeybohm) [09:30:08] jayme: ^ [09:30:25] should be enough of a fix [09:31:21] (03CR) 10Btullis: [C:03+1] prometheus: fix "Configure prometheus metrics on the cephosd cluster" [puppet] - 10https://gerrit.wikimedia.org/r/1071819 (https://phabricator.wikimedia.org/T369583) (owner: 10JMeybohm) [09:31:52] (03CR) 10Stevemunene: [C:03+1] prometheus: fix "Configure prometheus metrics on the cephosd cluster" [puppet] - 10https://gerrit.wikimedia.org/r/1071819 (https://phabricator.wikimedia.org/T369583) (owner: 10JMeybohm) [09:32:49] (03CR) 10Btullis: [V:03+1 C:03+2] Reduce airflow-analytics log retention from 90 to 60 days [puppet] - 10https://gerrit.wikimedia.org/r/1071621 (https://phabricator.wikimedia.org/T370437) (owner: 10Btullis) [09:33:44] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is 
CRITICAL: 1.599e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [09:34:11] that is me...and it has a silence actually ... [09:35:22] jayme: I'll commandeer the patch if that's ok? also if you are after kafka [09:35:23] ah, only for warning [09:35:33] !log cgoubert@cumin1002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2431.codfw.wmnet [09:35:34] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2431.codfw.wmnet [09:35:41] godog: yeah, sure. I wasn't really into merging it :) [09:35:42] 06SRE, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10132942 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by cgoubert: for 1 hosts: mw2431.codfw.wmnet [09:35:47] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: fix "Configure prometheus metrics on the cephosd cluster" [puppet] - 10https://gerrit.wikimedia.org/r/1071819 (https://phabricator.wikimedia.org/T369583) (owner: 10JMeybohm) [09:35:56] jouncebot: nowandnext [09:35:56] No deployments scheduled for the next 0 hour(s) and 24 minute(s) [09:35:56] In 0 hour(s) and 24 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T1000) [09:36:05] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T374249#10132945 (10Clement_Goubert) >>! In T374249#10131011, @Jhancock.wm wrote: > mw2431 is causing an alert in netbox https://netbox.wikimedia.org/extras/scripts/results/... 
[09:36:12] fair enough [09:36:47] (03CR) 10Hnowlan: [C:03+1] thumbor: Bump image to latest package versions in bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071803 (owner: 10Muehlenhoff) [09:38:39] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373591#10132947 (10Clement_Goubert) >>! In T373591#10131008, @Jhancock.wm wrote: > mw2379 is causing an alert in netbox https://netbox.wikimedia.org/extras/scripts/results/... [09:39:22] gonna deploy a quick testwiki config change [09:39:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hnowlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071659 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [09:40:50] (03CR) 10Btullis: [C:03+1] "Sorry I missed that the first time." [puppet] - 10https://gerrit.wikimedia.org/r/1071808 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [09:41:06] (03CR) 10Vgutierrez: [C:03+1] "new endpoints are reachable from cp servers and TLS material looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1071816 (https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi) [09:42:56] (03PS1) 10Filippo Giunchedi: prometheus: fix analytics ceph server relabel config [puppet] - 10https://gerrit.wikimedia.org/r/1071822 (https://phabricator.wikimedia.org/T369583) [09:43:36] (03CR) 10Clément Goubert: [C:03+1] graphite: remove mw graphite-based alerts [puppet] - 10https://gerrit.wikimedia.org/r/1071193 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi) [09:43:57] (03CR) 10Cathal Mooney: [C:03+2] Correct typo in neighbor IPv6 address for cephosd1004 [puppet] - 10https://gerrit.wikimedia.org/r/1071808 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [09:45:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:45:33] (03CR) 10Vgutierrez: [C:03+1] hieradata: switch prometheus-https service to production [puppet] - 10https://gerrit.wikimedia.org/r/1071815 (https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi) [09:46:23] btullis stevemunene we're almost there https://gerrit.wikimedia.org/r/c/operations/puppet/+/1071822 [09:46:45] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10132983 (10ABran-WMF) db2114 is decommed (see T362948) [09:47:04] (03PS2) 10Hnowlan: Enable Copyupload-allowed-domain on testwiki, disable on test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071659 (https://phabricator.wikimedia.org/T356241) [09:51:09] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: fix analytics ceph server relabel config [puppet] - 10https://gerrit.wikimedia.org/r/1071822 (https://phabricator.wikimedia.org/T369583) (owner: 10Filippo Giunchedi) [09:54:17] (03PS1) 10Clément Goubert: sre.k8s.renumber-node: Use puppet spicerack module [cookbooks] - 10https://gerrit.wikimedia.org/r/1071828 [09:55:42] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904#10133011 (10Ladsgroup) Sorry if it's the wrong team. 
[09:55:56] (03CR) 10JMeybohm: [C:03+2] "Worked nicely with test-cookbook" [cookbooks] - 10https://gerrit.wikimedia.org/r/1071559 (https://phabricator.wikimedia.org/T373189) (owner: 10JMeybohm) [09:56:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:57:54] btullis stevemunene all good: https://prometheus-eqiad.wikimedia.org/analytics/targets?search=#pool-ceph [09:58:52] (03CR) 10Volans: [C:03+1] "LGTM, thanks a lot!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1068896 (https://phabricator.wikimedia.org/T328908) (owner: 10Scott French) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T1000) [10:01:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [10:01:53] (03PS5) 10Effie Mouzeli: cronjobs : update modules (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049571 (https://phabricator.wikimedia.org/T356885) [10:02:02] (03PS24) 10Effie Mouzeli: cronjobs : update modules to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) [10:03:20] thanks godog :) [10:03:39] trying again on my config change :) [10:03:57] (03CR) 10TrainBranchBot: "Approved by hnowlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071659 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [10:04:45] (03Merged) 10jenkins-bot: Enable Copyupload-allowed-domain on testwiki, disable on test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071659 
(https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [10:05:04] !log hnowlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1071659|Enable Copyupload-allowed-domain on testwiki, disable on test2 (T356241)]] [10:05:08] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241 [10:06:06] (03CR) 10Effie Mouzeli: [C:03+2] cronjobs : update modules to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli) [10:08:05] (03CR) 10Volans: [C:03+1] "LGTM, see the reply to the phab update inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1068897 (https://phabricator.wikimedia.org/T330273) (owner: 10Scott French) [10:08:41] (03Merged) 10jenkins-bot: kafka/roll-restart-reboot-brokers: Add exclude and no-election options [cookbooks] - 10https://gerrit.wikimedia.org/r/1071559 (https://phabricator.wikimedia.org/T373189) (owner: 10JMeybohm) [10:08:49] !log hnowlan@deploy1003 hnowlan: Backport for [[gerrit:1071659|Enable Copyupload-allowed-domain on testwiki, disable on test2 (T356241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:08:49] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1068899 (https://phabricator.wikimedia.org/T330273) (owner: 10Scott French) [10:09:39] !log hnowlan@deploy1003 hnowlan: Continuing with sync [10:09:57] (03CR) 10Volans: [C:03+1] "LGTM, thanks for the post-fix!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1071828 (owner: 10Clément Goubert) [10:10:33] (03CR) 10Clément Goubert: [C:03+2] sre.k8s.renumber-node: Use puppet spicerack module [cookbooks] - 10https://gerrit.wikimedia.org/r/1071828 (owner: 10Clément Goubert) [10:11:40] (03CR) 10Clément Goubert: [C:03+2] sre.hosts.rename: Disable puppet and debmonitor [cookbooks] - 10https://gerrit.wikimedia.org/r/1071588 (https://phabricator.wikimedia.org/T374351) (owner: 10Clément Goubert) [10:12:43] (03PS1) 10Cathal Mooney: Revert addition of the IPv6 link-local range used in bird::anycast [puppet] - 10https://gerrit.wikimedia.org/r/1071833 (https://phabricator.wikimedia.org/T374379) [10:13:32] (03PS1) 10Elukey: role::puppetserver: add TLS+HTTP stack to publish SHA1 values [puppet] - 10https://gerrit.wikimedia.org/r/1071834 (https://phabricator.wikimedia.org/T366355) [10:13:51] (03CR) 10Cathal Mooney: [C:03+2] Revert addition of the IPv6 link-local range used in bird::anycast [puppet] - 10https://gerrit.wikimedia.org/r/1071833 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [10:14:02] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1071833 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [10:14:24] (03CR) 10Fabfur: [C:03+1] Revert addition of the IPv6 link-local range used in bird::anycast [puppet] - 10https://gerrit.wikimedia.org/r/1071833 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [10:14:44] !log hnowlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1071659|Enable Copyupload-allowed-domain on testwiki, disable on test2 (T356241)]] (duration: 09m 39s) [10:14:46] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241 [10:15:43] (03PS2) 10Elukey: role::puppetserver: add TLS+HTTP stack to publish SHA1 values [puppet] - 10https://gerrit.wikimedia.org/r/1071834 (https://phabricator.wikimedia.org/T366355) [10:17:07] (03CR) 10Fabfur: 
[C:03+1] "confirmed that with this revert durum hosts aren't on error anymore" [puppet] - 10https://gerrit.wikimedia.org/r/1071833 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [10:17:08] (03PS3) 10Elukey: role::puppetserver: add TLS+HTTP stack to publish SHA1 values [puppet] - 10https://gerrit.wikimedia.org/r/1071834 (https://phabricator.wikimedia.org/T366355) [10:17:31] (03PS4) 10Elukey: role::puppetserver: add TLS+HTTP stack to publish SHA1 values [puppet] - 10https://gerrit.wikimedia.org/r/1071834 (https://phabricator.wikimedia.org/T366355) [10:18:18] (03PS3) 10Ladsgroup: mariadb: pc1017 pc2017 back to normal [puppet] - 10https://gerrit.wikimedia.org/r/1071750 (https://phabricator.wikimedia.org/T374355) (owner: 10Arnaudb) [10:18:19] (03CR) 10Ladsgroup: "I'd say add "Hosts: " footer and check experimental." [puppet] - 10https://gerrit.wikimedia.org/r/1071750 (https://phabricator.wikimedia.org/T374355) (owner: 10Arnaudb) [10:18:22] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3937/co" [puppet] - 10https://gerrit.wikimedia.org/r/1071834 (https://phabricator.wikimedia.org/T366355) (owner: 10Elukey) [10:18:24] (03CR) 10Ladsgroup: [C:04-1] mariadb: pc1017 pc2017 back to normal [puppet] - 10https://gerrit.wikimedia.org/r/1071750 (https://phabricator.wikimedia.org/T374355) (owner: 10Arnaudb) [10:18:33] (03CR) 10Jelto: [C:03+1] "lgtm, I guess the puppet code does not support multiple replicas? I was wondering why vrts2001 is set to insetup again." 
[puppet] - 10https://gerrit.wikimedia.org/r/1070908 (https://phabricator.wikimedia.org/T373420) (owner: 10AOkoth) [10:19:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:20:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 1.043s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:20:24] (03PS5) 10Elukey: role::puppetserver: add TLS+HTTP stack to publish SHA1 values [puppet] - 10https://gerrit.wikimedia.org/r/1071834 (https://phabricator.wikimedia.org/T366355) [10:20:47] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for Philippe Saade - https://phabricator.wikimedia.org/T374008#10133162 (10Ladsgroup) >>! In T374008#10117311, @philippe.saade.WMDE wrote: > Hello @Linda-Rabea.Heyden_WMDE, could you approve the request from WMDE side? Waiting on this now. 
[10:20:57] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [10:21:15] (03CR) 10Elukey: "This is the general idea to avoid deploying profile::configmaster to all the puppetservers, that would bring in a ton of extra things that" [puppet] - 10https://gerrit.wikimedia.org/r/1071834 (https://phabricator.wikimedia.org/T366355) (owner: 10Elukey) [10:22:47] (03Merged) 10jenkins-bot: sre.k8s.renumber-node: Use puppet spicerack module [cookbooks] - 10https://gerrit.wikimedia.org/r/1071828 (owner: 10Clément Goubert) [10:24:48] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops-radar, 13Patch-For-Review: Race condition on puppetdb in sre.hosts.rename cookbook - https://phabricator.wikimedia.org/T374351#10133176 (10Volans) I don't think it does anymore unfortunately... In https://gerrit.wikimedia.org/r/plugins/gitiles/operat... 
[10:24:58] (03CR) 10Btullis: [C:03+1] airflow: enable testing external connections from the CLI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071674 (https://phabricator.wikimedia.org/T374384) (owner: 10Brouberol) [10:25:10] (03CR) 10Btullis: [C:03+1] airflow-test-k8s: upgrade to an airflow version without datahub telemetry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071812 (https://phabricator.wikimedia.org/T374384) (owner: 10Brouberol) [10:25:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 1.043s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:25:21] (03Merged) 10jenkins-bot: sre.hosts.rename: Disable puppet and debmonitor [cookbooks] - 10https://gerrit.wikimedia.org/r/1071588 (https://phabricator.wikimedia.org/T374351) (owner: 10Clément Goubert) [10:26:02] (03CR) 10Brouberol: [C:03+2] airflow: enable testing external connections from the CLI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071674 (https://phabricator.wikimedia.org/T374384) (owner: 10Brouberol) [10:26:09] (03CR) 10Brouberol: [C:03+2] airflow-test-k8s: upgrade to an airflow version without datahub telemetry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071812 (https://phabricator.wikimedia.org/T374384) (owner: 10Brouberol) [10:28:00] (03CR) 10Btullis: [C:03+1] "I am pretty sure that we will still need airflow executor pods to talk to kafka to do their work, but this is good to remove it from the w" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071675 (https://phabricator.wikimedia.org/T374384) (owner: 10Brouberol) [10:28:10] (03CR) 10Btullis: [C:03+1] airflow-test-k8s: enable datahub_gms_prod conneciton [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1071807 (https://phabricator.wikimedia.org/T374384) (owner: 10Brouberol) [10:28:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 23.8% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:28:32] (03CR) 10Brouberol: [C:03+2] airflow-test-k8s: integrate directly with the datahub REST API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071675 (https://phabricator.wikimedia.org/T374384) (owner: 10Brouberol) [10:28:39] (03PS4) 10Brouberol: airflow-test-k8s: integrate directly with the datahub REST API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071675 (https://phabricator.wikimedia.org/T374384) [10:29:05] (03PS1) 10Hokwelum: Remove ResourceLoaderUseObjectCacheForDeps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071838 (https://phabricator.wikimedia.org/T343492) [10:30:32] (03CR) 10Brouberol: [V:03+2 C:03+2] airflow-test-k8s: integrate directly with the datahub REST API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071675 (https://phabricator.wikimedia.org/T374384) (owner: 10Brouberol) [10:30:41] (03PS3) 10Brouberol: airflow-test-k8s: enable datahub_gms_prod conneciton [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071807 (https://phabricator.wikimedia.org/T374384) [10:31:59] (03CR) 10Brouberol: [C:03+2] airflow-test-k8s: enable datahub_gms_prod conneciton [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071807 (https://phabricator.wikimedia.org/T374384) (owner: 10Brouberol) [10:33:42] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:34:41] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:35:03] 10SRE-tools, 
06Infrastructure-Foundations, 06serviceops-radar: Race condition on puppetdb in sre.hosts.rename cookbook - https://phabricator.wikimedia.org/T374351#10133213 (10MoritzMuehlenhoff) >>! In T374351#10133176, @Volans wrote: > I don't think it does anymore unfortunately... > > In https://gerrit.wik... [10:36:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [10:36:50] (03CR) 10Filippo Giunchedi: [C:03+2] graphite: remove mw graphite-based alerts [puppet] - 10https://gerrit.wikimedia.org/r/1071193 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi) [10:37:09] (03PS1) 10Brouberol: airflow-test-k8s: add http:// scheme top the datahub gms URL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071839 (https://phabricator.wikimedia.org/T374384) [10:38:12] (03CR) 10Brouberol: [C:03+2] airflow-test-k8s: add http:// scheme top the datahub gms URL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071839 (https://phabricator.wikimedia.org/T374384) (owner: 10Brouberol) [10:38:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 24.92% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:38:45] (03CR) 10Effie Mouzeli: [C:03+2] cronjobs : update modules (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049571 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli) [10:39:01] (03PS1) 10Vgutierrez: hiera: let purged@codfw|ulsfo use main-eqiad kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/1071841 (https://phabricator.wikimedia.org/T373189) [10:39:43] 
(03Merged) 10jenkins-bot: cronjobs : update modules (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049571 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli) [10:39:50] (03Merged) 10jenkins-bot: cronjobs : update modules to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli) [10:39:54] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071841 (https://phabricator.wikimedia.org/T373189) (owner: 10Vgutierrez) [10:41:16] (03PS2) 10Vgutierrez: hiera: let purged@codfw|ulsfo use main-eqiad kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/1071841 (https://phabricator.wikimedia.org/T373189) [10:41:20] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071841 (https://phabricator.wikimedia.org/T373189) (owner: 10Vgutierrez) [10:41:30] PROBLEM - BFD status on lsw1-e1-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:44:47] !log installing bind9 security updates (client-side tools/libs only) [10:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:56] (03PS1) 10GergesShamon: [arwiki] Change the wordmark and the tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071842 (https://phabricator.wikimedia.org/T374430) [10:45:31] (03CR) 10JMeybohm: [C:03+1] "Sounds reasonable, lets try" [puppet] - 10https://gerrit.wikimedia.org/r/1071841 (https://phabricator.wikimedia.org/T373189) (owner: 10Vgutierrez) [10:45:42] (03CR) 10Vgutierrez: [C:03+2] hiera: let purged@codfw|ulsfo use main-eqiad kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/1071841 (https://phabricator.wikimedia.org/T373189) (owner: 10Vgutierrez) [10:46:18] PROBLEM - BFD status on lsw1-f2-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:46:51] (03PS1) 
10Effie Mouzeli: ipoid: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071843 (https://phabricator.wikimedia.org/T356885) [10:47:25] RESOLVED: SystemdUnitFailed: user@0.service on ml-staging-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:49:01] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:49:59] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:50:45] <_Gerges> jouncebot: next [10:50:46] In 1 hour(s) and 9 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T1200) [10:51:48] !log switching purged in codfw and ulsfo to use main-eqiad kafka cluster - T373189 [10:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:51] T373189: Establish a proper process for repacing kafka nodes - https://phabricator.wikimedia.org/T373189 [10:53:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071842 (https://phabricator.wikimedia.org/T374430) (owner: 10GergesShamon) [10:54:06] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10133285 (10MoritzMuehlenhoff) [11:01:35] (03PS1) 10Vgutierrez: hiera: let purged use closest cluster on codfw, ulsfo and eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1071844 (https://phabricator.wikimedia.org/T363210) [11:02:16] (03CR) 10Vgutierrez: [C:04-2] "do not merge till T363210 is done" [puppet] - 10https://gerrit.wikimedia.org/r/1071844 (https://phabricator.wikimedia.org/T363210) (owner: 
10Vgutierrez) [11:04:00] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2123.codfw.wmnet with reason: Maintenance [11:04:02] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2123.codfw.wmnet with reason: Maintenance [11:04:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2123 (T370903)', diff saved to https://phabricator.wikimedia.org/P68766 and previous config saved to /var/cache/conftool/dbconfig/20240910-110409-ladsgroup.json [11:04:14] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [11:05:54] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance [11:06:08] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance [11:06:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T371742)', diff saved to https://phabricator.wikimedia.org/P68767 and previous config saved to /var/cache/conftool/dbconfig/20240910-110614-ladsgroup.json [11:06:18] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [11:14:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:18:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T370903)', diff saved to https://phabricator.wikimedia.org/P68768 and previous config saved to /var/cache/conftool/dbconfig/20240910-111835-ladsgroup.json [11:18:40] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [11:21:10] (03PS4) 
10Slyngshede: PermissionRequest validation. [software/bitu] - 10https://gerrit.wikimedia.org/r/1060812 [11:22:10] (03Abandoned) 10David Caro: spicerack: allow running by non-ops [puppet] - 10https://gerrit.wikimedia.org/r/1067301 (owner: 10David Caro) [11:26:22] (03PS1) 10Slyngshede: Audit log for permission requests validation. [software/bitu] - 10https://gerrit.wikimedia.org/r/1071849 [11:31:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:33:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P68769 and previous config saved to /var/cache/conftool/dbconfig/20240910-113342-ladsgroup.json [11:38:06] 06SRE, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Grant Access to logstash for Jeremyb - https://phabricator.wikimedia.org/T374406#10133386 (10Ladsgroup) It first needs a sponsor from a wmf staff. FWIW, This access is quite sensitive. 
[11:44:48] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:45:14] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:46:40] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:47:02] (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: switch prometheus-https service to production [puppet] - 10https://gerrit.wikimedia.org/r/1071815 (https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi) [11:47:04] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:48:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P68770 and previous config saved to /var/cache/conftool/dbconfig/20240910-114850-ladsgroup.json [11:51:32] (03CR) 10Alexandros Kosiaris: [C:03+2] tests: Bump various tests from php7.2 to php7.4 [puppet] - 10https://gerrit.wikimedia.org/r/1070996 (owner: 10Alexandros Kosiaris) [11:52:00] (03CR) 10Alexandros Kosiaris: [C:03+2] mediawiki::maintenance: Remove php72 remnant [puppet] - 10https://gerrit.wikimedia.org/r/1070995 (owner: 10Alexandros Kosiaris) [11:52:01] (03PS4) 10Arnaudb: mariadb: pc1017 pc2017 back to normal [puppet] - 10https://gerrit.wikimedia.org/r/1071750 (https://phabricator.wikimedia.org/T374355) [11:52:17] (03CR) 10Alexandros Kosiaris: [C:03+2] mediawiki-image-download: Drop to 25% [puppet] - 10https://gerrit.wikimedia.org/r/1070549 (https://phabricator.wikimedia.org/T366778) (owner: 10Alexandros Kosiaris) [11:56:13] (03CR) 10Muehlenhoff: [C:03+2] thumbor: Bump image to latest package versions in 
bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071803 (owner: 10Muehlenhoff) [11:57:54] (03CR) 10Filippo Giunchedi: [C:03+2] trafficserver: use prometheus svc records for eqiad/codfw [puppet] - 10https://gerrit.wikimedia.org/r/1071816 (https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi) [11:57:57] FIRING: ProbeDown: Service prometheus-https:443 has failed probes (http_prometheus-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#prometheus-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:58:09] uh [11:58:11] (03PS1) 10Cathal Mooney: Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) [11:58:16] ugh, that's me [11:58:16] godog: ^ [11:58:18] !incidents [11:58:18] 5149 (ACKED) Host cr2-magru - PING - Packet loss = 100% [11:58:18] :D [11:58:18] 5151 (UNACKED) ProbeDown sre (10.2.2.25 ip4 prometheus-https:443 probes/service http_prometheus-https_ip4 eqiad) [11:58:21] my apologies [11:58:25] !ack 5149 [11:58:25] 5149 (ACKED) Host cr2-magru - PING - Packet loss = 100% [11:58:28] !ack 5151 [11:58:29] 5151 (ACKED) ProbeDown sre (10.2.2.25 ip4 prometheus-https:443 probes/service http_prometheus-https_ip4 eqiad) [11:58:33] lol [11:58:36] (03CR) 10CI reject: [V:04-1] Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [11:58:39] jayme: :P [11:58:39] I'm checking though it is all good in terms of availabily [11:58:42] availability [11:58:43] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [11:58:51] (03PS2) 10Cathal Mooney: Bird::anycast - allow BFD connections from router link-local IP [puppet] - 
10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) [11:58:54] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [11:59:14] (03CR) 10CI reject: [V:04-1] Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T1200) [12:00:27] (03PS5) 10Arnaudb: mariadb: pc1017 pc2017 back to normal [puppet] - 10https://gerrit.wikimedia.org/r/1071750 (https://phabricator.wikimedia.org/T374355) [12:00:27] (03CR) 10Arnaudb: "This should cover everything we discussed" [puppet] - 10https://gerrit.wikimedia.org/r/1071750 (https://phabricator.wikimedia.org/T374355) (owner: 10Arnaudb) [12:01:12] (03PS3) 10Cathal Mooney: Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) [12:01:12] (03PS1) 10Filippo Giunchedi: service::catalog: don't page for prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/1071860 [12:01:27] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] service::catalog: don't page for prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/1071860 (owner: 10Filippo Giunchedi) [12:01:46] (03CR) 10CI reject: [V:04-1] Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [12:02:18] FIRING: NELByCountryHigh: Elevated Network Error Logging events (tcp.address_unreachable from GB) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [12:03:17] FIRING: NELHigh: Elevated Network Error 
Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [12:03:41] !incidents [12:03:42] 5149 (ACKED) Host cr2-magru - PING - Packet loss = 100% [12:03:42] 5151 (ACKED) ProbeDown sre (10.2.2.25 ip4 prometheus-https:443 probes/service http_prometheus-https_ip4 eqiad) [12:03:42] 5152 (UNACKED) NELHigh sre (thanos-rule tcp.address_unreachable) [12:03:46] !ack 5152 [12:03:47] 5152 (ACKED) NELHigh sre (thanos-rule tcp.address_unreachable) [12:03:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T370903)', diff saved to https://phabricator.wikimedia.org/P68771 and previous config saved to /var/cache/conftool/dbconfig/20240910-120357-ladsgroup.json [12:04:00] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [12:05:16] seems OVH in GB has issues [12:05:41] (03CR) 10Arnaudb: mariadb: pc1017 pc2017 back to normal (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1071750 (https://phabricator.wikimedia.org/T374355) (owner: 10Arnaudb) [12:07:33] jayme: not only OVH.. 
see _security [12:07:53] <_Gerges> jouncebot: next [12:07:53] In 0 hour(s) and 52 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T1300) [12:07:57] RESOLVED: ProbeDown: Service prometheus-https:443 has failed probes (http_prometheus-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#prometheus-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:09:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071711 (owner: 10Bartosz Dziewoński) [12:09:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071728 (owner: 10Bartosz Dziewoński) [12:11:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T371742)', diff saved to https://phabricator.wikimedia.org/P68772 and previous config saved to /var/cache/conftool/dbconfig/20240910-121122-ladsgroup.json [12:11:26] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [12:12:18] RESOLVED: NELByCountryHigh: Elevated Network Error Logging events (tcp.address_unreachable from GB) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [12:13:16] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [12:13:17] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.address_unreachable) #page - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [12:13:53] (03PS1) 10Ladsgroup: tables catalog: Add first batch of extension tables [puppet] - 10https://gerrit.wikimedia.org/r/1071862 (https://phabricator.wikimedia.org/T363581) [12:14:44] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [12:18:17] (03PS1) 10Filippo Giunchedi: service::catalog: send host for prometheus-https probes [puppet] - 10https://gerrit.wikimedia.org/r/1071863 [12:18:44] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [12:18:56] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [12:19:28] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [12:19:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) (owner: 10C. 
Scott Ananian) [12:20:02] (03CR) 10Filippo Giunchedi: [C:03+2] service::catalog: send host for prometheus-https probes [puppet] - 10https://gerrit.wikimedia.org/r/1071863 (owner: 10Filippo Giunchedi) [12:23:41] !log jayme@cumin1002 START - Cookbook sre.hosts.remove-downtime for kafka-main2007.codfw.wmnet [12:23:41] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2007.codfw.wmnet [12:24:49] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [12:25:33] (03CR) 10Muehlenhoff: Bird::anycast - allow BFD connections from router link-local IP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [12:26:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: (2) new singlemode fiber patches from dmarc to routers for IX ports - https://phabricator.wikimedia.org/T373376#10133598 (10cmooney) To confirm the links look good, interfaces come up when enable and rx light is good: ` cmooney@re1.c... [12:26:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P68773 and previous config saved to /var/cache/conftool/dbconfig/20240910-122629-ladsgroup.json [12:35:41] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, let's enable this for idm-test for some live testing next." [software/bitu] - 10https://gerrit.wikimedia.org/r/1058112 (owner: 10Slyngshede) [12:36:09] MatmaRex: I'm going to try for the MOS namespace change again during the next backport window. Looks like you've got a bunch of patches queued up as well [12:37:01] cscott: yeah. 
most of them are just cleanup and can be bumped [12:39:12] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Move puppet-merge (bash script) to puppetserver1001 - https://phabricator.wikimedia.org/T374443 (10elukey) 03NEW [12:39:28] (03PS6) 10Elukey: role::puppetserver: add TLS+HTTP stack to publish SHA1 values [puppet] - 10https://gerrit.wikimedia.org/r/1071834 (https://phabricator.wikimedia.org/T374443) [12:41:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P68774 and previous config saved to /var/cache/conftool/dbconfig/20240910-124136-ladsgroup.json [12:43:44] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [12:43:55] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2127.codfw.wmnet [12:44:09] (03PS7) 10Elukey: role::puppetserver: add TLS+HTTP stack to publish SHA1 values [puppet] - 10https://gerrit.wikimedia.org/r/1071834 (https://phabricator.wikimedia.org/T374443) [12:44:10] (03PS1) 10Elukey: role::puppetserver: fix git_dir for conftool::master [puppet] - 10https://gerrit.wikimedia.org/r/1071867 (https://phabricator.wikimedia.org/T374443) [12:44:48] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move puppet-merge (bash script) to puppetserver1001 - https://phabricator.wikimedia.org/T374443#10133680 (10elukey) p:05Triage→03High [12:44:58] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move puppet-merge (bash script) to puppetserver1001 - https://phabricator.wikimedia.org/T374443#10133681 (10elukey) a:03elukey [12:45:11] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3938/co" [puppet] - 10https://gerrit.wikimedia.org/r/1071867 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [12:45:45] !log dropping bv2013_edits table everywhere [12:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:16] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move puppet-merge (bash script) to puppetserver1001 - https://phabricator.wikimedia.org/T374443#10133704 (10elukey) [12:48:43] (03PS2) 10Ladsgroup: tables catalog: Add first batch of extension tables [puppet] - 10https://gerrit.wikimedia.org/r/1071862 (https://phabricator.wikimedia.org/T363581) [12:48:55] (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables catalog: Add first batch of extension tables [puppet] - 10https://gerrit.wikimedia.org/r/1071862 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [12:50:48] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2127.codfw.wmnet [12:51:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [12:53:37] <_Gerges> jouncebot: next [12:53:38] In 0 hour(s) and 6 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T1300) [12:53:40] PROBLEM - MariaDB Replica Lag: s3 on db2205 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 563.38 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:54:36] normal ↑ [12:55:28] PROBLEM - Hadoop NodeManager on an-worker1144 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:55:44] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [12:55:49] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: provisionning db2127.codfw.wmnet - T373579 [12:55:52] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579 [12:56:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: provisionning db2127.codfw.wmnet - T373579 [12:56:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2227.codfw.wmnet with reason: provisionning db2127.codfw.wmnet - T373579 [12:56:09] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2227.codfw.wmnet with reason: provisionning db2127.codfw.wmnet - T373579 [12:56:17] (03PS1) 10David Caro: codfw1dev,cloud: replace cloudinfra-db-01 by 02 as it was replaced [puppet] - 10https://gerrit.wikimedia.org/r/1071870 [12:56:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T371742)', diff saved to https://phabricator.wikimedia.org/P68776 and previous config saved to /var/cache/conftool/dbconfig/20240910-125643-ladsgroup.json [12:56:44] (03CR) 10Muehlenhoff: "Looks good, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1071834 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [12:56:46] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1184.eqiad.wmnet with reason: Maintenance [12:56:47] T371742: Change 
page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [12:56:58] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1184.eqiad.wmnet with reason: Maintenance [12:57:02] (03PS1) 10Ottomata: refine - bump to refinery version 0.2.49 [puppet] - 10https://gerrit.wikimedia.org/r/1071871 (https://phabricator.wikimedia.org/T356762) [12:57:05] downtiming, sorry for the noise [12:57:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1184 (T371742)', diff saved to https://phabricator.wikimedia.org/P68777 and previous config saved to /var/cache/conftool/dbconfig/20240910-125705-ladsgroup.json [12:58:23] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2205.codfw.wmnet with reason: maintenance [12:58:23] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1071867 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [12:58:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2205.codfw.wmnet with reason: maintenance [12:58:41] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2227.codfw.wmnet onto db2127.codfw.wmnet [12:59:51] (03PS7) 10Slyngshede: Permission approval/rejection [software/bitu] - 10https://gerrit.wikimedia.org/r/1058112 [13:00:04] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T1300). [13:00:05] MatmaRex, jan_drewniak, _Gerges, and cscott: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:33] you can do me last [13:00:38] <_Gerges> Here [13:01:00] o/ [13:01:52] <_Gerges> Sorry I won't share [13:02:23] I can deploy, I guess [13:02:35] though I’m not sure if I feel brave enough for the enwiki MOS patch tbh 😅 [13:02:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/QuickSurveys] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071708 (https://phabricator.wikimedia.org/T373039) (owner: 10Jdlrobson) [13:03:03] let’s start with jan_drewniak [13:03:21] 👍 [13:04:14] Lucas_WMDE: i'm here. I think MatmaRex volunteered to assist w/ advice on the maintenance script side of the MOS patch? [13:04:28] RECOVERY - Hadoop NodeManager on an-worker1144 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:04:53] Lucas_WMDE: also I'm not touching enwiki yet, the first patch is deliberately /just/ very small wikis w/ ~50 pages with the MOS prefix. Hopefully if things explode, they will do so on a small scale before we attempt enwiki. [13:04:59] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for Philippe Saade - https://phabricator.wikimedia.org/T374008#10133804 (10Linda-Rabea.Heyden_WMDE) Hi, I hereby approve the request from WMDE side! [13:05:44] (03CR) 10Lucas Werkmeister (WMDE): "Can you optimize the new SVGs? 
There’s no need to send all those extra bytes with Inkscape settings to [hundreds of millions](https://stat" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071842 (https://phabricator.wikimedia.org/T374430) (owner: 10GergesShamon) [13:05:57] cscott: I see [13:06:00] * Lucas_WMDE looks more closely [13:06:07] (03Merged) 10jenkins-bot: Support new heading layout [extensions/QuickSurveys] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071708 (https://phabricator.wikimedia.org/T373039) (owner: 10Jdlrobson) [13:06:27] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1071708|Support new heading layout (T373039 T374377)]] [13:06:31] haha, you wrote that in double asterisks for anyone who would bother to look at the commit message :D [13:06:32] T373039: Set up quicksurveys for UI and non-UI experiments - https://phabricator.wikimedia.org/T373039 [13:06:32] T374377: Regression: QuickSurveys inject themselves into the heading - https://phabricator.wikimedia.org/T374377 [13:06:40] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1071870 (owner: 10David Caro) [13:06:41] and then had to explain it for me anyways because I had only looked at the deployment calendar so far ^^ [13:07:12] jouncebot: next [13:07:13] In 1 hour(s) and 52 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T1500) [13:07:20] ok, so we have a bit of time for those maint scripts [13:07:22] Lucas_WMDE: no worries. https://phabricator.wikimedia.org/T363538#10131953 has the maintenance script commands that would need to be run. 
[13:08:32] PROBLEM - Kafka broker TLS certificate validity on kafka-jumbo1010 is CRITICAL: SSL CRITICAL - Certificate kafka-jumbo1010.eqiad.wmnet valid until 2024-09-17 13:08:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [13:09:14] !log rolling restart of {pdns-recursor,haproxy}.service on A:dnsbox [13:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:10] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, jdlrobson: Backport for [[gerrit:1071708|Support new heading layout (T373039 T374377)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:10:10] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move puppet-merge (bash script) to puppetserver1001 - https://phabricator.wikimedia.org/T374443#10133837 (10MoritzMuehlenhoff) I'll do a complete audit of all uses of the ca_server setting. For Puppet 7 end points the Puppe... [13:10:11] cscott: apologies if this was already discussed, but is there a reason why “talk” is only translated in some of the namespace names? [13:10:20] e.g. MOS_yɛltɔɣa on dagwiki but MOS_talk on aswiki [13:11:32] jan_drewniak / Jdlrobson: can you test the Quicksurveys change on mwdebug? [13:12:14] (03CR) 10Slyngshede: Permission approval/rejection (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1058112 (owner: 10Slyngshede) [13:13:36] * jan_drewniak Lucas_WMDE: Ah good point, we don't have any quickSurveys running on production at the moment, but we did verify the change on beta [13:13:48] ok, should I just roll it out then? [13:13:57] on some wikis there seemed to be precedent for handling "global/project" (read, english) namespaces that way (cf zhwiki), on others (aswiki, bnwiki, thwiki, slwiki) it wasn't clear to me as a non-native speaker how to properly localize. 
My understanding is that (a) there aren't actually any Talk:MOS pages on those wikis (the MOS: pages were imported from enwiki but the discussions weren't), and (b) the important thing is to hold the [13:13:57] namespace number in the title DB, we can always rename the namespace (with an appropriate alias) later w/o having to rerun maintenance scripts. [13:14:00] let's do it! [13:14:02] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, jdlrobson: Continuing with sync [13:14:15] MatmaRex: double-checking ^ does this sound correct to you [13:14:43] cscott: ack (I’m not sure about the “w/o having to rerun maintenance scripts” but AFAIK we have maintenance scripts for it so I agree it’s doable to rename later if needed) [13:15:28] then I’ll go ahead with MOS after the Quicksurveys backport finishes [13:15:43] (03PS1) 10Elukey: aux-services: update oauth2 image for Jaeger [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071872 (https://phabricator.wikimedia.org/T369491) [13:15:44] FIRING: [3x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [13:16:21] cscott: yeah, mostly correct, you'd need another script run to clean up possible conflicts after adding a namespace alias later. that's no big deal though [13:17:23] (03CR) 10Ladsgroup: [C:03+1] "Please downtime the hosts before merging this, just in case." 
[puppet] - 10https://gerrit.wikimedia.org/r/1071750 (https://phabricator.wikimedia.org/T374355) (owner: 10Arnaudb) [13:18:04] (03PS2) 10Elukey: aux-services: update Docker images for Jaeger [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071872 (https://phabricator.wikimedia.org/T369491) [13:18:24] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move puppet-merge (bash script) to puppetserver1001 - https://phabricator.wikimedia.org/T374443#10133869 (10Volans) In the optic of a cookbook to replace puppet merge I'd try to use https://doc.wikimedia.org/spicerack/master/... [13:18:39] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1071708|Support new heading layout (T373039 T374377)]] (duration: 12m 11s) [13:18:43] T373039: Set up quicksurveys for UI and non-UI experiments - https://phabricator.wikimedia.org/T373039 [13:18:43] T374377: Regression: QuickSurveys inject themselves into the heading - https://phabricator.wikimedia.org/T374377 [13:19:20] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "(For the record, we briefly discussed this change [in IRC](https://wm-bot.wmcloud.org/browser/index.php?start=09%2F10%2F2024&end=09%2F10%2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) (owner: 10C. Scott Ananian) [13:19:29] alright, let’s go [13:19:35] (03PS7) 10C. Scott Ananian: Elevate pseudo-namespace MOS to a real namespace on most wikis which use it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) [13:20:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) (owner: 10C. 
Scott Ananian) [13:20:57] (03Merged) 10jenkins-bot: Elevate pseudo-namespace MOS to a real namespace on most wikis which use it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) (owner: 10C. Scott Ananian) [13:21:11] I guess the change will be testable on mwdebug by looking at e.g. https://sl.wikipedia.org/w/index.php?title=MOS:T363538&action=info [13:21:11] (03CR) 10Elukey: [V:03+1 C:03+2] role::puppetserver: fix git_dir for conftool::master [puppet] - 10https://gerrit.wikimedia.org/r/1071867 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [13:21:12] T363538: Deal with Manual of Style pseudo-namespaces conflicting with Mooré Wikipedia - https://phabricator.wikimedia.org/T363538 [13:21:17] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1070975|Elevate pseudo-namespace MOS to a real namespace on most wikis which use it (T363538)]] [13:21:19] and seeing which namespace it reporst for this (nonexistent) page [13:21:22] *reports [13:22:46] (03CR) 10Arnaudb: "sure!" [puppet] - 10https://gerrit.wikimedia.org/r/1071750 (https://phabricator.wikimedia.org/T374355) (owner: 10Arnaudb) [13:23:44] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: T374355 [13:23:47] T374355: Productionize pc(1|2)017 - https://phabricator.wikimedia.org/T374355 [13:23:58] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: T374355 [13:24:33] Lucas_WMDE: that sounds right. I can also create a user page like https://en.wikipedia.org/wiki/User:Cscott/T363538 to verify that the MOS links on wiki don't break. 
[13:24:35] (03CR) 10Arnaudb: [C:03+2] mariadb: pc1017 pc2017 back to normal [puppet] - 10https://gerrit.wikimedia.org/r/1071750 (https://phabricator.wikimedia.org/T374355) (owner: 10Arnaudb) [13:26:18] ugh, “Check 'check_testservers_baremetal' failed” [13:26:21] (03CR) 10Btullis: statistics hosts: enable CPUWeight (cgroupsv2) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking) [13:26:24] Status code: expected 301, got 503. [13:26:30] from mwdebug2001 [13:26:37] (03PS8) 10Elukey: role::puppetserver: add TLS+HTTP stack to publish SHA1 values [puppet] - 10https://gerrit.wikimedia.org/r/1071834 (https://phabricator.wikimedia.org/T374443) [13:26:37] * Lucas_WMDE looks at logstash [13:26:39] (03CR) 10Elukey: role::puppetserver: add TLS+HTTP stack to publish SHA1 values (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071834 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [13:27:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10133906 (10VRiley-WMF) ganeti1039 B2 U4 CableID 4893 Port 3 ganeti1040 B2 U15 CableID 5005 Port 29 ganeti1041 B2 U16 CableID 20220202 Port 28 ganeti1042... 
[13:27:47] not seeing anything in logstash… retrying checks in scap [13:27:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10133912 (10VRiley-WMF) [13:28:00] cscott: good idea, but probably also on a non-english wiki ^^ [13:28:06] Lucas_WMDE: https://as.wikipedia.org/wiki/%E0%A6%B8%E0%A6%A6%E0%A6%B8%E0%A7%8D%E0%A6%AF:Cscott/T363538 [13:28:07] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, cscott: Backport for [[gerrit:1070975|Elevate pseudo-namespace MOS to a real namespace on most wikis which use it (T363538)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:28:10] T363538: Deal with Manual of Style pseudo-namespaces conflicting with Mooré Wikipedia - https://phabricator.wikimedia.org/T363538 [13:28:15] ok, now it worked and the change is on mwdebug [13:28:48] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10133916 (10cmooney) >>! In T373097#10129063, @Jelto wrote: > I depooled `gitlab-runner2003` for tomorrows maintenance Thanks!... [13:29:17] Lucas_WMDE: https://as.wikipedia.org/wiki/MOS:CAPTIONS is a broken link/empty page now, which I think is "as expected" until you run the namespaceDupes script to move the existing page [13:29:28] So "seems to be working" because "everything is broken"? 
;) [13:29:41] *thinks* [13:29:46] yeah I think you’re right [13:29:53] the fact that it becomes a redlink is correct [13:29:53] ish [13:30:10] yeah, it doesn't become actually red because we don't rerun refreshlinks job [13:30:11] ok let’s keep going then [13:30:11] (03PS1) 10DCausse: cirrus-streaming-updater: bump to v20240910132552-fa373fd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071875 [13:30:23] but this matches what I saw when I was testing on my local wiki [13:30:25] cscott: I purged it and it became red, so I was confused by that for a bit [13:30:38] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, cscott: Continuing with sync [13:30:59] (https://sl.wikipedia.org/w/index.php?title=MOS:T363538&action=info correctly shows ns126 btw) [13:31:01] oh, then maybe it won't become un-red when you run namespaceDupes and you'll have to purge that page again to bluelink it [13:31:08] could be [13:31:37] * Lucas_WMDE prepares tmux session named “T363538” on mwmaint1002 [13:33:39] * Lucas_WMDE is also mildly annoyed that one of the scripts dry-runs by default and one has a --dry-run option :D [13:34:12] yes i also noticed that. which do you prefer, i could patch them to be the same. 
:) [13:34:46] (03PS1) 10Clément Goubert: mediawiki: Move job spec for reuse [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071864 [13:35:00] eh, I don’t think I dislike it enough to change the behavior and risk breaking all sorts of things relying on the old stuff [13:35:02] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1070975|Elevate pseudo-namespace MOS to a real namespace on most wikis which use it (T363538)]] (duration: 13m 45s) [13:35:06] T363538: Deal with Manual of Style pseudo-namespaces conflicting with Mooré Wikipedia - https://phabricator.wikimedia.org/T363538 [13:35:07] !log jebe@deploy1003 Started deploy [analytics/refinery@464c114]: Regular analytics weekly train [analytics/refinery@464c114d] [13:35:11] I guess I wouldn’t mind a --dry-run no-op option on namespaceDupes ^^ [13:35:19] (and if you --dry-run --fix together it yells at you) [13:35:23] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1071834 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [13:35:27] scap is done! maint script time [13:35:50] lol, it prints “Oh noeees” [13:35:52] start with aswiki and i'll verify that my https://as.wikipedia.org/wiki/%E0%A6%B8%E0%A6%A6%E0%A6%B8%E0%A7%8D%E0%A6%AF:Cscott/T363538 page gets fixed [13:35:57] guessing that refers to [13:35:57] > id=21247 ns=0 dbk=MOS: *** invalid title and --add-prefix not specified [13:36:32] (03CR) 10Elukey: [C:03+2] role::puppetserver: add TLS+HTTP stack to publish SHA1 values [puppet] - 10https://gerrit.wikimedia.org/r/1071834 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [13:36:35] cscott: do you mind if I make the suffix /T363538 btw? [13:36:35] oh yeah [[MOS:]] needs to be manually fixed up, folks on phab thought that was probably generated by a broken template. 
[13:36:45] IMHO that would be nicer for people to see in the logs [13:36:47] that seems fine [13:36:49] ok [13:37:00] i'll change that in the phab comment so we use the same suffix later when we tackle enwiki [13:37:57] ok! [13:38:10] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes.php aswiki --source-pseudo-namespace MOS --dest-namespace 126 --move-talk --fix | tee T363538-aswiki-namespaceDupes # crashed, DBQueryError [13:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:34] aha, MOS:_HEAD -> MOS:HEAD presumably conflicting with MOS:HEAD? [13:39:03] https://as.wikipedia.org/wiki/%E0%A6%B8%E0%A6%A6%E0%A6%B8%E0%A7%8D%E0%A6%AF:Cscott/T363538 still shows a bluelink so some of the pages moves clearly went through [13:39:39] underscore is not a valid title prefix character i guess? [13:40:05] yeah :/ [13:40:37] doesn't seem like a DBQueryError is an appropriate outcome in any case, I'd be fine with the script just skipping problematic titles for manual fixup. [13:40:41] I guess that needs --add-prefix [13:41:11] shall I retry with --add-prefix=T363538/? [13:41:11] T363538: Deal with Manual of Style pseudo-namespaces conflicting with Mooré Wikipedia - https://phabricator.wikimedia.org/T363538 [13:41:24] maybe I should just have `--add-prefix /T363538` to all the commands, instead of waiting until they fail and then adding it? [13:41:35] but yeah, retry with --add-prefix for aswiki [13:41:50] btw MatmaRex and _Gerges I think it’s basically certain we won’t get to anything else in this window [13:41:54] sorry that wasn't the right prefix though [13:41:58] just in case you’re still waiting [13:42:14] i'm just following along [13:42:16] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes.php aswiki --source-pseudo-namespace MOS --dest-namespace 126 --move-talk --add-prefix=T363538/ --fix | tee T363538-aswiki-namespaceDupes-prefix [13:42:18] Lucas_WMDE: is that prefix or suffix? 
[13:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:26] Lucas_WMDE: ah, you fixed it [13:42:28] prefix, because I thought suffix wouldn’t fix the _ issue [13:42:29] !log jebe@deploy1003 Finished deploy [analytics/refinery@464c114]: Regular analytics weekly train [analytics/refinery@464c114d] (duration: 07m 22s) [13:42:38] but maybe it would also have worked and just turned _HEAD_suffix into HEAD_suffix [13:42:42] anyway it’s done already [13:42:44] <_Gerges> Here [13:42:45] i think he used suffix first, and then MOS:_Head broke because the problem was with the _ prefix. [13:42:57] !log jebe@deploy1003 Started deploy [analytics/refinery@464c114] (thin): Regular analytics weekly train THIN [analytics/refinery@464c114d] [13:43:24] _Gerges: you don’t need to wait, I’ll be busy with these MOS namespace things for the rest of the window. please reschedule, sorry [13:43:32] <_Gerges> I will work on improving the svg files. [13:43:56] <_Gerges> ok no problem [13:43:56] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript cleanupTitles aswiki | tee T363538-aswiki-cleanupTitles [13:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:00] _Gerges: okay, thanks! 
[13:44:22] alright, I *think* aswiki is all done, I’ll see if I can get the outputs all into Phabricator pastes [13:45:01] hm, might as well try out this phaste command and see what it does [13:45:45] FIRING: [3x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [13:45:54] ok, phaste with multiple files just pastes them together, that’s not what I want (P68779) [13:46:44] I'm poking around on aswiki, and MOS:HEAD works and so does MOS:_HEAD (presumably because we strip the space during title construction) so the fact that the old MOS:_HEAD ended up with a prefix seems harmless. [13:47:25] FIRING: SystemdUnitFailed: envoyproxy.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:47:28] https://as.wikipedia.org/w/index.php?title=MOS:T363538/_HEAD&redirect=no was just a redirect. But adding the prefix was effective to disambiguate. 
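A note on the MOS:HEAD / MOS:_HEAD observation above: the collision can be illustrated with a small Python sketch. It assumes a simplified model of title normalization (underscores treated as spaces, whitespace collapsed and trimmed at both ends) and is not MediaWiki's actual code; `normalize_title_text` and `split_title` are hypothetical names used only for illustration.

```python
import re


def normalize_title_text(text: str) -> str:
    """Simplified, assumed normalization: underscores count as spaces,
    runs of whitespace collapse to one space, and both ends are trimmed."""
    text = text.replace("_", " ")
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def split_title(raw: str, namespaces: set) -> tuple:
    """Split 'NS:Page' into (namespace, page) when NS is a known namespace;
    otherwise treat the whole string as a main-namespace title."""
    prefix, sep, rest = raw.partition(":")
    if sep and normalize_title_text(prefix) in namespaces:
        return normalize_title_text(prefix), normalize_title_text(rest)
    return "", normalize_title_text(raw)


# While MOS is only a pseudo-namespace, the two titles stay distinct.
print(split_title("MOS:HEAD", set()), split_title("MOS:_HEAD", set()))

# Once MOS is a real namespace, both resolve to the same page name,
# which is the kind of conflict the prefix was used to disambiguate.
print(split_title("MOS:HEAD", {"MOS"}), split_title("MOS:_HEAD", {"MOS"}))
```

Under this model the leading underscore only becomes a problem after the namespace exists, because it then sits at the start of the page name and is trimmed away.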
[13:47:28] T363538: Deal with Manual of Style pseudo-namespaces conflicting with Mooré Wikipedia - https://phabricator.wikimedia.org/T363538 [13:47:35] (03PS7) 10DCausse: wdqs: common module and profile should not define categories_endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1070956 (https://phabricator.wikimedia.org/T374009) [13:47:35] (03PS6) 10DCausse: wdqs: do not add categories on main and scholarly endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1070958 (https://phabricator.wikimedia.org/T374009) [13:47:35] (03PS1) 10DCausse: wdqs: fix CATEGORY_ENDPOINT env var [puppet] - 10https://gerrit.wikimedia.org/r/1071877 (https://phabricator.wikimedia.org/T374016) [13:47:55] !log jebe@deploy1003 Finished deploy [analytics/refinery@464c114] (thin): Regular analytics weekly train THIN [analytics/refinery@464c114d] (duration: 04m 58s) [13:47:57] sounds good [13:48:01] let’s try bnwiki then [13:48:14] !log jebe@deploy1003 Started deploy [analytics/refinery@464c114] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@464c114d] [13:48:21] bnwiki also has a MOS: page [13:48:26] when you get the output for each wiki pasted, i'll make a post on the appropriate village pump asking the community to audit the 'broken' files and make any cleanups required. [13:48:37] cscott: do you prefer prefix or suffix? I could try suffix now for a change [13:49:14] I feel like prefix is slightly nicer, you can Special:PrefixIndex it [13:49:18] try suffix, there are only 24 pages on bnwiki it is worth figuring out if there are reasons to prefer one or the other before we get to the big ones [13:49:25] sounds good [13:49:37] dry-run for cleanupTitles is taking a moment btw, quite a few pages in total on this wiki [13:49:46] okay, it went through [13:50:10] and should I run with suffix right away or try it without first? 
[13:50:22] I’m not sure if --add-suffix only affects titles that would otherwise crash or all titles [13:50:24] !log sudo cumin "A:cp" 'disable-puppet "merging CRs 1065283"' [13:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:34] I don't know if suffix would have worked for MOS:_HEAD because the underscore would still be invalid. But maybe it would have become MOS:HEAD/T363538 and not conflicted and that would have been ok. [13:50:43] but looking at the “hyphen” in https://phabricator.wikimedia.org/P68781 I suspect it only affects the pages where the suffix is “needed” [13:50:56] cscott: that sounds likely to me yeah [13:51:03] I’ll try it with the suffix directly [13:51:04] Maybe i'll say i'm convinced by your argument about prefix search and we should stick with that. [13:51:07] (03CR) 10Ssingh: [C:03+2] puppet8: remove ssl_keystore_location, always set ssl_key_password [puppet] - 10https://gerrit.wikimedia.org/r/1065283 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [13:51:13] or suffix. 
i'm +/-0 either way [13:51:51] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes bnwiki --source-pseudo-namespace MOS --dest-namespace 126 --move-talk --add-suffix=/T363538 --fix | tee T363538-bnwiki-namespaceDupes [13:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:58] !log jebe@deploy1003 Finished deploy [analytics/refinery@464c114] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@464c114d] (duration: 03m 43s) [13:52:04] okay it looks like --add-suffix is nice and only adds the suffix if needed [13:52:35] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript cleanupTitles bnwiki | tee T363538-bnwiki-cleanupTitles [13:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:38] T363538: Deal with Manual of Style pseudo-namespaces conflicting with Mooré Wikipedia - https://phabricator.wikimedia.org/T363538 [13:52:53] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission kafka-main2002.codfw.wmnet - https://phabricator.wikimedia.org/T374451 (10JMeybohm) 03NEW [13:52:57] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission kafka-main2002.codfw.wmnet - https://phabricator.wikimedia.org/T374451#10134068 (10JMeybohm) [13:53:44] (03PS1) 10JMeybohm: Decom kafka-main2002 [puppet] - 10https://gerrit.wikimedia.org/r/1071878 (https://phabricator.wikimedia.org/T374451) [13:54:12] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071877 (https://phabricator.wikimedia.org/T374016) (owner: 10DCausse) [13:54:43] > Minor fixes to output. Add a --suffix= option to add a suffix to page titles on second-order conflict. 
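The "only adds the suffix if needed" behaviour observed above can be sketched in Python. This is an assumption inferred from the chat, not the namespaceDupes implementation: the plain title is tried first, and the prefix or suffix is applied only when that title is already taken. `resolve_destination` and its parameters are hypothetical names.

```python
def resolve_destination(page, existing, prefix="", suffix=""):
    """Pick the title a page should be moved to inside the new namespace.

    Assumed behaviour (inferred from the chat, not taken from the script):
    try the plain title first and fall back to prefix + title + suffix
    only on a conflict. Returns None if no free title is found.
    """
    if page not in existing:
        return page
    fallback = f"{prefix}{page}{suffix}"
    if fallback != page and fallback not in existing:
        return fallback
    return None


existing_titles = {"HEAD"}
# No conflict: the title is kept as-is even though a suffix was supplied.
print(resolve_destination("CAPTIONS", existing_titles, suffix="/T363538"))
# Conflict: the suffix (or prefix) is applied to find a free title.
print(resolve_destination("HEAD", existing_titles, suffix="/T363538"))
print(resolve_destination("HEAD", existing_titles, prefix="T363538/"))
```

With a prefix, the fallback titles all share a common start, which is what makes them listable with Special:PrefixIndex as mentioned earlier in the conversation.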
[13:54:44] http://mediawiki.org/wiki/Special:Code/MediaWiki/9355 [13:55:02] by bvibber 19 years ago :D [13:55:09] brooke ftw [13:55:23] so basically rename first [13:55:45] if that fails because the destination already exists (due to the new namespace having an overlap with the old): rename! [13:55:58] alright, then I’ll do dagwiki with --add-prefix next [13:56:02] and I think we are supposed to copy-paste the output of the script to the task [13:56:09] (also, lol at “100.26% done” in the bnwiki cleanupTitles output o_O) [13:56:19] hashar: yes, I dumped some Phabricator pastes in there already [13:56:25] I thought that was nicer than pasting directly into the comment [13:56:29] cleanupTitles is especially chatty [13:56:42] https://bn.wikipedia.org/wiki/%E0%A6%AC%E0%A7%8D%E0%A6%AF%E0%A6%AC%E0%A6%B9%E0%A6%BE%E0%A6%B0%E0%A6%95%E0%A6%BE%E0%A6%B0%E0%A7%80:Cscott/T363538 seems to work [13:57:25] RESOLVED: SystemdUnitFailed: envoyproxy.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:57:25] https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#Run_a_maintenance_script_on_all_wikis has some doc about namespaceDupes.php [13:57:29] but that does not document the conflicts [13:57:30] bah [13:57:34] lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes dagwiki --source-pseudo-namespace MOS --dest-namespace 126 --move-talk --add-prefix=T363538/ --fix | tee T363538-dagwiki-namespaceDupes [13:57:37] oops [13:57:38] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes dagwiki --source-pseudo-namespace MOS --dest-namespace 126 --move-talk --add-prefix=T363538/ --fix | tee T363538-dagwiki-namespaceDupes [13:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:42] T363538: Deal with Manual of Style pseudo-namespaces conflicting with Mooré 
Wikipedia - https://phabricator.wikimedia.org/T363538 [13:57:48] nice [13:58:00] !log sudo cumin -b11 "A:cp" 'run-puppet-agent --enable "merging CRs 1065283"' [13:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:11] lots of MOS in dagwiki, wow [13:58:29] hashar: https://wikitech.wikimedia.org/wiki/Adding_Namespaces#Deployment has some more documentation [13:58:49] probably those two should be merged or cross-referenced [13:58:50] hurrah for duplicate doc! [13:59:00] 06SRE, 06Infrastructure-Foundations, 10netops: Enable BFD on 'core' EBGP peerings from L3 switches to CRs - https://phabricator.wikimedia.org/T374452 (10cmooney) 03NEW p:05Triage→03Low [13:59:35] (03PS1) 10JMeybohm: Replace kafka-main2002 with kafka-main2006 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071880 (https://phabricator.wikimedia.org/T363210) [13:59:40] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript cleanupTitles dagwiki | tee T363538-dagwiki-cleanupTitles [13:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:49] “101.90% done” [13:59:55] definitely an off-by-one error somewhere in there ^^ [14:00:08] we give it 101% here at the wikimedia foundation [14:00:35] (03CR) 10David Caro: [C:03+2] codfw1dev,cloud: replace cloudinfra-db-01 by 02 as it was replaced [puppet] - 10https://gerrit.wikimedia.org/r/1071870 (owner: 10David Caro) [14:01:07] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes idwiki --source-pseudo-namespace MOS --dest-namespace 126 --move-talk --add-prefix=T363538/ --fix | tee T363538-idwiki-namespaceDupes [14:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:17] !log jayme@cumin1002 START - Cookbook sre.hosts.decommission for hosts kafka-main2002.codfw.wmnet [14:01:21] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript cleanupTitles idwiki | tee T363538-idwiki-cleanupTitles [14:01:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:37] :D [14:01:40] (03CR) 10Arnaudb: [C:03+2] mariadb: productionize db2237 [puppet] - 10https://gerrit.wikimedia.org/r/1071639 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [14:01:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T371742)', diff saved to https://phabricator.wikimedia.org/P68788 and previous config saved to /var/cache/conftool/dbconfig/20240910-140141-ladsgroup.json [14:01:49] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [14:01:50] https://dag.wikipedia.org/wiki/%C5%8Aun_su:Cscott/T363538 looks good [14:03:47] lol. “ERR-CONDUIT-CORE: File size is too large.” [14:03:52] that’s unfortunate [14:03:58] guess I’ll try to compress it? [14:05:17] (03PS2) 10JMeybohm: Replace kafka-main2002 with kafka-main2006 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071880 (https://phabricator.wikimedia.org/T363210) [14:05:42] that worked, yay [14:05:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [14:06:11] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes jawiki --source-pseudo-namespace MOS --dest-namespace 126 --move-talk --add-prefix=T363538/ --fix | tee T363538-jawiki-namespaceDupes [14:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:15] T363538: Deal with Manual of Style pseudo-namespaces conflicting with Mooré Wikipedia - https://phabricator.wikimedia.org/T363538 [14:06:23] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript cleanupTitles jawiki | tee T363538-jawiki-cleanupTitles [14:06:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:27] (03CR) 10JHathaway: [C:03+2] puppet8: account for unknown probe types [puppet] - 10https://gerrit.wikimedia.org/r/1071031 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [14:06:36] !log Depooling kubernetes2040.codfw.wmnet kubernetes2041.codfw.wmnet kubernetes2058.codfw.wmnet mw2440.codfw.wmnet mw2442.codfw.wmnet mw2443.codfw.wmnet parse2011.codfw.wmnet parse2012.codfw.wmnet parse2013.codfw.wmnet wikikube-worker2039.codfw.wmnet - T373097 [14:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:40] T373097: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097 [14:06:59] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [14:07:12] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2040.codfw.wmnet [14:07:49] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: provisionning db2237.codfw.wmnet - T373579 [14:07:50] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2040.codfw.wmnet [14:07:52] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579 [14:07:55] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2041.codfw.wmnet [14:08:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: provisionning db2237.codfw.wmnet - T373579 [14:08:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: provisionning db2237.codfw.wmnet - T373579 [14:08:19] (03PS1) 10Elukey: profile::tlsproxy::envoy: add require for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1071881 (https://phabricator.wikimedia.org/T374443) [14:08:20] !log arnaudb@cumin1002 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: provisionning db2237.codfw.wmnet - T373579 [14:08:33] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2041.codfw.wmnet [14:08:38] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2058.codfw.wmnet [14:09:08] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2227.codfw.wmnet onto db2127.codfw.wmnet [14:09:10] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2058.codfw.wmnet [14:09:15] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2440.codfw.wmnet [14:09:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2137 in db2237 for T373579', diff saved to https://phabricator.wikimedia.org/P68792 and previous config saved to /var/cache/conftool/dbconfig/20240910-140918-arnaudb.json [14:09:49] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2440.codfw.wmnet [14:09:54] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2442.codfw.wmnet [14:09:55] FIRING: [2x] SystemdUnitFailed: envoyproxy.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:10:07] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes mswiki --source-pseudo-namespace MOS --dest-namespace 126 --move-talk --add-prefix=T363538/ --fix | tee T363538-mswiki-namespaceDupes [14:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:18] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript cleanupTitles mswiki | tee T363538-mswiki-cleanupTitles [14:10:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:27] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2442.codfw.wmnet [14:10:31] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2443.codfw.wmnet [14:10:37] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-main2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002" [14:10:58] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-main2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002" [14:10:58] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:10:59] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kafka-main2002.codfw.wmnet [14:11:08] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2443.codfw.wmnet [14:11:13] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host parse2011.codfw.wmnet [14:11:19] (03CR) 10JMeybohm: [C:03+2] Decom kafka-main2002 [puppet] - 10https://gerrit.wikimedia.org/r/1071878 (https://phabricator.wikimedia.org/T374451) (owner: 10JMeybohm) [14:11:22] https://id.wikipedia.org/wiki/Pengguna:Cscott/T363538 looks good [14:11:36] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes simplewiki --source-pseudo-namespace MOS --dest-namespace 126 --move-talk --add-prefix=T363538/ --fix | tee T363538-simplewiki-namespaceDupes [14:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:40] T363538: Deal with Manual of Style pseudo-namespaces conflicting with Mooré Wikipedia - https://phabricator.wikimedia.org/T363538 [14:11:47] !log 
lucaswerkmeister-wmde@mwmaint1002:~$ mwscript cleanupTitles simplewiki | tee T363538-simplewiki-cleanupTitles [14:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:07] (03CR) 10Ebernhardson: [C:03+1] cirrus-streaming-updater: bump to v20240910132552-fa373fd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071875 (owner: 10DCausse) [14:13:08] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes slwiki --source-pseudo-namespace MOS --dest-namespace 126 --move-talk --add-prefix=T363538/ --fix | tee T363538-slwiki-namespaceDupes [14:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:20] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript cleanupTitles slwiki | tee T363538-slwiki-cleanupTitles [14:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:51] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 06serviceops, 13Patch-For-Review: decommission kafka-main2002.codfw.wmnet - https://phabricator.wikimedia.org/T374451#10134209 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jayme@cumin1002 for hosts: `kafka-main2002.codfw.wmnet` - kafk... 
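The per-wiki command pairs logged above follow a fixed pattern, so they could be generated rather than retyped. The following Python sketch only builds the command strings and does not run anything; `build_commands` is a hypothetical helper, and the flags, namespace number, and tee file names are copied from the logged commands.

```python
import shlex

TASK = "T363538"


def build_commands(wiki, dest_ns=126, pseudo_ns="MOS"):
    """Return the namespaceDupes and cleanupTitles command lines for one
    wiki, in the shape logged in this channel. Nothing is executed."""
    dupes = [
        "mwscript", "namespaceDupes", wiki,
        "--source-pseudo-namespace", pseudo_ns,
        "--dest-namespace", str(dest_ns),
        "--move-talk", f"--add-prefix={TASK}/", "--fix",
    ]
    cleanup = ["mwscript", "cleanupTitles", wiki]
    return [
        f"{shlex.join(dupes)} | tee {TASK}-{wiki}-namespaceDupes",
        f"{shlex.join(cleanup)} | tee {TASK}-{wiki}-cleanupTitles",
    ]


for wiki in ["simplewiki", "slwiki"]:
    for command in build_commands(wiki):
        print(command)
```

Printing the commands instead of executing them keeps a human in the loop, which matters here because each run was logged with !log and checked before moving on to the next wiki.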
[14:14:19] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes thwiki --source-pseudo-namespace MOS --dest-namespace 126 --move-talk --add-prefix=T363538/ --fix | tee T363538-thwiki-namespaceDupes [14:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:25] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host parse2011.codfw.wmnet [14:14:30] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host parse2012.codfw.wmnet [14:14:30] !log sudo cumin "A:cp" 'disable-puppet "merging CRs 1065286"' [14:14:32] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript cleanupTitles thwiki | tee T363538-thwiki-cleanupTitles [14:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:03] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host parse2012.codfw.wmnet [14:15:08] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host parse2013.codfw.wmnet [14:15:10] (03PS1) 10Muehlenhoff: Puppet agent: Remove obsolete manage_puppet_ca_file code [puppet] - 10https://gerrit.wikimedia.org/r/1071882 (https://phabricator.wikimedia.org/T366355) [14:15:10] (03CR) 10Elukey: [C:03+2] role::puppetserver: add TLS+HTTP stack to publish SHA1 values (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071834 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [14:15:31] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2137.codfw.wmnet onto db2237.codfw.wmnet [14:15:41] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host parse2013.codfw.wmnet [14:15:46] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2039.codfw.wmnet [14:15:58] !log lucaswerkmeister-wmde@mwmaint1002:~$ 
mwscript namespaceDupes zhwiki --source-pseudo-namespace MOS --dest-namespace 126 --move-talk --add-prefix=T363538/ --fix | tee T363538-zhwiki-namespaceDupes [14:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:05] alright, another big ’un to finish it [14:16:08] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 06serviceops, 13Patch-For-Review: decommission kafka-main2002.codfw.wmnet - https://phabricator.wikimedia.org/T374451#10134222 (10JMeybohm) [14:16:17] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript cleanupTitles zhwiki | tee T363538-zhwiki-cleanupTitles [14:16:19] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2039.codfw.wmnet [14:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:23] this one will probably also need zstd [14:16:47] what is zstd? [14:16:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P68801 and previous config saved to /var/cache/conftool/dbconfig/20240910-141649-ladsgroup.json [14:17:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 1%: post reimage && maintenance repool', diff saved to https://phabricator.wikimedia.org/P68802 and previous config saved to /var/cache/conftool/dbconfig/20240910-141705-arnaudb.json [14:17:33] compression program, I used it to compress the cleanupTitles output of jawiki and idwiki already [14:17:41] https://phabricator.wikimedia.org/T363538#10134168 and https://phabricator.wikimedia.org/T363538#10134206 [14:17:49] because uncompressed it’s too big for phabricator [14:18:00] ook, i thought maybe it was something specific to the chinese character set [14:18:05] gzip might have worked as well but I tend to just default to zstd [14:18:08] ah, no ^^ [14:18:32] zhstd, with a predefined dictionary for compressing chinese text 🤔 [14:18:49] (03CR) 10CI reject: 
[V:04-1] Puppet agent: Remove obsolete manage_puppet_ca_file code [puppet] - 10https://gerrit.wikimedia.org/r/1071882 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff) [14:19:16] it is super hard to compress chinese text, i spent a lot of time trying to create reasonable Finite-State Automata for Chinese language conversion. It's so much easier when each node only has ~26 output edges. [14:20:57] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:21:31] (03PS2) 10Elukey: profile::tlsproxy::envoy: add require for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1071881 (https://phabricator.wikimedia.org/T374443) [14:21:31] (03PS1) 10Elukey: profile::puppetserver::configmaster: use gitpuppet for config-master [puppet] - 10https://gerrit.wikimedia.org/r/1071884 (https://phabricator.wikimedia.org/T374443) [14:22:00] (03PS1) 10Arnaudb: mariadb: productionize db2238 [puppet] - 10https://gerrit.wikimedia.org/r/1071883 (https://phabricator.wikimedia.org/T373579) [14:22:01] (03CR) 10Arnaudb: "this is similar to previous patches, mostly a sanity check before running the clone logic" [puppet] - 10https://gerrit.wikimedia.org/r/1071883 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [14:22:35] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3939/co" [puppet] - 10https://gerrit.wikimedia.org/r/1071884 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [14:22:54] (03CR) 10JHathaway: [C:03+1] profile::puppetserver::configmaster: use gitpuppet for config-master [puppet] - 10https://gerrit.wikimedia.org/r/1071884 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [14:23:12] (03CR) 10JMeybohm: [C:03+2] Replace 
kafka-main2002 with kafka-main2006 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071880 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [14:23:28] fortunately the maintenance script output isn’t chinese, it’s just long because the wiki is big ^^ [14:23:45] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1071884 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [14:24:14] cscott: I think we’re all done for now then? [14:24:14] any issues with me uploading a large file or two to testwiki with mwscript while this work is going on? [14:24:28] jouncebot: nowandnext [14:24:28] No deployments scheduled for the next 0 hour(s) and 35 minute(s) [14:24:28] In 0 hour(s) and 35 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T1500) [14:24:38] hnowlan: I just finished, actually [14:24:44] cool, thanks [14:24:52] (03Merged) 10jenkins-bot: Replace kafka-main2002 with kafka-main2006 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071880 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [14:24:52] though I was wondering if I could just sling out one or two of MatmaRex’ no-op config cleanups (IIUC) [14:24:57] but you can go ahead anyway I think [14:25:04] (03CR) 10Elukey: [C:03+2] profile::tlsproxy::envoy: add require for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1071881 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [14:25:10] (03PS1) 10Muehlenhoff: Puppet agent: Remove obsolete manage_puppet_ca_file code [puppet] - 10https://gerrit.wikimedia.org/r/1071885 (https://phabricator.wikimedia.org/T366355) [14:25:11] (03CR) 10Elukey: profile::tlsproxy::envoy: add require for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1071881 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [14:25:16] (03CR) 10Elukey: [V:03+1 C:03+2] profile::puppetserver::configmaster: use gitpuppet for 
config-master [puppet] - 10https://gerrit.wikimedia.org/r/1071884 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [14:25:25] !log restoring leadership for partitions assigned to broker id 2002 on kafka-main-codfw - T363210 [14:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:29] (03CR) 10Ottomata: [C:03+2] refine - bump to refinery version 0.2.49 [puppet] - 10https://gerrit.wikimedia.org/r/1071871 (https://phabricator.wikimedia.org/T356762) (owner: 10Ottomata) [14:25:30] T363210: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210 [14:26:04] Lucas_WMDE: if you feel like it [14:26:41] although i have to step away for a moment [14:26:59] the "Remove unused…" patches should be safe to ship without me [14:27:43] (03PS4) 10Cathal Mooney: Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) [14:28:41] !log [end] rolling restart of {pdns-recursor,haproxy}.service on A:dnsbox [14:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:59] (03CR) 10Ssingh: [C:03+2] puppet8: drop explicity metaparams [puppet] - 10https://gerrit.wikimedia.org/r/1065286 (https://phabricator.wikimedia.org/T366900) (owner: 10JHathaway) [14:29:19] ottomata: ok to merge your change? [14:29:22] Lucas_WMDE: sorry for the delay, yes i think we're done. i'll do follow up on phab to communicate about the specific pages that were moved/broken/etc. [14:29:24] Ottomata: refine - bump to refinery version 0.2.49 (dba4e70dcd) [14:29:30] cscott: alright, thanks! 
[14:29:44] Lucas_WMDE: if you can postmortem any lessons learned in the phab task that would be helpful for whatever poor sap has to do enwiki in a few days :) [14:29:44] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [14:29:48] !log UTC afternoon backport+config window done [14:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:55] RESOLVED: SystemdUnitFailed: envoyproxy.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:29:59] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071885 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff) [14:30:35] (03CR) 10CI reject: [V:04-1] Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [14:30:55] FIRING: SystemdUnitFailed: envoyproxy.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:31:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P68804 and previous config saved to /var/cache/conftool/dbconfig/20240910-143156-ladsgroup.json [14:32:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 2%: post reimage && maintenance repool', diff saved to 
https://phabricator.wikimedia.org/P68805 and previous config saved to /var/cache/conftool/dbconfig/20240910-143211-arnaudb.json [14:32:26] (03PS5) 10Cathal Mooney: Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) [14:35:15] (03CR) 10CI reject: [V:04-1] Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [14:35:17] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2025.codfw.wmnet [14:35:26] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2033.codfw.wmnet [14:35:55] RESOLVED: SystemdUnitFailed: envoyproxy.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:35:55] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3940/console" [puppet] - 10https://gerrit.wikimedia.org/r/1071881 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [14:36:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:57] (03PS3) 10Elukey: profile::tlsproxy::envoy: add require for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1071881 (https://phabricator.wikimedia.org/T374443) [14:37:17] !log sudo cumin -b11 "A:cp" 'run-puppet-agent --enable "merging CRs 1065286"' [14:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:45] 
(03CR) 10JHathaway: [C:03+1] profile::tlsproxy::envoy: add require for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1071881 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [14:40:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:40:28] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Strict mode enabled by default - https://phabricator.wikimedia.org/T372664#10134340 (10ssingh) [14:41:29] (03CR) 10JMeybohm: [C:03+1] profile::tlsproxy::envoy: add require for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1071881 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [14:42:05] (03CR) 10Elukey: [C:03+2] profile::tlsproxy::envoy: add require for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1071881 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [14:43:30] (03PS1) 10Ladsgroup: conftool-data: Remove pc5 for now [puppet] - 10https://gerrit.wikimedia.org/r/1071886 (https://phabricator.wikimedia.org/T374355) [14:43:57] (03CR) 10Dreamy Jazz: ipoid: Set activeDeadlineSeconds (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071752 (https://phabricator.wikimedia.org/T374414) (owner: 10Kosta Harlan) [14:44:22] (03PS2) 10GergesShamon: [arwiki] Change the wordmark and the tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071842 (https://phabricator.wikimedia.org/T374430) [14:45:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:46:43] FIRING: 
ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [14:47:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T371742)', diff saved to https://phabricator.wikimedia.org/P68806 and previous config saved to /var/cache/conftool/dbconfig/20240910-144703-ladsgroup.json [14:47:06] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1186.eqiad.wmnet with reason: Maintenance [14:47:07] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [14:47:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 3%: post reimage && maintenance repool', diff saved to https://phabricator.wikimedia.org/P68807 and previous config saved to /var/cache/conftool/dbconfig/20240910-144716-arnaudb.json [14:47:19] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1186.eqiad.wmnet with reason: Maintenance [14:47:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1186 (T371742)', diff saved to https://phabricator.wikimedia.org/P68808 and previous config saved to /var/cache/conftool/dbconfig/20240910-144725-ladsgroup.json [14:49:58] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wcqs2001:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:50:54] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move puppet-merge (bash script) to puppetserver1001 - https://phabricator.wikimedia.org/T374443#10134421 
(10elukey) ` elukey@config-master1001:~$ curl https://puppetserver1001.eqiad.wmnet/puppet-sha1.txt 68278f7164f8b827af562... [14:51:58] FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:52:36] cscott: left a comment on the task with some tips [14:52:38] 06SRE, 10CheckUser, 06DBA, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10134427 (10Dreamy_Jazz) It appears because of the DB connection issues that no data is actu... [14:53:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062416 (https://phabricator.wikimedia.org/T370907) (owner: 10Sergio Gimeno) [14:54:43] (03PS1) 10Clément Goubert: sre.hosts.rename: Mask puppet-agent-timer [cookbooks] - 10https://gerrit.wikimedia.org/r/1071887 (https://phabricator.wikimedia.org/T374351) [14:54:57] 06SRE, 10CheckUser, 06DBA, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10134464 (10Dreamy_Jazz) [14:56:10] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops-radar, 13Patch-For-Review: Race condition on puppetdb in sre.hosts.rename cookbook - https://phabricator.wikimedia.org/T374351#10134469 (10Clement_Goubert) Sorry I didn't see the updates to the discussion before merging the previous iteration. Patch u... 
[14:56:23] Lucas_WMDE: thanks so much for your help! [14:56:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [14:56:58] RESOLVED: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:57:30] !log cmooney@cumin1002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2011.codfw.wmnet [15:00:04] eoghan, jelto, arnoldokoth, and mutante: #bothumor My software never has bugs. It just develops random features. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T1500). [15:01:13] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:01:36] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10134505 (10Jhancock.wm) hey @elukey I couldn't get into the gui either. connected a console and changed the redirect from COM1 to SOL/COM2. also this was on the console... 
[15:01:49] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab.wmfusercontent.org with reason: Phabricator/Phorge update [15:02:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab.wmfusercontent.org with reason: Phabricator/Phorge update [15:02:06] !log cmooney@cumin1002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2011.codfw.wmnet [15:02:17] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator/Phorge update [15:02:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 4%: post reimage && maintenance repool', diff saved to https://phabricator.wikimedia.org/P68809 and previous config saved to /var/cache/conftool/dbconfig/20240910-150222-arnaudb.json [15:02:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:02:30] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator/Phorge update [15:02:43] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator/Phorge update [15:02:46] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator/Phorge update [15:03:42] !log brennen@deploy1003 Started deploy [phabricator/deployment@84ada67]: deploy phab2002 for T374458 [15:03:45] T374458: Deploy Phabricator/Phorge 2024-09-10 - https://phabricator.wikimedia.org/T374458 [15:04:18] !log brennen@deploy1003 Finished deploy [phabricator/deployment@84ada67]: deploy phab2002 for T374458 (duration: 00m 36s) [15:04:25] (03PS1) 10GergesShamon: [arwiki] change 
Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071888 (https://phabricator.wikimedia.org/T374430) [15:04:58] RESOLVED: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wcqs2001:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:04:58] !log brennen@deploy1003 Started deploy [phabricator/deployment@84ada67]: deploy phab1004 for T374458 [15:05:48] !log brennen@deploy1003 Finished deploy [phabricator/deployment@84ada67]: deploy phab1004 for T374458 (duration: 00m 50s) [15:06:38] (03CR) 10Elukey: "qq to better understand - is manage_puppet_ca_file in hiera now obsoleted? Should it be removed as well?" [puppet] - 10https://gerrit.wikimedia.org/r/1071885 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff) [15:06:41] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2205.codfw.wmnet with reason: T374425 [15:06:43] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2205.codfw.wmnet with reason: T374425 [15:06:46] T374425: db2205 stuck replication/processlist - https://phabricator.wikimedia.org/T374425 [15:08:54] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns2005.wikimedia.org [reason: T373097 codfw maintenance] [15:08:57] T373097: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097 [15:09:15] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/1071887 (https://phabricator.wikimedia.org/T374351) (owner: 10Clément Goubert) [15:09:45] (03CR) 10Ebernhardson: [C:03+2] cirrus-streaming-updater: bump to v20240910132552-fa373fd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071875 
(owner: 10DCausse) [15:09:47] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:09:47] (03PS1) 10Jdrewniak: Configure QuickSurvey for Web empty search state experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071890 (https://phabricator.wikimedia.org/T373039) [15:09:58] (03CR) 10Clément Goubert: [C:03+2] sre.hosts.rename: Mask puppet-agent-timer [cookbooks] - 10https://gerrit.wikimedia.org/r/1071887 (https://phabricator.wikimedia.org/T374351) (owner: 10Clément Goubert) [15:10:00] (03PS1) 10Snwachukwu: Change New Eventschemas Git URLs [puppet] - 10https://gerrit.wikimedia.org/r/1071891 (https://phabricator.wikimedia.org/T366836) [15:10:31] (03CR) 10CI reject: [V:04-1] Configure QuickSurvey for Web empty search state experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071890 (https://phabricator.wikimedia.org/T373039) (owner: 10Jdrewniak) [15:10:47] (03Abandoned) 10Muehlenhoff: Puppet agent: Remove obsolete manage_puppet_ca_file code [puppet] - 10https://gerrit.wikimedia.org/r/1071882 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff) [15:10:52] (03Merged) 10jenkins-bot: cirrus-streaming-updater: bump to v20240910132552-fa373fd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071875 (owner: 10DCausse) [15:12:02] (03CR) 10Muehlenhoff: "It's set to false globally already. 
I haven't removed it in this patch, since there is also some code in the Puppet 5 masters frontend whi" [puppet] - 10https://gerrit.wikimedia.org/r/1071885 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff) [15:12:13] !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:12:17] !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:12:18] (03CR) 10Clément Goubert: [C:03+1] sre.switchdc.mediawiki: migrate to the class API [cookbooks] - 10https://gerrit.wikimedia.org/r/1068896 (https://phabricator.wikimedia.org/T328908) (owner: 10Scott French) [15:12:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:15:40] (03PS2) 10Jdrewniak: Configure QuickSurvey for Web empty search state experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071890 (https://phabricator.wikimedia.org/T373039) [15:15:56] (03CR) 10Vgutierrez: [C:03+1] varnish: Remove carriers netmap [puppet] - 10https://gerrit.wikimedia.org/r/1063069 (https://phabricator.wikimedia.org/T370200) (owner: 10BCornwall) [15:16:36] !log ebernhardson@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:16:43] !log ebernhardson@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:17:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 5%: post reimage && maintenance repool', diff saved to https://phabricator.wikimedia.org/P68810 and previous config saved to /var/cache/conftool/dbconfig/20240910-151729-arnaudb.json [15:17:56] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10134579 (10Jhancock.wm) 
@Clement_Goubert I think this is y'alls. correct me if I'm wrong. I see in the logs that disk 1 failed and is now foreign. The first thing the Dell troubleshooting wants to... [15:18:07] (03CR) 10Clément Goubert: "LGTM, see comment on task-id being optional." [cookbooks] - 10https://gerrit.wikimedia.org/r/1068897 (https://phabricator.wikimedia.org/T330273) (owner: 10Scott French) [15:19:18] (03CR) 10Clément Goubert: [C:03+1] sre.switchdc.mediawiki: add --task-id argument [cookbooks] - 10https://gerrit.wikimedia.org/r/1068897 (https://phabricator.wikimedia.org/T330273) (owner: 10Scott French) [15:19:42] (03CR) 10Clément Goubert: [C:03+1] sre.switchdc.mediawiki: use admin reason in puppet disable [cookbooks] - 10https://gerrit.wikimedia.org/r/1068898 (https://phabricator.wikimedia.org/T330273) (owner: 10Scott French) [15:19:44] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [15:20:21] (03CR) 10Clément Goubert: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1068899 (https://phabricator.wikimedia.org/T330273) (owner: 10Scott French) [15:22:07] (03PS1) 10Hamish: dd arbcom group to zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071895 (https://phabricator.wikimedia.org/T374455) [15:22:58] (03CR) 10CI reject: [V:04-1] dd arbcom group to zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071895 (https://phabricator.wikimedia.org/T374455) (owner: 10Hamish) [15:23:23] (03CR) 10CI reject: [V:04-1] sre.hosts.rename: Mask puppet-agent-timer [cookbooks] - 10https://gerrit.wikimedia.org/r/1071887 (https://phabricator.wikimedia.org/T374351) (owner: 10Clément Goubert) [15:24:17] (03CR) 10Arnaudb: [C:03+1] conftool-data: Remove pc5 for now [puppet] - 
10https://gerrit.wikimedia.org/r/1071886 (https://phabricator.wikimedia.org/T374355) (owner: 10Ladsgroup) [15:24:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [15:25:52] (03CR) 10Hashar: logging: Replace 'blackhole' handler with no handlers at all (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069344 (owner: 10Bartosz Dziewoński) [15:25:57] (03Abandoned) 10Hamish: dd arbcom group to zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071895 (https://phabricator.wikimedia.org/T374455) (owner: 10Hamish) [15:26:20] jouncebot: nowandnext [15:26:21] For the next 0 hour(s) and 33 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T1500) [15:26:21] In 0 hour(s) and 33 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T1600) [15:26:37] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T374407#10134636 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm alerts cleared on their own. 
[15:26:46] (03PS1) 10Ssingh: sre.dns.roll-restart-ntp: s/ntpd/ntpsec [cookbooks] - 10https://gerrit.wikimedia.org/r/1071899 [15:28:36] (03PS1) 10Jelto: gitlab: enable nftables throttling (drop) [puppet] - 10https://gerrit.wikimedia.org/r/1071900 (https://phabricator.wikimedia.org/T366882) [15:28:38] (03PS1) 10Jelto: gerrrit: enable nftables throttling (drop) [puppet] - 10https://gerrit.wikimedia.org/r/1071901 (https://phabricator.wikimedia.org/T365259) [15:29:17] (03PS1) 10Hamish: Add arbcom group to zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071902 (https://phabricator.wikimedia.org/T374455) [15:30:09] (03CR) 10CI reject: [V:04-1] Add arbcom group to zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071902 (https://phabricator.wikimedia.org/T374455) (owner: 10Hamish) [15:30:38] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3941/co" [puppet] - 10https://gerrit.wikimedia.org/r/1071900 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [15:30:52] (03CR) 10Scott French: "Thanks again for the reviews, Hugh." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1064814 (https://phabricator.wikimedia.org/T372602) (owner: 10Scott French) [15:31:24] (03CR) 10Dzahn: [C:03+1] "per meeting discussion" [puppet] - 10https://gerrit.wikimedia.org/r/1071900 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [15:31:45] (03CR) 10Scott French: [C:03+2] php8.1-cli: initial release of 8.1-based image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1064814 (https://phabricator.wikimedia.org/T372602) (owner: 10Scott French) [15:31:47] (03CR) 10Dzahn: [C:03+1] "per meeting discussion" [puppet] - 10https://gerrit.wikimedia.org/r/1071901 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [15:31:50] (03CR) 10Scott French: [V:03+2 C:03+2] php8.1-cli: initial release of 8.1-based image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1064814 (https://phabricator.wikimedia.org/T372602) (owner: 10Scott French) [15:32:27] (03CR) 10Scott French: [V:03+2 C:03+2] php8.1-fpm: initial release of 8.1-based image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1064815 (https://phabricator.wikimedia.org/T372602) (owner: 10Scott French) [15:32:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 15%: post reimage && maintenance repool', diff saved to https://phabricator.wikimedia.org/P68811 and previous config saved to /var/cache/conftool/dbconfig/20240910-153234-arnaudb.json [15:32:38] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3942/co" [puppet] - 10https://gerrit.wikimedia.org/r/1071901 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [15:32:39] !log cmooney@cumin1002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2012.codfw.wmnet [15:32:43] (03CR) 10Scott French: [V:03+2 C:03+2] php8.1-fpm-multiversion-base: initial release of 8.1-based 
image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1064816 (https://phabricator.wikimedia.org/T372602) (owner: 10Scott French) [15:32:44] (03PS2) 10Hamish: Add arbcom group to zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071902 (https://phabricator.wikimedia.org/T374455) [15:33:13] (03PS2) 10GergesShamon: [arwiki] change Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071888 (https://phabricator.wikimedia.org/T374430) [15:33:24] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10134659 (10Clement_Goubert) Yep that's ours. I'll depool the node so you can reseat when you want. [15:33:32] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2092.codfw.wmnet [15:33:34] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2092.codfw.wmnet [15:33:37] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10134665 (10ops-monitoring-bot) depool host wikikube-worker2092.codfw.wmnet by cgoubert@cumin1002 with reason: Degraded RAID [15:33:39] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10134666 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1002 depool for host wikikube-worker2092.codfw.wmnet completed: - wik... 
[15:34:23] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wikikube-worker2092.codfw.wmnet with reason: Degraded RAID [15:34:37] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wikikube-worker2092.codfw.wmnet with reason: Degraded RAID [15:34:46] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10134668 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b2597914-1845-48e0-a060-39e43e562886) set by cgoubert@cumin1002 for 7 days, 0:00:00 on 1 host... [15:34:52] !log cmooney@cumin1002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2012.codfw.wmnet [15:35:17] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: enable nftables throttling (drop) [puppet] - 10https://gerrit.wikimedia.org/r/1071900 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [15:35:19] (03CR) 10Dzahn: [C:03+2] gerrrit: enable nftables throttling (drop) [puppet] - 10https://gerrit.wikimedia.org/r/1071901 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [15:35:33] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10134670 (10Clement_Goubert) Host depooled and downtimed for a week, all yours. [15:36:20] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on cloudvirt2004-dev - https://phabricator.wikimedia.org/T374422#10134673 (10Jhancock.wm) @dcaro we got this automated ticket and I saw your ticket T374467 that the drive errors are gone. I logged into the server's gui just now and it's showing several memory error... 
[15:37:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [15:37:13] (PS6) Cathal Mooney: Bird::anycast - allow BFD connections from router link-local IP [puppet] - https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) [15:37:13] ACKNOWLEDGEMENT - Kafka broker TLS certificate validity on kafka-jumbo1010 is CRITICAL: SSL CRITICAL - Certificate kafka-jumbo1010.eqiad.wmnet valid until 2024-09-17 13:08:00 +0000 (expires in 6 days) Btullis Data Platform SRE will work on this. T374468 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [15:37:23] ops-codfw, SRE, DC-Ops: Degraded RAID on cloudvirt2004-dev - https://phabricator.wikimedia.org/T374422#10134675 (dcaro) >>! In T374422#10134671, @Jhancock.wm wrote: > @dcaro we got this automated ticket and I saw your ticket T374467 that the drive errors are gone. > I logged into the server's gui jus... [15:37:59] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1071842 (https://phabricator.wikimedia.org/T374430) (owner: GergesShamon) [15:38:09] ops-codfw, DC-Ops, Prod-Kubernetes, serviceops: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10134676 (Jhancock.wm) reseated. 
[15:38:16] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1071888 (https://phabricator.wikimedia.org/T374430) (owner: GergesShamon) [15:38:35] (CR) Kosta Harlan: ipoid: Set activeDeadlineSeconds (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1071752 (https://phabricator.wikimedia.org/T374414) (owner: Kosta Harlan) [15:39:00] !log enabling throttling on GitLab hosts - T366882 [15:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:04] T366882: implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882 [15:39:22] !log enabling throttling on Gerrit hosts - T365259 [15:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:43] (CR) Arnaudb: [C:+1] dbctl: add new module to interact with dbctl [software/spicerack] - https://gerrit.wikimedia.org/r/1058586 (https://phabricator.wikimedia.org/T362893) (owner: Volans) [15:39:49] FIRING: HelmReleaseBadStatus: Helm release airflow-test-k8s/production on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-test-k8s - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:41:10] (CR) Kosta Harlan: ipoid: Set activeDeadlineSeconds (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1071752 (https://phabricator.wikimedia.org/T374414) (owner: Kosta Harlan) [15:41:17] (PS1) Dreamy Jazz: [CheckUser] Don't write to central indexes when no CentralAuth [mediawiki-config] - https://gerrit.wikimedia.org/r/1071903 (https://phabricator.wikimedia.org/T374462) [15:41:23] jouncebot: now [15:41:23] 
For the next 0 hour(s) and 18 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T1500) [15:41:40] jouncebot: nowandnext [15:41:40] For the next 0 hour(s) and 18 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T1500) [15:41:40] In 0 hour(s) and 18 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T1600) [15:42:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [15:42:12] (03CR) 10Dreamy Jazz: ipoid: Set activeDeadlineSeconds (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071752 (https://phabricator.wikimedia.org/T374414) (owner: 10Kosta Harlan) [15:42:12] (03CR) 10Ssingh: [C:03+2] sre.dns.roll-restart-ntp: s/ntpd/ntpsec [cookbooks] - 10https://gerrit.wikimedia.org/r/1071899 (owner: 10Ssingh) [15:44:49] RESOLVED: HelmReleaseBadStatus: Helm release airflow-test-k8s/production on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-test-k8s - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:45:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'depool db2126 db2165 db2166 db2192 db2208 es2037 - T370852', diff saved to https://phabricator.wikimedia.org/P68813 and previous config saved to /var/cache/conftool/dbconfig/20240910-154540-arnaudb.json [15:45:44] T370852: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852 [15:45:59] !log 
arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 6 hosts with reason: network maintenance T373097 [15:46:02] T373097: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097 [15:46:05] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 6 hosts with reason: network maintenance T373097 [15:46:33] Want to deploy some backports that will likely be a train blocker if not deployed shortly. [15:46:36] jouncebot: nowandnext [15:46:36] For the next 0 hour(s) and 13 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T1500) [15:46:36] In 0 hour(s) and 13 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T1600) [15:47:38] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10134741 (10ABran-WMF) db/es hosts have been depooled [15:47:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 25%: post reimage && maintenance repool', diff saved to https://phabricator.wikimedia.org/P68814 and previous config saved to /var/cache/conftool/dbconfig/20240910-154740-arnaudb.json [15:47:54] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10134736 (10Clement_Goubert) It's not showing up in system, and still shows foreign on the RAID controler interface, but that host is part of {T358489} and should not act... 
[15:48:14] (PS1) Dreamy Jazz: Don't attempt to interact with central indexes for some wikis [extensions/CheckUser] (wmf/1.43.0-wmf.22) - https://gerrit.wikimedia.org/r/1071906 (https://phabricator.wikimedia.org/T374462) [15:48:46] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2021.codfw.wmnet with OS bullseye [15:49:04] ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10134744 (VRiley-WMF) @ABran-WMF I'm taking a look at this. I will update with results. [15:50:57] ops-codfw, SRE, SRE-swift-storage, collaboration-services, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10134763 (cmooney) >>! In T373097#10134741, @ABran-WMF wrote: > db/es hosts have been depooled thanks for confirming! [15:51:11] (CR) Dreamy Jazz: [C:+2] [CheckUser] Don't write to central indexes when no CentralAuth [mediawiki-config] - https://gerrit.wikimedia.org/r/1071903 (https://phabricator.wikimedia.org/T374462) (owner: Dreamy Jazz) [15:51:23] (CR) Dreamy Jazz: [C:+2] Don't attempt to interact with central indexes for some wikis [extensions/CheckUser] (wmf/1.43.0-wmf.22) - https://gerrit.wikimedia.org/r/1071906 (https://phabricator.wikimedia.org/T374462) (owner: Dreamy Jazz) [15:51:30] (PS2) Dreamy Jazz: Don't attempt to interact with central indexes for some wikis [extensions/CheckUser] (wmf/1.43.0-wmf.21) - https://gerrit.wikimedia.org/r/1071907 (https://phabricator.wikimedia.org/T374462) [15:52:15] (Merged) jenkins-bot: [CheckUser] Don't write to central indexes when no CentralAuth [mediawiki-config] - https://gerrit.wikimedia.org/r/1071903 (https://phabricator.wikimedia.org/T374462) (owner: Dreamy Jazz) [15:52:19] (PS3) Dreamy Jazz: Don't attempt to interact with central indexes for some wikis [extensions/CheckUser] (wmf/1.43.0-wmf.21) - https://gerrit.wikimedia.org/r/1071907 
(https://phabricator.wikimedia.org/T374462) [15:53:04] (03CR) 10Cathal Mooney: "Thanks for the help. This is somewhat ugly now, however I want to review the overall setup here with Arzhel when he gets back, we may cha" [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [15:53:20] (03CR) 10RLazarus: [C:03+1] mediawiki: Move job spec for reuse [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071864 (owner: 10Clément Goubert) [15:53:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T371742)', diff saved to https://phabricator.wikimedia.org/P68815 and previous config saved to /var/cache/conftool/dbconfig/20240910-155324-ladsgroup.json [15:53:29] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [15:54:44] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10134777 (10VRiley-WMF) a:03VRiley-WMF [15:54:48] (03CR) 10Dreamy Jazz: [C:03+2] Don't attempt to interact with central indexes for some wikis [extensions/CheckUser] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071907 (https://phabricator.wikimedia.org/T374462) (owner: 10Dreamy Jazz) [15:55:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1071906 (https://phabricator.wikimedia.org/T374462) (owner: 10Dreamy Jazz) [15:55:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071907 (https://phabricator.wikimedia.org/T374462) (owner: 10Dreamy Jazz) [15:56:30] !log move server uplinks in Netbox from asw-c5-codfw to lsw1-c5-codfw to prep physical moves T373097 [15:56:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:33] T373097: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097 [15:57:08] !log move server uplinks in Netbox from asw-c4-codfw to lsw1-c4-codfw to prep physical moves T373097 [15:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:59] !log push server and vlan configuration to lsw1-c4-codfw with Homer to prep physical moves T373097 [15:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:22] 06SRE, 06serviceops, 13Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10134785 (10Clement_Goubert) [15:58:30] (03PS1) 10Brouberol: airflow: store the connections.yaml content in a secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071908 (https://phabricator.wikimedia.org/T372787) [15:58:32] (03PS1) 10Brouberol: airflow: enable s3 logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071909 (https://phabricator.wikimedia.org/T372787) [15:59:29] Looks like my backport will collide with the puppet window [15:59:32] Is this a problem? [15:59:44] !log push server and vlan configuration to lsw1-c5-codfw with Homer to prep physical moves T373097 [15:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] jhathaway and rzl: Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T1600). Please do the needful. [16:00:04] lucaswerkmeister: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:23] o/ [16:00:25] Hello. Currently deploying a patch to fix a train blocker. 
[16:00:35] my change isn’t urgent [16:00:42] I have to leave relatively early but worst case I’ll just reschedule [16:00:44] The changes are yet to merge, so I could stop the scap command and let the puppet go forward first? [16:01:15] ETA is listed at 20 mins. [16:01:47] well, let’s not stop anything before some puppet deployers actually show up ^^ [16:01:55] Sure. [16:02:01] here, either way works for me :) [16:02:03] looking at the patch now [16:02:05] hi :) [16:02:22] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 23 hosts with reason: Move server uplinks codfw racks C4 [16:02:42] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 23 hosts with reason: Move server uplinks codfw racks C4 [16:02:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 50%: post reimage && maintenance repool', diff saved to https://phabricator.wikimedia.org/P68816 and previous config saved to /var/cache/conftool/dbconfig/20240910-160247-arnaudb.json [16:02:51] Dreamy_Jazz: as long as lucaswerkmeister doesn't mind, I say go ahead and we can sequence after yours [16:02:58] !log commence maintenance - move server uplinks from old to new switch codfw rack C4 T373097 [16:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:05] T373097: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097 [16:03:08] ops-codfw, SRE, SRE-swift-storage, collaboration-services, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10134808 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c5ef5c49-317c-49af-b11b-61e58fe45620) set by cmoon... [16:03:12] ops-codfw, DC-Ops, Prod-Kubernetes, serviceops: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10134813 (Jhancock.wm) sounds good to me. 
[16:03:17] Lucas has to be away early AFAIK [16:03:51] yeah, I have to leave around 16:30 or :35 UTC at the latest [16:04:06] but the extent to which I can test my change is limited anyway [16:04:30] I guess I’d have to go wake up Lucas_WMDE and grab the secret w/fatal-error.php password from production [16:05:45] as far as I’m concerned, my Puppet changes could be merged while I’m away, the question is just whether you’d be happy with that ^^ [16:06:43] I'd like to test it but I'm happy to set up another time with you to do that, either in the next puppet window or whenever there's a free moment between deployments before then [16:06:44] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [16:07:28] another option is, if you point me at where to find that password I can also test it, your only risk is that I'll err on the side of rolling back if I have any questions :) [16:08:18] it seems to be somewhere in private/FatalErrorSettings.php [16:08:23] I don’t think I’ve used it before either [16:08:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P68817 and previous config saved to /var/cache/conftool/dbconfig/20240910-160831-ladsgroup.json [16:08:58] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:20:00 on 26 hosts with reason: Move server uplinks codfw racks C5 [16:09:00] rzl: tomorrow 16:00 UTC (or maybe 16:30) would also work for me, I should be back home by then [16:09:22] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on 26 hosts with reason: 
Move server uplinks codfw racks C5 [16:09:28] I have a meeting at 16 but I can do 16:30 [16:09:30] (03PS1) 10DCausse: cirrus-streaming-updater: fix wikiids filter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071911 [16:09:34] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10134820 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a5d7ae66-6b48-4bdb-8951-87b0e41404de) set by cmoon... [16:09:47] (03PS2) 10Brouberol: airflow: enable s3 logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071909 (https://phabricator.wikimedia.org/T372787) [16:09:47] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:10:11] okay, I’ll try to be around then and maybe ping you [16:11:17] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: fix wikiids filter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071911 (owner: 10DCausse) [16:11:29] PROBLEM - BGP status on lsw1-a5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:11:45] FIRING: [2x] CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [16:11:50] lucaswerkmeister: sgtm! 
I'll plan for that, worst case we can get it out in the Thursday Puppet window [16:12:10] oh, I didn’t know there was a second one this week [16:12:12] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2137.codfw.wmnet onto db2237.codfw.wmnet [16:12:22] (03Merged) 10jenkins-bot: cirrus-streaming-updater: fix wikiids filter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071911 (owner: 10DCausse) [16:12:24] I’ll be under roughly the same time constraint as today then, but sounds like a fallback plan ^^ [16:12:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:12:25] thanks! [16:12:53] (03PS7) 10Cathal Mooney: Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) [16:13:36] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [16:14:02] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:14:23] rzl: btw lucaswerkmeister's change requires a mw-on-k8s deployment to update the file in the containers [16:14:49] yep [16:15:01] (appreciate the callout though) [16:15:11] * lucaswerkmeister is interested [16:15:20] puppet puts the file on the deployment host, and then the next image build there picks it up? [16:15:24] yeah exactly [16:15:27] I see [16:15:31] so it would even be testable on mwdebug? [16:15:43] yeah, it'll still get to mwdebug the old-fashioned way [16:15:46] cool [16:15:58] no image build though, it's mounted as a configmap iirc [16:16:08] or I may be mistaken [16:16:26] oh! 
that'd be simpler than I thought, I'll double-check in the intervening time [16:16:28] (03CR) 10Btullis: "You have got me wondering now, is it inherently more secure to render this as a Secret, then? Are there any downsides to us?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071908 (https://phabricator.wikimedia.org/T372787) (owner: 10Brouberol) [16:16:44] FIRING: [2x] CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [16:17:23] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2040.codfw.wmnet [16:17:28] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2040.codfw.wmnet [16:17:33] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2041.codfw.wmnet [16:17:37] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2041.codfw.wmnet [16:17:42] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2058.codfw.wmnet [16:17:45] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2058.codfw.wmnet [16:17:50] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2440.codfw.wmnet [16:17:52] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2440.codfw.wmnet [16:17:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 75%: post reimage && maintenance repool', diff saved to https://phabricator.wikimedia.org/P68818 and previous config saved to /var/cache/conftool/dbconfig/20240910-161753-arnaudb.json [16:17:56] !log dcausse@deploy1003 helmfile [eqiad] START 
helmfile.d/services/cirrus-streaming-updater: apply [16:17:56] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10134839 (10cmooney) Move done, all migrated hosts are pinging again no issues to report. [16:17:57] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2442.codfw.wmnet [16:17:58] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2442.codfw.wmnet [16:18:00] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:18:03] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2443.codfw.wmnet [16:18:05] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2443.codfw.wmnet [16:18:10] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2033.codfw.wmnet [16:18:11] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host parse2011.codfw.wmnet [16:18:13] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host parse2011.codfw.wmnet [16:18:18] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2025.codfw.wmnet [16:18:18] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host parse2012.codfw.wmnet [16:18:20] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host parse2012.codfw.wmnet [16:18:25] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host parse2013.codfw.wmnet [16:18:27] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host parse2013.codfw.wmnet [16:18:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 25%: T373097', diff saved to 
https://phabricator.wikimedia.org/P68819 and previous config saved to /var/cache/conftool/dbconfig/20240910-161832-arnaudb.json [16:18:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2037 (re)pooling @ 25%: T373097', diff saved to https://phabricator.wikimedia.org/P68820 and previous config saved to /var/cache/conftool/dbconfig/20240910-161832-arnaudb.json [16:18:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 25%: T373097', diff saved to https://phabricator.wikimedia.org/P68821 and previous config saved to /var/cache/conftool/dbconfig/20240910-161832-arnaudb.json [16:18:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2208 (re)pooling @ 25%: T373097', diff saved to https://phabricator.wikimedia.org/P68822 and previous config saved to /var/cache/conftool/dbconfig/20240910-161832-arnaudb.json [16:18:35] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2039.codfw.wmnet [16:18:36] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2039.codfw.wmnet [16:18:41] T373097: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097 [16:19:08] !log cmooney@cumin1002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2012.codfw.wmnet [16:20:51] (03CR) 10Btullis: airflow: enable s3 logging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071909 (https://phabricator.wikimedia.org/T372787) (owner: 10Brouberol) [16:21:13] !log cmooney@cumin1002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2012.codfw.wmnet [16:21:44] RESOLVED: [2x] CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [16:22:35] (03Merged) 
10jenkins-bot: Don't attempt to interact with central indexes for some wikis [extensions/CheckUser] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1071906 (https://phabricator.wikimedia.org/T374462) (owner: 10Dreamy Jazz) [16:23:08] (03Merged) 10jenkins-bot: Don't attempt to interact with central indexes for some wikis [extensions/CheckUser] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071907 (https://phabricator.wikimedia.org/T374462) (owner: 10Dreamy Jazz) [16:23:31] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1071903|[CheckUser] Don't write to central indexes when no CentralAuth (T374462)]], [[gerrit:1071906|Don't attempt to interact with central indexes for some wikis (T374462)]], [[gerrit:1071907|Don't attempt to interact with central indexes for some wikis (T374462)]] [16:23:34] T374462: CheckUser data is not being purged for labswiki - https://phabricator.wikimedia.org/T374462 [16:23:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P68823 and previous config saved to /var/cache/conftool/dbconfig/20240910-162339-ladsgroup.json [16:24:27] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10134882 (10ABran-WMF) db/es hosts are repooling [16:24:40] (03CR) 10Krinkle: [C:03+1] "OK to deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071838 (https://phabricator.wikimedia.org/T343492) (owner: 10Hokwelum) [16:25:27] claime, lucaswerkmeister: yep good call, a configmap it is, via https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/mediawiki/manifests/web/yaml_defs.pp#48 -> https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/mediawiki/templates/lamp/configmap.yaml.tpl#38 [16:25:29] (03PS1) 10Clément 
Goubert: sre.hosts.provision: Fix --no-users [cookbooks] - 10https://gerrit.wikimedia.org/r/1071913 (https://phabricator.wikimedia.org/T365372) [16:25:43] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1071903|[CheckUser] Don't write to central indexes when no CentralAuth (T374462)]], [[gerrit:1071906|Don't attempt to interact with central indexes for some wikis (T374462)]], [[gerrit:1071907|Don't attempt to interact with central indexes for some wikis (T374462)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:25:45] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [16:25:53] we have such a normal number of kinds of config, in such a normal number of config repositories [16:26:57] (03PS8) 10Cathal Mooney: Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) [16:26:58] 06SRE, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Grant Access to logstash for Jeremyb - https://phabricator.wikimedia.org/T374406#10134888 (10KFrancis) Hi @jeremyb Thanks for checking in. We have sunsetted the L2 form. I am happy to facilitate an NDA from my end though. Please send your full name, email... 
[16:26:59] !log cgoubert@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker2092.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTARTand with Dell SCP reboot policy GRACEFUL [16:27:33] rzl: much normal, very config [16:29:40] :D [16:30:18] yeah earlier today I was suddenly wondering if I’d scheduled my patch for the right window since w/fatal-error.php is in mediawiki-config but the file I touched in puppet ^^ [16:30:20] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1071903|[CheckUser] Don't write to central indexes when no CentralAuth (T374462)]], [[gerrit:1071906|Don't attempt to interact with central indexes for some wikis (T374462)]], [[gerrit:1071907|Don't attempt to interact with central indexes for some wikis (T374462)]] (duration: 06m 49s) [16:30:27] T374462: CheckUser data is not being purged for labswiki - https://phabricator.wikimedia.org/T374462 [16:32:04] I'm done. Apologies for crashing into the puppet window. [16:32:34] (CR) Clément Goubert: [C:+2] "recheck" [cookbooks] - https://gerrit.wikimedia.org/r/1071887 (https://phabricator.wikimedia.org/T374351) (owner: Clément Goubert) [16:33:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 100%: post reimage && maintenance repool', diff saved to https://phabricator.wikimedia.org/P68824 and previous config saved to /var/cache/conftool/dbconfig/20240910-163300-arnaudb.json [16:33:28] Dreamy_Jazz: Did you at least do a roll and point an imaginary gun after crashing through it? 
[16:33:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 50%: T373097', diff saved to https://phabricator.wikimedia.org/P68825 and previous config saved to /var/cache/conftool/dbconfig/20240910-163337-arnaudb.json [16:33:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 50%: T373097', diff saved to https://phabricator.wikimedia.org/P68826 and previous config saved to /var/cache/conftool/dbconfig/20240910-163337-arnaudb.json [16:33:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2037 (re)pooling @ 50%: T373097', diff saved to https://phabricator.wikimedia.org/P68827 and previous config saved to /var/cache/conftool/dbconfig/20240910-163338-arnaudb.json [16:33:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2208 (re)pooling @ 50%: T373097', diff saved to https://phabricator.wikimedia.org/P68828 and previous config saved to /var/cache/conftool/dbconfig/20240910-163338-arnaudb.json [16:33:40] :D [16:33:47] T373097: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097 [16:33:54] Maybe more James Bond style [16:34:05] !log Repooled kubernetes2040.codfw.wmnet kubernetes2041.codfw.wmnet kubernetes2058.codfw.wmnet mw2440.codfw.wmnet mw2442.codfw.wmnet mw2443.codfw.wmnet parse2011.codfw.wmnet parse2012.codfw.wmnet parse2013.codfw.wmnet wikikube-worker2039.codfw.wmnet - T373097 [16:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:47] what I want to know is how you did that while clearing an obstacle off the train tracks [16:34:54] it's not quite a mixed metaphor yet but it's getting dangerously close [16:35:12] dangerzone.gif [16:35:18] * lucaswerkmeister throws “crashing through the puppet window like the kool aid man” into the metaphor mix [16:35:40] lucaswerkmeister: except it's not juice, just disappointment [16:35:53] anyway, I’m afk now, see you :) [16:35:57] o/ [16:37:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2165 
(re)pooling @ 50%: T373097', diff saved to https://phabricator.wikimedia.org/P68829 and previous config saved to /var/cache/conftool/dbconfig/20240910-163722-arnaudb.json [16:37:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 50%: T373097', diff saved to https://phabricator.wikimedia.org/P68830 and previous config saved to /var/cache/conftool/dbconfig/20240910-163742-arnaudb.json [16:38:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T371742)', diff saved to https://phabricator.wikimedia.org/P68831 and previous config saved to /var/cache/conftool/dbconfig/20240910-163846-ladsgroup.json [16:38:48] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1195.eqiad.wmnet with reason: Maintenance [16:38:50] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [16:39:02] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1195.eqiad.wmnet with reason: Maintenance [16:39:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1195 (T371742)', diff saved to https://phabricator.wikimedia.org/P68832 and previous config saved to /var/cache/conftool/dbconfig/20240910-163908-ladsgroup.json [16:39:44] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [16:42:15] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns2005.wikimedia.org [reason: end: T373097 codfw maintenance] [16:42:18] (03PS13) 10Ejegg: Assign the API portal to the Wikimedia group for CentralNotice 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/649942 (https://phabricator.wikimedia.org/T270308) [16:42:19] T373097: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097 [16:43:06] !log running authdns-update [16:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:15] (03CR) 10Ejegg: "Oops, let's try to actually deploy this one!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649942 (https://phabricator.wikimedia.org/T270308) (owner: 10Ejegg) [16:43:47] 06SRE, 10CheckUser, 06DBA, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10134931 (10Dreamy_Jazz) Fixed through T374462. [16:43:55] 06SRE, 10CheckUser, 06DBA, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10134933 (10Dreamy_Jazz) 05Open→03Resolved a:03Dreamy_Jazz [16:44:29] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:44:44] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [16:45:21] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:45:23] (03PS1) 10Fabfur: cache:haproxy: introduce extended logging on socket for haproxykafka [puppet] - 
10https://gerrit.wikimedia.org/r/1071915 (https://phabricator.wikimedia.org/T374473) [16:45:45] (03CR) 10CI reject: [V:04-1] cache:haproxy: introduce extended logging on socket for haproxykafka [puppet] - 10https://gerrit.wikimedia.org/r/1071915 (https://phabricator.wikimedia.org/T374473) (owner: 10Fabfur) [16:45:50] (03Merged) 10jenkins-bot: sre.hosts.rename: Mask puppet-agent-timer [cookbooks] - 10https://gerrit.wikimedia.org/r/1071887 (https://phabricator.wikimedia.org/T374351) (owner: 10Clément Goubert) [16:47:46] (03PS1) 10Urbanecm: Babel: Set BabelUseCommunityConfiguration to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071916 (https://phabricator.wikimedia.org/T374348) [16:47:47] (03PS1) 10Urbanecm: [beta] Babel: Use CommunityConfiguration in cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071917 (https://phabricator.wikimedia.org/T374348) [16:48:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 75%: T373097', diff saved to https://phabricator.wikimedia.org/P68833 and previous config saved to /var/cache/conftool/dbconfig/20240910-164842-arnaudb.json [16:48:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 75%: T373097', diff saved to https://phabricator.wikimedia.org/P68834 and previous config saved to /var/cache/conftool/dbconfig/20240910-164842-arnaudb.json [16:48:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: T373097', diff saved to https://phabricator.wikimedia.org/P68835 and previous config saved to /var/cache/conftool/dbconfig/20240910-164842-arnaudb.json [16:48:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2208 (re)pooling @ 75%: T373097', diff saved to https://phabricator.wikimedia.org/P68836 and previous config saved to /var/cache/conftool/dbconfig/20240910-164843-arnaudb.json [16:48:46] T373097: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097 [16:49:22] (03CR) 10Alexandros Kosiaris: 
[C:03+2] apt: Remove mention of php72 component [puppet] - 10https://gerrit.wikimedia.org/r/1070997 (owner: 10Alexandros Kosiaris) [16:50:39] RECOVERY - BGP status on lsw1-a5-codfw.mgmt is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:51:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:52:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 75%: T373097', diff saved to https://phabricator.wikimedia.org/P68837 and previous config saved to /var/cache/conftool/dbconfig/20240910-165228-arnaudb.json [16:52:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 75%: T373097', diff saved to https://phabricator.wikimedia.org/P68838 and previous config saved to /var/cache/conftool/dbconfig/20240910-165248-arnaudb.json [16:54:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [16:54:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [16:54:55] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T233089) (owner: 10Cwhite) [16:55:58] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10135021 (10Clement_Goubert) Reset the RAID config and the disk is still in `Foreign` state, so I can't use it for a Virtual Disk. I think a replacement is in order. [16:56:10] (03CR) 10Andrea Denisse: "Thanks for taking a look, what is folding the change?" [puppet] - 10https://gerrit.wikimedia.org/r/1071701 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [16:59:19] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2092.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTARTand with Dell SCP reboot policy GRACEFUL [16:59:44] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T1700) [17:03:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 100%: T373097', diff saved to https://phabricator.wikimedia.org/P68839 and previous config saved to /var/cache/conftool/dbconfig/20240910-170347-arnaudb.json [17:03:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 100%: T373097', diff saved to https://phabricator.wikimedia.org/P68840 and previous config saved to /var/cache/conftool/dbconfig/20240910-170347-arnaudb.json [17:03:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2037 (re)pooling @ 100%: T373097', diff saved to https://phabricator.wikimedia.org/P68841 and previous config saved to 
/var/cache/conftool/dbconfig/20240910-170348-arnaudb.json [17:03:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2208 (re)pooling @ 100%: T373097', diff saved to https://phabricator.wikimedia.org/P68842 and previous config saved to /var/cache/conftool/dbconfig/20240910-170348-arnaudb.json [17:03:51] T373097: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097 [17:04:44] FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [17:07:14] (03CR) 10Dzahn: "so.. can both old and new server write to m2-master.eqiad.wmnet? I would suggest testing that with mysql client manually first. If it tur" [puppet] - 10https://gerrit.wikimedia.org/r/1070908 (https://phabricator.wikimedia.org/T373420) (owner: 10AOkoth) [17:07:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 100%: T373097', diff saved to https://phabricator.wikimedia.org/P68843 and previous config saved to /var/cache/conftool/dbconfig/20240910-170734-arnaudb.json [17:07:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 100%: T373097', diff saved to https://phabricator.wikimedia.org/P68844 and previous config saved to /var/cache/conftool/dbconfig/20240910-170753-arnaudb.json [17:08:01] (03PS3) 10Brouberol: global_config: add the s3-eqiad-dpe external service [puppet] - 10https://gerrit.wikimedia.org/r/1071920 (https://phabricator.wikimedia.org/T372787) [17:09:12] (03PS3) 10Brouberol: airflow: enable s3 logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071909 (https://phabricator.wikimedia.org/T372787) [17:09:12] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2021.codfw.wmnet with OS bullseye [17:09:17] (03PS2) 10Clément 
Goubert: httpbb: Move wikifunctions to its own test suite [puppet] - 10https://gerrit.wikimedia.org/r/1071919 (https://phabricator.wikimedia.org/T374442) [17:10:00] (03CR) 10Brouberol: airflow: enable s3 logging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071909 (https://phabricator.wikimedia.org/T372787) (owner: 10Brouberol) [17:10:37] (03PS4) 10Brouberol: airflow: enable s3 logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071909 (https://phabricator.wikimedia.org/T372787) [17:10:37] (03PS2) 10Fabfur: cache:haproxy: introduce extended logging on socket for haproxykafka [puppet] - 10https://gerrit.wikimedia.org/r/1071915 (https://phabricator.wikimedia.org/T374473) [17:10:46] (03CR) 10Brouberol: airflow: enable s3 logging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071909 (https://phabricator.wikimedia.org/T372787) (owner: 10Brouberol) [17:12:27] (03CR) 10Btullis: "Looks good to me, but I spotted an unrelated typo." [puppet] - 10https://gerrit.wikimedia.org/r/1071920 (https://phabricator.wikimedia.org/T372787) (owner: 10Brouberol) [17:12:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071838 (https://phabricator.wikimedia.org/T343492) (owner: 10Hokwelum) [17:14:59] (03CR) 10RLazarus: [C:03+1] "Thanks for this!" [puppet] - 10https://gerrit.wikimedia.org/r/1071919 (https://phabricator.wikimedia.org/T374442) (owner: 10Clément Goubert) [17:20:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071890 (https://phabricator.wikimedia.org/T373039) (owner: 10Jdrewniak) [17:23:46] (03CR) 10Scott French: [C:03+1] "Nice! 
I was going to look at this later today, so thank you :)" [puppet] - 10https://gerrit.wikimedia.org/r/1071919 (https://phabricator.wikimedia.org/T374442) (owner: 10Clément Goubert) [17:24:31] sukhe: i'm sorry! yes that was good to bump. i was going to merge but there was a lock because someone else was. then meetings started [17:24:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [17:24:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [17:24:58] (03CR) 10Jdlrobson: [C:03+1] Configure QuickSurvey for Web empty search state experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071890 (https://phabricator.wikimedia.org/T373039) (owner: 10Jdrewniak) [17:27:52] !log removing 15 files for legal compliance [17:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:55] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10135140 (10bwang) [17:30:42] (03CR) 10AOkoth: "Yeah, it doesn't." [puppet] - 10https://gerrit.wikimedia.org/r/1070908 (https://phabricator.wikimedia.org/T373420) (owner: 10AOkoth) [17:31:02] ottomata: no worries! [17:34:11] (03PS1) 10Dzahn: phabricator: switch phab2002 to nftables as firewall provider [puppet] - 10https://gerrit.wikimedia.org/r/1071923 (https://phabricator.wikimedia.org/T370677) [17:34:26] (03PS3) 10Fabfur: cache:haproxy: introduce extended logging on socket for haproxykafka [puppet] - 10https://gerrit.wikimedia.org/r/1071915 (https://phabricator.wikimedia.org/T374473) [17:35:23] (03CR) 10AOkoth: "Eerm.. 
Any VRTS server in codfw connects to m2-slave (there is a separate hiera file) meaning it can't write to the database. Plus the vrt" [puppet] - 10https://gerrit.wikimedia.org/r/1070908 (https://phabricator.wikimedia.org/T373420) (owner: 10AOkoth) [17:36:12] (03CR) 10Dzahn: "How about this: schedule downtime (with cookbook) for vrts2001 for something like a week. Then actually shut down vrts2001. Result: no " [puppet] - 10https://gerrit.wikimedia.org/r/1070908 (https://phabricator.wikimedia.org/T373420) (owner: 10AOkoth) [17:37:29] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071915 (https://phabricator.wikimedia.org/T374473) (owner: 10Fabfur) [17:37:59] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 9742 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [17:39:04] !log removing 4 files for legal compliance [17:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:56] (03CR) 10Dzahn: phabricator: switch phab2002 to nftables as firewall provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071923 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [17:44:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T371742)', diff saved to https://phabricator.wikimedia.org/P68845 and previous config saved to /var/cache/conftool/dbconfig/20240910-174454-ladsgroup.json [17:44:58] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [17:46:28] (03PS4) 10Fabfur: cache:haproxy: introduce extended logging on socket for haproxykafka [puppet] - 10https://gerrit.wikimedia.org/r/1071915 (https://phabricator.wikimedia.org/T374473) 
[17:47:09] (03PS2) 10Dzahn: phabricator: switch phab2002 to nftables as firewall provider [puppet] - 10https://gerrit.wikimedia.org/r/1071923 (https://phabricator.wikimedia.org/T370677) [17:48:42] (03PS1) 10Dzahn: requesttracker: limit envoy srange to CACHES [puppet] - 10https://gerrit.wikimedia.org/r/1071925 (https://phabricator.wikimedia.org/T370677) [17:49:26] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10135189 (10VRiley-WMF) I have attempted a few troubleshooting steps. I have uploaded logs to Dell under SR 197398410. Awaiting results. [17:49:58] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071915 (https://phabricator.wikimedia.org/T374473) (owner: 10Fabfur) [17:50:03] (03PS1) 10Dzahn: aphlict: limit envoy srange to CACHES [puppet] - 10https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) [17:50:42] (03PS1) 10Dzahn: peopleweb: limit envoy srange to CACHES [puppet] - 10https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) [17:52:57] (03PS17) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) [17:53:31] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1071925/3946/moscovium.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1071925 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [17:54:06] (03PS5) 10Fabfur: cache:haproxy: introduce extended logging on socket for haproxykafka [puppet] - 10https://gerrit.wikimedia.org/r/1071915 (https://phabricator.wikimedia.org/T374473) [17:55:39] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071915 (https://phabricator.wikimedia.org/T374473) (owner: 10Fabfur) [17:55:58] (03PS2) 10Dzahn: peopleweb: limit envoy srange to CACHES and DEPLOYMENT servers [puppet] - 
10https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) [17:57:52] (03PS2) 10Ladsgroup: conftool-data: Remove pc5 for now [puppet] - 10https://gerrit.wikimedia.org/r/1071886 (https://phabricator.wikimedia.org/T374355) [17:57:56] !log swfrench@cumin2002 START - Cookbook sre.dns.netbox [17:57:57] (03CR) 10Ladsgroup: [V:03+2 C:03+2] conftool-data: Remove pc5 for now [puppet] - 10https://gerrit.wikimedia.org/r/1071886 (https://phabricator.wikimedia.org/T374355) (owner: 10Ladsgroup) [17:58:44] (03CR) 10Dzahn: "looking at "tcpdump port 443" all the connections appear to come from cp* machines" [puppet] - 10https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [18:00:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P68846 and previous config saved to /var/cache/conftool/dbconfig/20240910-180001-ladsgroup.json [18:00:05] dduvall and dancy: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T1800). 
[18:00:17] o/ [18:01:01] o/ [18:01:52] !log swfrench@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add newly allocated LVS VIPs for mwdebug-next - swfrench@cumin2002" [18:01:57] !log swfrench@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add newly allocated LVS VIPs for mwdebug-next - swfrench@cumin2002" [18:01:58] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:02:54] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1071925 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [18:03:52] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1071923 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [18:04:00] (03CR) 10Kosta Harlan: ipoid: Set activeDeadlineSeconds (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071752 (https://phabricator.wikimedia.org/T374414) (owner: 10Kosta Harlan) [18:04:05] !log ran sre.dns.netbox after adding mwdebug-next LVS VIPs for T372604 [18:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:08] T372604: Turn up PHP 8.1-flavored mw-debug k8s deployment - https://phabricator.wikimedia.org/T372604 [18:06:59] (03PS1) 10TrainBranchBot: group0 to 1.43.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071931 (https://phabricator.wikimedia.org/T373641) [18:07:01] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.43.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071931 (https://phabricator.wikimedia.org/T373641) (owner: 10TrainBranchBot) [18:07:43] (03Merged) 10jenkins-bot: group0 to 1.43.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071931 (https://phabricator.wikimedia.org/T373641) (owner: 10TrainBranchBot) [18:08:28] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: 
Degraded RAID on db2198 - https://phabricator.wikimedia.org/T374095#10135206 (10Jhancock.wm) drive has been replaced! lmk If there are any other issues. [18:10:50] (03CR) 10Kosta Harlan: "is Ifd1a048ccacfe1968d8b7038f22470594f523c4d somehow related?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071752 (https://phabricator.wikimedia.org/T374414) (owner: 10Kosta Harlan) [18:14:29] (03PS1) 10Scott French: wmnet: A and PTR records for mwdebug-next in svc [dns] - 10https://gerrit.wikimedia.org/r/1071932 (https://phabricator.wikimedia.org/T372604) [18:14:37] !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.43.0-wmf.22 refs T373641 [18:14:40] T373641: 1.43.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T373641 [18:15:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P68847 and previous config saved to /var/cache/conftool/dbconfig/20240910-181508-ladsgroup.json [18:16:07] (03PS4) 10Brouberol: global_config: add the s3-eqiad-dpe external service [puppet] - 10https://gerrit.wikimedia.org/r/1071920 (https://phabricator.wikimedia.org/T372787) [18:16:13] (03CR) 10Brouberol: "The typo is now fixed. Good spot!" [puppet] - 10https://gerrit.wikimedia.org/r/1071920 (https://phabricator.wikimedia.org/T372787) (owner: 10Brouberol) [18:16:50] (03CR) 10Ssingh: [C:03+1] wmnet: A and PTR records for mwdebug-next in svc [dns] - 10https://gerrit.wikimedia.org/r/1071932 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [18:19:53] (03CR) 10Brouberol: "So, this is really me erring on the side of caution. I'm a simple man. I see a secret key, I put it in a Secret." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1071908 (https://phabricator.wikimedia.org/T372787) (owner: 10Brouberol) [18:20:57] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:20:58] !log dduvall@deploy1003 Started deploy [releng/jenkins-deploy@8be3b36] (releasing): (no justification provided) [18:21:15] (03CR) 10Scott French: "Thanks, Sukhbir!" [dns] - 10https://gerrit.wikimedia.org/r/1071932 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [18:22:25] (03CR) 10Scott French: [C:03+2] wmnet: A and PTR records for mwdebug-next in svc [dns] - 10https://gerrit.wikimedia.org/r/1071932 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [18:23:21] 06SRE, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Grant Access to logstash for Jeremyb - https://phabricator.wikimedia.org/T374406#10135229 (10Dzahn) > so I can detect when I am spamming the logs before I disrupt the deployment train The `logspam` checker script comes to mind when reading this. Are you a... 
[18:24:34] FIRING: [3x] ProbeDown: Service releases1003:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:25:43] PROBLEM - jenkins_service_running on releases1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [18:26:42] !log dduvall@deploy1003 Finished deploy [releng/jenkins-deploy@8be3b36] (releasing): (no justification provided) (duration: 05m 43s) [18:26:43] RECOVERY - jenkins_service_running on releases1003 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [18:26:59] (03CR) 10Brouberol: "> It occured to me that with Superset we render superset-config as a ConfigMap containing some secret strings, such as SQLALCHEMY_DATABASE" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071908 (https://phabricator.wikimedia.org/T372787) (owner: 10Brouberol) [18:29:34] RESOLVED: [3x] ProbeDown: Service releases1003:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:29:45] (03PS1) 10Scott French: service: add basic configuration for mwdebug-next [puppet] - 10https://gerrit.wikimedia.org/r/1071933 (https://phabricator.wikimedia.org/T372604) [18:30:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T371742)', diff saved to https://phabricator.wikimedia.org/P68848 and previous config saved to /var/cache/conftool/dbconfig/20240910-183016-ladsgroup.json [18:30:19] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1196.eqiad.wmnet with reason: Maintenance [18:30:22] 
T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [18:30:32] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1196.eqiad.wmnet with reason: Maintenance [18:30:33] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:30:48] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:30:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1196 (T371742)', diff saved to https://phabricator.wikimedia.org/P68849 and previous config saved to /var/cache/conftool/dbconfig/20240910-183055-ladsgroup.json [18:31:39] (03CR) 10JHathaway: [C:03+1] Puppet agent: Remove obsolete manage_puppet_ca_file code [puppet] - 10https://gerrit.wikimedia.org/r/1071885 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff) [18:33:55] (03PS3) 10Jdrewniak: Configure QuickSurvey for Web empty search state experiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071890 (https://phabricator.wikimedia.org/T373039) [18:34:18] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10135293 (10Jhancock.wm) made a service request with Dell. Will update when it arrives. [18:34:49] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on cloudvirt2004-dev - https://phabricator.wikimedia.org/T374422#10135294 (10Jhancock.wm) requested submitted. I'll update when it gets here. 
[18:38:24] !log ran authdns-update on dns1004 (18:25 UTC) for T372604 [18:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:27] T372604: Turn up PHP 8.1-flavored mw-debug k8s deployment - https://phabricator.wikimedia.org/T372604 [18:38:29] (03CR) 10Ssingh: [C:03+1] "Looks good to me but I haven't verified the log format 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1071915 (https://phabricator.wikimedia.org/T374473) (owner: 10Fabfur) [18:42:11] (03PS2) 10Scott French: service: add basic configuration for mwdebug-next [puppet] - 10https://gerrit.wikimedia.org/r/1071933 (https://phabricator.wikimedia.org/T372604) [18:43:00] (03PS1) 10BCornwall: varnish: Conditionally monitor vcl reloads [puppet] - 10https://gerrit.wikimedia.org/r/1071935 [18:43:21] (03CR) 10CI reject: [V:04-1] varnish: Conditionally monitor vcl reloads [puppet] - 10https://gerrit.wikimedia.org/r/1071935 (owner: 10BCornwall) [18:44:55] (03CR) 10Vgutierrez: [C:04-1] cache:haproxy: introduce extended logging on socket for haproxykafka (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071915 (https://phabricator.wikimedia.org/T374473) (owner: 10Fabfur) [18:47:16] !log dduvall@deploy1003 Started deploy [releng/jenkins-deploy@71141b8] (releasing): (no justification provided) [18:47:51] !log dduvall@deploy1003 Finished deploy [releng/jenkins-deploy@71141b8] (releasing): (no justification provided) (duration: 00m 35s) [18:53:35] (03PS1) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071936 (https://phabricator.wikimedia.org/T373195) [18:55:36] (03PS2) 10BCornwall: varnish: Conditionally monitor vcl reloads [puppet] - 10https://gerrit.wikimedia.org/r/1071935 [18:55:58] (03CR) 10CI reject: [V:04-1] varnish: Conditionally monitor vcl reloads [puppet] - 10https://gerrit.wikimedia.org/r/1071935 (owner: 10BCornwall) [18:56:45] !log bking@deploy1003 helmfile [staging] START 
helmfile.d/services/rdf-streaming-updater: apply [18:56:47] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [18:57:00] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [18:58:07] (03CR) 10JHathaway: [C:03+2] P:tlsproxy::instance: Drop numa_networking global [puppet] - 10https://gerrit.wikimedia.org/r/724733 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [19:00:36] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [19:01:54] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [19:06:33] <_Gerges> jouncebot: next [19:06:33] In 0 hour(s) and 53 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T2000) [19:06:46] (03PS2) 10Andrea Denisse: alert: Make alert2002 the active host for corto [puppet] - 10https://gerrit.wikimedia.org/r/1071700 (https://phabricator.wikimedia.org/T372418) [19:09:23] (03PS2) 10Andrea Denisse: alert: Make alert1002 the active host for corto [puppet] - 10https://gerrit.wikimedia.org/r/1071701 (https://phabricator.wikimedia.org/T372418) [19:09:30] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [19:09:37] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [19:11:05] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:11:22] (03CR) 10JHathaway: [C:03+2] realm.pp: drop $other_site global [puppet] - 10https://gerrit.wikimedia.org/r/971461 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [19:11:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:12:17] !log bking@deploy1003 helmfile [staging] 
START helmfile.d/services/rdf-streaming-updater: apply [19:12:30] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [19:14:33] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52631 bytes in 5.299 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:14:55] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:15:20] (03PS3) 10BCornwall: varnish: Conditionally monitor vcl reloads [puppet] - 10https://gerrit.wikimedia.org/r/1071935 [19:15:47] (03PS2) 10Jbond: resolvconf: add nameservr_ips [puppet] - 10https://gerrit.wikimedia.org/r/971409 (https://phabricator.wikimedia.org/T350008) [19:15:51] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971409 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [19:17:11] (03CR) 10CI reject: [V:04-1] varnish: Conditionally monitor vcl reloads [puppet] - 10https://gerrit.wikimedia.org/r/1071935 (owner: 10BCornwall) [19:19:44] (03PS3) 10Andrea Denisse: alert: Failover from alert1001 to alert2002 [puppet] - 10https://gerrit.wikimedia.org/r/1071700 (https://phabricator.wikimedia.org/T372418) [19:19:44] (03PS3) 10Andrea Denisse: alert: Failover from alert2002 to alert1002 [puppet] - 10https://gerrit.wikimedia.org/r/1071701 (https://phabricator.wikimedia.org/T372418) [19:20:26] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [19:20:29] (03PS4) 10BCornwall: varnish: Conditionally monitor vcl reloads [puppet] - 10https://gerrit.wikimedia.org/r/1071935 [19:20:33] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [19:21:06] 06SRE: Having issues with Zendesk e-mail notifications - https://phabricator.wikimedia.org/T374489 (10JLam-WMF) 03NEW [19:26:16] 06SRE: Having issues with Zendesk 
e-mail notifications - https://phabricator.wikimedia.org/T374489#10135552 (10JLam-WMF) [19:26:54] 06SRE: Having issues with Zendesk e-mail notifications - https://phabricator.wikimedia.org/T374489#10135557 (10JLam-WMF) [19:27:09] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3947/console" [puppet] - 10https://gerrit.wikimedia.org/r/1071935 (owner: 10BCornwall) [19:27:41] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Strict mode enabled by default - https://phabricator.wikimedia.org/T372664#10135558 (10jhathaway) [19:27:51] 06SRE: Having issues with Zendesk e-mail notifications - https://phabricator.wikimedia.org/T374489#10135562 (10JLam-WMF) [19:29:03] (03PS6) 10Fabfur: cache:haproxy: introduce extended logging on socket for haproxykafka [puppet] - 10https://gerrit.wikimedia.org/r/1071915 (https://phabricator.wikimedia.org/T374473) [19:29:14] 06SRE: Having issues with Zendesk e-mail notifications - https://phabricator.wikimedia.org/T374489#10135573 (10JLam-WMF) [19:30:10] 06SRE, 06Infrastructure-Foundations, 10Mail: Having issues with Zendesk e-mail notifications - https://phabricator.wikimedia.org/T374489#10135576 (10JLam-WMF) [19:32:46] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071915 (https://phabricator.wikimedia.org/T374473) (owner: 10Fabfur) [19:35:31] (03CR) 10Fabfur: cache:haproxy: introduce extended logging on socket for haproxykafka (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071915 (https://phabricator.wikimedia.org/T374473) (owner: 10Fabfur) [19:36:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T371742)', diff saved to https://phabricator.wikimedia.org/P68851 and previous config saved to /var/cache/conftool/dbconfig/20240910-193622-ladsgroup.json [19:36:29] T371742: Change page.page_links_updated to fixed-length 
timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [19:45:34] (03CR) 10Jdlrobson: [C:04-1] Configure QuickSurvey for Web empty search state experiments (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071890 (https://phabricator.wikimedia.org/T373039) (owner: 10Jdrewniak) [19:47:41] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [19:47:46] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [19:50:21] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [19:50:25] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [19:51:25] <_Gerges> jouncebot: next [19:51:25] In 0 hour(s) and 8 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T2000) [19:51:30] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [19:51:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P68852 and previous config saved to /var/cache/conftool/dbconfig/20240910-195130-ladsgroup.json [19:51:36] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [19:53:31] (03PS1) 10Scott French: mw-debug: add initial "next" release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071945 (https://phabricator.wikimedia.org/T372604) [19:54:47] !log removing 6 files for legal compliance [19:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:18] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Create PCC Puppet 8 nodes - https://phabricator.wikimedia.org/T374495 (10jhathaway) 03NEW [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: UTC late backport window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240910T2000). Please do the needful.
[20:00:05] physikerwelt, _Gerges, and jan_drewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:10] <_Gerges> I will be there in half an hour.
[20:00:15] o/
[20:00:25] SRE, LDAP-Access-Requests, WMF-NDA-Requests: Grant Access to logstash for Jeremyb - https://phabricator.wikimedia.org/T374406#10135689 (jeremyb) >>! In T374406#10133385, @Ladsgroup wrote: > It first needs a sponsor from a wmf staff. ok, I had some ideas of people to ask but I'm still trying to figur...
[20:00:55] I am here, ready to test
[20:02:21] I can deploy
[20:03:22] kindrobot: do I need to rebase?
[20:04:22] physikerwelt: yes
[20:04:38] (PS2) Physikerwelt: Enable native MathML by default on group0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1071037 (https://phabricator.wikimedia.org/T373703)
[20:05:09] * cjming thanks kindrobot
[20:05:11] jan_drewniak: I'm noticing a -1 on your patch https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1071890
[20:05:45] (PS4) Jdrewniak: Configure QuickSurvey for Web empty search state experiments [mediawiki-config] - https://gerrit.wikimedia.org/r/1071890 (https://phabricator.wikimedia.org/T373039)
[20:05:52] kindrobot: I just noticed that too, one sec
[20:06:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P68853 and previous config saved to /var/cache/conftool/dbconfig/20240910-200637-ladsgroup.json
[20:06:41] (CR) Jdrewniak: Configure QuickSurvey for Web empty search state experiments (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1071890 (https://phabricator.wikimedia.org/T373039) (owner: Jdrewniak)
[20:07:13] (PS2) Scott French: mw-debug: add initial "next" release [deployment-charts] - https://gerrit.wikimedia.org/r/1071945 (https://phabricator.wikimedia.org/T372604)
[20:07:15] kindrobot: ok just uploaded a fix, should be good to go now
[20:09:06] OK, great. I'm going to deploy jan_drewniak and physikerwelt first, and wait to hear back from _Gerges
[20:11:05] (CR) TrainBranchBot: [C:+2] "Approved by kindrobot@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1071037 (https://phabricator.wikimedia.org/T373703) (owner: Physikerwelt)
[20:11:09] (CR) TrainBranchBot: [C:+2] "Approved by kindrobot@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1071890 (https://phabricator.wikimedia.org/T373039) (owner: Jdrewniak)
[20:11:21] (Merged) jenkins-bot: Enable native MathML by default on group0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1071037 (https://phabricator.wikimedia.org/T373703) (owner: Physikerwelt)
[20:11:38] jan_drewniak: can you please rebase your patch?
[20:12:02] (PS5) Jdrewniak: Configure QuickSurvey for Web empty search state experiments [mediawiki-config] - https://gerrit.wikimedia.org/r/1071890 (https://phabricator.wikimedia.org/T373039)
[20:12:16] kindrobot: rebased
[20:12:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:12:36] (CR) TrainBranchBot: "Approved by kindrobot@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1071890 (https://phabricator.wikimedia.org/T373039) (owner: Jdrewniak)
[20:12:41] (CR) Vgutierrez: [C:+1] cache:haproxy: introduce extended logging on socket for haproxykafka [puppet] - https://gerrit.wikimedia.org/r/1071915 (https://phabricator.wikimedia.org/T374473) (owner: Fabfur)
[20:13:15] (Merged) jenkins-bot: Configure QuickSurvey for Web empty search state experiments [mediawiki-config] - https://gerrit.wikimedia.org/r/1071890 (https://phabricator.wikimedia.org/T373039) (owner: Jdrewniak)
[20:13:34] !log kindrobot@deploy1003 Started scap sync-world: Backport for [[gerrit:1071037|Enable native MathML by default on group0 (T373703)]], [[gerrit:1071890|Configure QuickSurvey for Web empty search state experiments (T373039)]]
[20:13:50] T373703: Enable native mathml rendering by default on group0 and test wikis in production - https://phabricator.wikimedia.org/T373703
[20:13:51] T373039: Set up quicksurveys for UI and non-UI experiments - https://phabricator.wikimedia.org/T373039
[20:15:46] !log kindrobot@deploy1003 kindrobot, jdrewniak, physikerwelt: Backport for [[gerrit:1071037|Enable native MathML by default on group0 (T373703)]], [[gerrit:1071890|Configure QuickSurvey for Web empty search state experiments (T373039)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:15:59] jan_drewniak physikerwelt : please test
[20:17:42] kindrobot: looks good on my end
[20:18:00] works after using action=purge (which is ok)
[20:18:08] <_Gerges> I here
[20:18:30] syncing
[20:18:33] !log kindrobot@deploy1003 kindrobot, jdrewniak, physikerwelt: Continuing with sync
[20:21:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T371742)', diff saved to https://phabricator.wikimedia.org/P68854 and previous config saved to /var/cache/conftool/dbconfig/20240910-202145-ladsgroup.json
[20:21:47] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1206.eqiad.wmnet with reason: Maintenance
[20:21:59] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[20:22:00] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1206.eqiad.wmnet with reason: Maintenance
[20:22:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T371742)', diff saved to https://phabricator.wikimedia.org/P68855 and previous config saved to /var/cache/conftool/dbconfig/20240910-202207-ladsgroup.json
[20:23:12] !log kindrobot@deploy1003 Finished scap sync-world: Backport for [[gerrit:1071037|Enable native MathML by default on group0 (T373703)]], [[gerrit:1071890|Configure QuickSurvey for Web empty search state experiments (T373039)]] (duration: 09m 37s)
[20:23:26] T373703: Enable native mathml rendering by default on group0 and test wikis in production - https://phabricator.wikimedia.org/T373703
[20:23:26] T373039: Set up quicksurveys for UI and non-UI experiments - https://phabricator.wikimedia.org/T373039
[20:24:50] thank you, works well
[20:25:32] _Gerges: I'll deploy yours next
[20:25:49] <_Gerges> Ok
[20:27:04] (CR) TrainBranchBot: [C:+2] "Approved by kindrobot@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1071842 (https://phabricator.wikimedia.org/T374430) (owner: GergesShamon)
[20:27:04] (CR) TrainBranchBot: [C:+2] "Approved by kindrobot@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1071888 (https://phabricator.wikimedia.org/T374430) (owner: GergesShamon)
[20:27:46] (Merged) jenkins-bot: [arwiki] Change the wordmark and the tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1071842 (https://phabricator.wikimedia.org/T374430) (owner: GergesShamon)
[20:27:49] (Merged) jenkins-bot: [arwiki] change Wikipedia logo [mediawiki-config] - https://gerrit.wikimedia.org/r/1071888 (https://phabricator.wikimedia.org/T374430) (owner: GergesShamon)
[20:28:06] !log kindrobot@deploy1003 Started scap sync-world: Backport for [[gerrit:1071842|[arwiki] Change the wordmark and the tagline (T374430)]], [[gerrit:1071888|[arwiki] change Wikipedia logo (T374430)]]
[20:28:10] T374430: Change logos in Arabic Wikipedia - https://phabricator.wikimedia.org/T374430
[20:30:19] !log kindrobot@deploy1003 gergesshamon, kindrobot: Backport for [[gerrit:1071842|[arwiki] Change the wordmark and the tagline (T374430)]], [[gerrit:1071888|[arwiki] change Wikipedia logo (T374430)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:30:33] _Gerges: please test
[20:31:20] !log removing 9 files for legal compliance
[20:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:13] <_Gerges> On which server test?
[20:35:31] (CR) Bking: statistics hosts: enable CPUWeight (cgroupsv2) (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: Bking)
[20:38:23] The test servers
[20:39:34] <_Gerges> Ok
[20:40:28] https://wikitech.wikimedia.org/wiki/WikimediaDebug#Staging_changes
[20:41:33] <_Gerges> Yes, I used WikimediaDebug
[20:41:48] Great
[20:41:53] <_Gerges> Page is very slow to load
[20:45:02] _Gerges: any update?
[20:46:06] (PS1) Ladsgroup: pc5: Enable notification [puppet] - https://gerrit.wikimedia.org/r/1071955 (https://phabricator.wikimedia.org/T374355)
[20:46:11] <_Gerges> https://ar.wikipedia.org/static/images/mobile/copyright/wikipedia-tagline-ar.svg
[20:46:47] <_Gerges> It appears that the svg file has been updated, but on the Wikipedia pages it has not, despite the cache being cleared.
[20:47:24] (PS7) Jdlrobson: Roll out appearance menu and font size change to sister projects [mediawiki-config] - https://gerrit.wikimedia.org/r/1059393 (https://phabricator.wikimedia.org/T371020)
[20:47:38] Hmm
[20:47:50] The only thing I can think of is WikimediaDebug isn't working right
[20:49:19] (CR) CDanis: [C:+1] aux-services: update Docker images for Jaeger [deployment-charts] - https://gerrit.wikimedia.org/r/1071872 (https://phabricator.wikimedia.org/T369491) (owner: Elukey)
[20:49:52] <_Gerges> Can you try it yourself, it might be a problem with my browser?
[20:51:50] What should the logo look like?
[20:51:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[20:53:55] <_Gerges> You will find the word "الحُرة" encyclopedia instead of "الحرة"
[20:54:45] kindrobot: I'd probably just deploy it, and then run the urls through purgeList.php
[20:55:21] +1 on that
[20:55:22] _Gerges: I see it. Perhaps you didn't purge your cache or do a "hard reload"
[20:55:23] echo "https://ar.wikipedia.org/static/images/mobile/copyright/wikipedia-tagline-ar.svg" | mwscript purgeList.php --wiki=enwiki
[20:55:28] etc
[20:55:59] Ah, OK. This busts the cache Reedy ?
[20:56:07] it invalidates the CDN caches, yeah
[20:56:11] <_Gerges> I deleted cache from Wikipedia and from the browser
[20:56:18] obviously doesn't fix browsers doing odd stuff
[20:56:23] but should indeed fix most issues
[20:56:45] OK. Syncing
[20:57:19] !log kindrobot@deploy1003 gergesshamon, kindrobot: Continuing with sync
[21:01:23] (PS1) Scott French: mediawiki: parameterize PHP version via chart value [deployment-charts] - https://gerrit.wikimedia.org/r/1071957 (https://phabricator.wikimedia.org/T372604)
[21:02:24] !log kindrobot@deploy1003 Finished scap sync-world: Backport for [[gerrit:1071842|[arwiki] Change the wordmark and the tagline (T374430)]], [[gerrit:1071888|[arwiki] change Wikipedia logo (T374430)]] (duration: 34m 17s)
[21:02:27] T374430: Change logos in Arabic Wikipedia - https://phabricator.wikimedia.org/T374430
[21:03:35] Reedy: w.r.t the purgeList command, should --wiki be set to arwiki ?
[21:03:42] doesn't matter
[21:04:06] we just don't have a good way to mark scripts as wiki agnostic as far as mwscript is concerned (hence many just use aawiki)
[21:04:20] Okay, thanks
[21:06:27] (PS1) JHathaway: puppet8: replace to_pson with to_json [puppet] - https://gerrit.wikimedia.org/r/1071959 (https://phabricator.wikimedia.org/T372667)
[21:06:42] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1071959 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway)
[21:07:29] <_Gerges> Now the image has changed, I think the problem is from WikimediaDebug.
[21:10:02] (PS1) JHathaway: puppet8: replace to_pson with to_json [puppet] - https://gerrit.wikimedia.org/r/1071960 (https://phabricator.wikimedia.org/T372667)
[21:12:08] (CR) Dzahn: [V:+1 C:+2] requesttracker: limit envoy srange to CACHES [puppet] - https://gerrit.wikimedia.org/r/1071925 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[21:12:13] (PS2) Dzahn: requesttracker: limit envoy srange to CACHES [puppet] - https://gerrit.wikimedia.org/r/1071925 (https://phabricator.wikimedia.org/T370677)
[21:12:32] (CR) Ladsgroup: [C:+2] pc5: Enable notification [puppet] - https://gerrit.wikimedia.org/r/1071955 (https://phabricator.wikimedia.org/T374355) (owner: Ladsgroup)
[21:13:15] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1071960 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway)
[21:13:39] ops-codfw, SRE, DC-Ops, fundraising-tech-ops: Q#:rack/setup/install payments200[456] - https://phabricator.wikimedia.org/T369942#10135868 (Dwisehaupt) Open→Resolved All hosts built and in service.
[21:14:26] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1059393 (https://phabricator.wikimedia.org/T371020) (owner: Jdlrobson)
[21:14:28] !log purged AR wiki logos and taglines
[21:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:14:37] !log finish UTC late backport window
[21:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:14:46] Thank you everyone for your help <3
[21:15:38] (CR) Scott French: [C:+1] errorpage: Include request ID early in HTML source [puppet] - https://gerrit.wikimedia.org/r/1071715 (https://phabricator.wikimedia.org/T291192) (owner: Lucas Werkmeister)
[21:15:45] <_Gerges> @kindrobot: If possible you can also run a script on https://ar.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-ar.svg
[21:16:20] already did _Gerges
[21:17:22] (CR) Scott French: "Thank you both for the reviews!" [cookbooks] - https://gerrit.wikimedia.org/r/1068896 (https://phabricator.wikimedia.org/T328908) (owner: Scott French)
[21:17:24] <_Gerges> Are the taglines updated and the wordmark not updated?
[21:17:25] (CR) Scott French: [C:+2] sre.switchdc.mediawiki: migrate to the class API [cookbooks] - https://gerrit.wikimedia.org/r/1068896 (https://phabricator.wikimedia.org/T328908) (owner: Scott French)
[21:17:53] _Gerges: https://phabricator.wikimedia.org/P68856
[21:18:08] (CR) Dzahn: "Ah, I see the override now. So it sets it to the slave _because it's in codfw_.
This seems all good, until the day we switch between data " [puppet] - 10https://gerrit.wikimedia.org/r/1070908 (https://phabricator.wikimedia.org/T373420) (owner: 10AOkoth) [21:19:28] <_Gerges> Thanks [21:20:45] (03CR) 10JHathaway: [C:03+2] puppet8: replace to_pson with to_json [puppet] - 10https://gerrit.wikimedia.org/r/1071959 (https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway) [21:20:50] (03CR) 10JHathaway: [C:03+2] puppet8: replace to_pson with to_json [puppet] - 10https://gerrit.wikimedia.org/r/1071960 (https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway) [21:22:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T371742)', diff saved to https://phabricator.wikimedia.org/P68857 and previous config saved to /var/cache/conftool/dbconfig/20240910-212205-ladsgroup.json [21:22:09] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [21:23:53] (03CR) 10Dzahn: [V:03+2 C:03+2] requesttracker: limit envoy srange to CACHES [puppet] - 10https://gerrit.wikimedia.org/r/1071925 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [21:26:02] (03PS1) 10Stoyofuku-wmf: Turn off feature flag to move donate link everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071961 (https://phabricator.wikimedia.org/T373585) [21:27:22] (03PS1) 10JHathaway: puppet8: replace to_pson with to_json [puppet] - 10https://gerrit.wikimedia.org/r/1071962 (https://phabricator.wikimedia.org/T372667) [21:27:33] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071962 (https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway) [21:29:03] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: migrate to the class API [cookbooks] - 10https://gerrit.wikimedia.org/r/1068896 (https://phabricator.wikimedia.org/T328908) (owner: 10Scott French) [21:32:10] (03CR) 10Ssingh: varnish: Conditionally monitor vcl reloads (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071935 (owner: 10BCornwall) [21:37:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P68858 and previous config saved to /var/cache/conftool/dbconfig/20240910-213712-ladsgroup.json [21:37:33] !log removing 8 files for legal compliance [21:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:25] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Drop PSON support - https://phabricator.wikimedia.org/T372667#10135909 (10jhathaway) [21:39:46] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10135906 (10cmooney) 05Open→03Resolved a:03cmooney [21:40:35] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Drop PSON support - https://phabricator.wikimedia.org/T372667#10135912 (10jhathaway) [21:43:37] !log removing 1 file for legal compliance [21:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:43] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frban2002 - https://phabricator.wikimedia.org/T369931#10135916 (10Dwisehaupt) 05Open→03Resolved a:03Dwisehaupt Host built and configured. 
[21:49:55] (03PS1) 10Dzahn: rt: WIP test [puppet] - 10https://gerrit.wikimedia.org/r/1071963 [21:52:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P68859 and previous config saved to /var/cache/conftool/dbconfig/20240910-215219-ladsgroup.json [21:53:08] (03Abandoned) 10Dzahn: rt: WIP test [puppet] - 10https://gerrit.wikimedia.org/r/1071963 (owner: 10Dzahn) [21:56:01] (03CR) 10Cwhite: [C:03+2] logstash: put logging-sd100[1-4] in service [puppet] - 10https://gerrit.wikimedia.org/r/1070352 (https://phabricator.wikimedia.org/T373651) (owner: 10Cwhite) [21:58:35] (03CR) 10Dzahn: [C:03+2] phabricator: switch phab2002 to nftables as firewall provider [puppet] - 10https://gerrit.wikimedia.org/r/1071923 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [21:59:03] (03PS1) 10Jasmine: icinga: adding jasmine to icinga authorizations [puppet] - 10https://gerrit.wikimedia.org/r/1071964 [22:03:52] (03CR) 10Dzahn: "I think you need to create an Icinga contact in the private puppet repo in modules/secret/secrets/nagios/contacts.cfg first and make it ma" [puppet] - 10https://gerrit.wikimedia.org/r/1071964 (owner: 10Jasmine) [22:05:46] (03CR) 10Jdlrobson: [C:03+1] Turn off feature flag to move donate link everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071961 (https://phabricator.wikimedia.org/T373585) (owner: 10Stoyofuku-wmf) [22:07:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T371742)', diff saved to https://phabricator.wikimedia.org/P68860 and previous config saved to /var/cache/conftool/dbconfig/20240910-220726-ladsgroup.json [22:07:29] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1207.eqiad.wmnet with reason: Maintenance [22:07:32] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [22:07:42] !log 
ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1207.eqiad.wmnet with reason: Maintenance [22:07:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1207 (T371742)', diff saved to https://phabricator.wikimedia.org/P68861 and previous config saved to /var/cache/conftool/dbconfig/20240910-220748-ladsgroup.json [22:08:35] (03CR) 10Dzahn: [V:03+2 C:03+2] "10_envoy_tls_termination.nft doesn't actually contain the CACHES. And wouldn't we have to pass this as src_sets to firewall::service in pr" [puppet] - 10https://gerrit.wikimedia.org/r/1071925 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [22:09:30] (03CR) 10Scott French: "Thank you!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1068897 (https://phabricator.wikimedia.org/T330273) (owner: 10Scott French) [22:10:12] (03CR) 10Dzahn: [C:03+2] "[puppetdb1003:/home/jmm] $ python3 nftables-compat-check.py phab2002.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/1071923 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [22:10:18] (03PS2) 10Jasmine: icinga: adding jasmine to icinga authorizations [puppet] - 10https://gerrit.wikimedia.org/r/1071964 [22:10:58] (03PS6) 10Scott French: sre.switchdc.mediawiki: add --task-id argument [cookbooks] - 10https://gerrit.wikimedia.org/r/1068897 (https://phabricator.wikimedia.org/T330273) [22:10:58] (03PS6) 10Scott French: sre.switchdc.mediawiki: use admin reason in puppet disable [cookbooks] - 10https://gerrit.wikimedia.org/r/1068898 (https://phabricator.wikimedia.org/T330273) [22:10:58] (03PS6) 10Scott French: sre.switchdc.mediawiki: record RO start/end in task [cookbooks] - 10https://gerrit.wikimedia.org/r/1068899 (https://phabricator.wikimedia.org/T330273) [22:20:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 
https://gerrit.wikimedia.org/r/1071961 (https://phabricator.wikimedia.org/T373585) (owner: Stoyofuku-wmf)
[22:20:57] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[22:24:56] (CR) Scott French: [C:+2] sre.switchdc.mediawiki: add --task-id argument [cookbooks] - https://gerrit.wikimedia.org/r/1068897 (https://phabricator.wikimedia.org/T330273) (owner: Scott French)
[22:34:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[22:37:48] (Merged) jenkins-bot: sre.switchdc.mediawiki: add --task-id argument [cookbooks] - https://gerrit.wikimedia.org/r/1068897 (https://phabricator.wikimedia.org/T330273) (owner: Scott French)
[22:39:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[22:42:06] (CR) Scott French: [C:+2] sre.switchdc.mediawiki: use admin reason in puppet disable [cookbooks] - https://gerrit.wikimedia.org/r/1068898 (https://phabricator.wikimedia.org/T330273) (owner: Scott French)
[22:51:14] (CR) Jdlrobson: [C:+1] Turn off feature flag to move donate link everywhere (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1071961 (https://phabricator.wikimedia.org/T373585) (owner: Stoyofuku-wmf)
[22:51:51] (PS1) Dwisehaupt: icinga: Add frlog2002 to monitoring [puppet] - https://gerrit.wikimedia.org/r/1071970 (https://phabricator.wikimedia.org/T372933)
[22:53:31] (CR) Stoyofuku-wmf: Turn off feature flag to move donate link everywhere (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1071961 (https://phabricator.wikimedia.org/T373585) (owner: Stoyofuku-wmf)
[22:54:38] (Merged) jenkins-bot: sre.switchdc.mediawiki: use admin reason in puppet disable [cookbooks] - https://gerrit.wikimedia.org/r/1068898 (https://phabricator.wikimedia.org/T330273) (owner: Scott French)
[22:58:00] (CR) Scott French: [C:+2] sre.switchdc.mediawiki: record RO start/end in task [cookbooks] - https://gerrit.wikimedia.org/r/1068899 (https://phabricator.wikimedia.org/T330273) (owner: Scott French)
[23:05:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T371742)', diff saved to https://phabricator.wikimedia.org/P68862 and previous config saved to /var/cache/conftool/dbconfig/20240910-230518-ladsgroup.json
[23:05:22] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[23:09:34] (Merged) jenkins-bot: sre.switchdc.mediawiki: record RO start/end in task [cookbooks] - https://gerrit.wikimedia.org/r/1068899 (https://phabricator.wikimedia.org/T330273) (owner: Scott French)
[23:15:21] (CR) Cwhite: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1070920 (https://phabricator.wikimedia.org/T372411) (owner: Filippo Giunchedi)
[23:17:04] SRE, Data-Persistence, serviceops, Datacenter-Switchover: Migrate sre.switchdc.mediawiki to spicerack class API - https://phabricator.wikimedia.org/T328908#10136091 (Scott_French) Open→Resolved
[23:19:39] (CR) Cwhite: [C:+1] "Exported resource should clean itself up." [puppet] - https://gerrit.wikimedia.org/r/1063986 (https://phabricator.wikimedia.org/T371083) (owner: Tiziano Fogli)
[23:20:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P68863 and previous config saved to /var/cache/conftool/dbconfig/20240910-232026-ladsgroup.json
[23:22:14] (CR) Cwhite: "Were the two commits in the relation chain meant to be one commit?" [puppet] - https://gerrit.wikimedia.org/r/1071701 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[23:31:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[23:35:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P68864 and previous config saved to /var/cache/conftool/dbconfig/20240910-233533-ladsgroup.json
[23:36:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[23:38:27] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1071978
[23:38:27] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1071978 (owner: TrainBranchBot)
[23:46:56] (CR) Jdlrobson: [C:+1] Turn off feature flag to move donate link everywhere (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1071961 (https://phabricator.wikimedia.org/T373585) (owner: Stoyofuku-wmf)
[23:50:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T371742)', diff saved to https://phabricator.wikimedia.org/P68865 and previous config saved to /var/cache/conftool/dbconfig/20240910-235040-ladsgroup.json
[23:50:43] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1218.eqiad.wmnet with reason: Maintenance
[23:50:44] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[23:50:56] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1218.eqiad.wmnet with reason: Maintenance
[23:51:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T371742)', diff saved to https://phabricator.wikimedia.org/P68866 and previous config saved to /var/cache/conftool/dbconfig/20240910-235102-ladsgroup.json