[00:01:00] SRE, Infrastructure-Foundations, netops: Upgrade core routers to Junos 22.4R3 - https://phabricator.wikimedia.org/T364092#10190074 (Papaul)
[00:02:11] (PS1) Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1076868
[00:02:16] (PS1) Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1076870
[00:02:21] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1076871
[00:02:50] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10190077 (Papaul) Junos upgrade complete for the system Icinga checks back green. All good on the router, site can be pool back Thanks
[00:02:51] PROBLEM - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[00:04:51] RECOVERY - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[00:04:57] FIRING: [6x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:09:04] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1076858 (owner: TrainBranchBot)
[00:49:02] (Abandoned) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1076870 (owner: Ncmonitor)
[00:49:29] (Abandoned) BCornwall: DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1076868 (owner: Ncmonitor)
[00:56:29] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[01:08:40] (PS1) TrainBranchBot: Branch commit for wmf/1.43.0-wmf.25 [core] (wmf/1.43.0-wmf.25) - https://gerrit.wikimedia.org/r/1076883 (https://phabricator.wikimedia.org/T375656)
[01:08:44] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.43.0-wmf.25 [core] (wmf/1.43.0-wmf.25) - https://gerrit.wikimedia.org/r/1076883 (https://phabricator.wikimedia.org/T375656) (owner: TrainBranchBot)
[01:09:09] (CR) Dzahn: [C:+1] "This has manager approval on the ticket now." [puppet] - https://gerrit.wikimedia.org/r/1069175 (https://phabricator.wikimedia.org/T373666) (owner: Ssingh)
[01:12:53] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for zoe - https://phabricator.wikimedia.org/T373666#10190163 (Dzahn) Stalled→Open Thank you! Unstalling ticket.
[01:19:29] (PS1) Dzahn: requesttracker: set firewall_srange to [] in Hiera [puppet] - https://gerrit.wikimedia.org/r/1076885
[01:19:46] (CR) Dzahn: [C:+2] requesttracker: set firewall_srange to [] in Hiera [puppet] - https://gerrit.wikimedia.org/r/1076885 (owner: Dzahn)
[01:23:35] (CR) Dzahn: [C:+2] "testing result when setting srange to [] while setting a src_set -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/1076885" [puppet] - https://gerrit.wikimedia.org/r/1076846 (owner: Dzahn)
[01:24:33] (CR) Dzahn: [C:+2] "This fixes the use of ferm::service -> "Could not find resource 'File[/etc/ferm/conf.d]'" but it causes a "Duplicate declaration: Firewall" [puppet] - https://gerrit.wikimedia.org/r/1076885 (owner: Dzahn)
[01:26:30] (CR) Dzahn: [C:+1] envoy: Add support for passing an array of sets to the firewall service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1072690 (owner: Muehlenhoff)
[01:33:33] (PS1) Dzahn: requesttracker: comment out firewall_src_sets [puppet] - https://gerrit.wikimedia.org/r/1076888
[01:33:53] (CR) Dzahn: [C:+2] requesttracker: comment out firewall_src_sets [puppet] - https://gerrit.wikimedia.org/r/1076888 (owner: Dzahn)
[01:38:20] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[01:44:13] (Merged) jenkins-bot: Branch commit for wmf/1.43.0-wmf.25 [core] (wmf/1.43.0-wmf.25) - https://gerrit.wikimedia.org/r/1076883 (https://phabricator.wikimedia.org/T375656) (owner: TrainBranchBot)
[01:48:31] (PS1) Dzahn: tlsproxy::envoy: do not use ferm::service if firewall_src_sets is set [puppet] - https://gerrit.wikimedia.org/r/1076889
[01:51:39] (CR) Dzahn: "@mmuhlenhoff@wikimedia.org Either this or we just give the resource a different name when it's used with src_sets? Unsure right now if we " [puppet] - https://gerrit.wikimedia.org/r/1076889 (owner: Dzahn)
[01:53:53] (PS2) Dzahn: tlsproxy::envoy: do not use ferm::service if firewall_src_sets is set [puppet] - https://gerrit.wikimedia.org/r/1076889
[01:54:21] (CR) Dzahn: [C:+1] envoy: Add support for passing an array of sets to the firewall service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1072690 (owner: Muehlenhoff)
[01:58:29] ops-eqiad, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T376094 (phaultfinder) NEW
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241001T0200)
[02:08:51] PROBLEM - Hadoop NodeManager on an-worker1177 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:21:25] PROBLEM - Hadoop NodeManager on an-worker1176 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:22:51] RECOVERY - Hadoop NodeManager on an-worker1177 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:35:25] RECOVERY - Hadoop NodeManager on an-worker1176 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:38:15] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:51:33] (PS1) RLazarus: deployment_server: Print logs command when mwscript-k8s --attach fails [puppet] - https://gerrit.wikimedia.org/r/1076893 (https://phabricator.wikimedia.org/T369142)
[02:53:16] (CR) CI reject: [V:-1] deployment_server: Print logs command when mwscript-k8s --attach fails [puppet] - https://gerrit.wikimedia.org/r/1076893 (https://phabricator.wikimedia.org/T369142) (owner: RLazarus)
[02:53:29] (CR) BPirkle: REST: Make experimental endpoints available on beta and testwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1076058 (https://phabricator.wikimedia.org/T375512) (owner: BPirkle)
[02:54:33] (PS2) RLazarus: deployment_server: Print logs command when mwscript-k8s --attach fails [puppet] - https://gerrit.wikimedia.org/r/1076893 (https://phabricator.wikimedia.org/T369142)
[02:58:15] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241001T0300)
[03:01:38] (PS1) TrainBranchBot: testwikis to 1.43.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1076894 (https://phabricator.wikimedia.org/T375656)
[03:01:39] (CR) TrainBranchBot: [C:+2] testwikis to 1.43.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1076894 (https://phabricator.wikimedia.org/T375656) (owner: TrainBranchBot)
[03:02:21] (Merged) jenkins-bot: testwikis to 1.43.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1076894 (https://phabricator.wikimedia.org/T375656) (owner: TrainBranchBot)
[03:02:44] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.43.0-wmf.25 refs T375656
[03:02:51] T375656: 1.43.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T375656
[03:04:57] FIRING: [6x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:06:35] PROBLEM - mysqld processes on db1246 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[03:07:19] PROBLEM - MariaDB Replica Lag: s2 on db1246 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:07:29] PROBLEM - MariaDB Replica IO: s2 on db1246 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:07:29] PROBLEM - MariaDB Replica SQL: s2 on db1246 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:18:07] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[03:28:07] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[03:51:21] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.43.0-wmf.25 refs T375656 (duration: 48m 36s)
[03:51:27] T375656: 1.43.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T375656
[04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241001T0400)
[04:01:00] !log mwpresync@deploy2002 Pruned MediaWiki: 1.43.0-wmf.22 (duration: 00m 58s)
[04:05:55] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 54 probes of 784 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:10:55] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 34 probes of 784 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:26:35] PROBLEM - Host elastic1064 is DOWN: PING CRITICAL - Packet loss = 100%
[04:56:29] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[05:07:51] PROBLEM - Hadoop NodeManager on an-worker1177 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:15:48] (PS4) Ryan Kemper: wdqs-categories: introduce VM for testing [puppet] - https://gerrit.wikimedia.org/r/1076841 (https://phabricator.wikimedia.org/T375687) (owner: Bking)
[05:20:25] PROBLEM - Hadoop NodeManager on an-worker1176 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:23:51] RECOVERY - Hadoop NodeManager on an-worker1177 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:28:47] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:35:25] RECOVERY - Hadoop NodeManager on an-worker1176 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:37:44] (PS5) Ryan Kemper: wdqs-categories: introduce VM for testing [puppet] - https://gerrit.wikimedia.org/r/1076841 (https://phabricator.wikimedia.org/T375687) (owner: Bking)
[05:37:58] (CR) Ryan Kemper: wdqs-categories: introduce VM for testing (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1076841 (https://phabricator.wikimedia.org/T375687) (owner: Bking)
[05:38:20] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241001T0600)
[06:00:05] marostegui, Amir1, and arnaudb: Time to snap out of that daydream and deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241001T0600).
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter2006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter2006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:10:11] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8997 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:42:30] (PS1) Muehlenhoff: Remove access for lbowmaker [puppet] - https://gerrit.wikimedia.org/r/1076898
[06:43:13] (CR) Joal: [C:+1] "Removing my -1 as the HDFS space problem has been raised. The mitigation plan is to reclaim some space where possible, monitor growing dat" [puppet] - https://gerrit.wikimedia.org/r/1076563 (https://phabricator.wikimedia.org/T373630) (owner: Aqu)
[06:44:04] !log cr3-ulsfo> request vmhost snapshot - T375345
[06:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:14] T375345: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345
[06:46:47] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Luke Bowmaker out of all services on: 1497 hosts
[06:47:31] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Luke Bowmaker out of all services on: 1497 hosts
[06:47:40] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Luke Bowmaker out of all services on: 705 hosts
[06:47:56] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Luke Bowmaker out of all services on: 705 hosts
[06:55:52] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10190475 (ayounsi) Open→Resolved Thanks, all is good now !
[07:00:05] Amir1 and Urbanecm: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241001T0700)
[07:00:05] abijeet and Melos: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:22] I can deploy abijeet's patch.
[07:00:41] thanks, kart_
[07:01:01] (PS1) Muehlenhoff: Remove puppetserver1002 from active puppet servers [dns] - https://gerrit.wikimedia.org/r/1076899 (https://phabricator.wikimedia.org/T376058)
[07:01:14] * Melos around
[07:01:54] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1075189 (https://phabricator.wikimedia.org/T372460) (owner: Abijeet Patro)
[07:02:39] (Merged) jenkins-bot: Enable translation settings banner for Test wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1075189 (https://phabricator.wikimedia.org/T372460) (owner: Abijeet Patro)
[07:03:08] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1075189|Enable translation settings banner for Test wikipedia (T372460)]]
[07:03:15] T372460: Enable translation setting feature including banner on production wikis - https://phabricator.wikimedia.org/T372460
[07:05:12] FIRING: [4x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:09:01] !log kartik@deploy2002 kartik, abi: Backport for [[gerrit:1075189|Enable translation settings banner for Test wikipedia (T372460)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:09:07] T372460: Enable translation setting feature including banner on production wikis - https://phabricator.wikimedia.org/T372460
[07:09:45] abijeet: patch ready for testing on mwdebug servers.
[07:09:51] kart_, ok, checking
[07:13:43] kart_, looks ok
[07:14:39] cool. going ahead.
[07:14:43] !log kartik@deploy2002 kartik, abi: Continuing with sync
[07:21:24] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1075189|Enable translation settings banner for Test wikipedia (T372460)]] (duration: 18m 15s)
[07:21:30] T372460: Enable translation setting feature including banner on production wikis - https://phabricator.wikimedia.org/T372460
[07:21:32] abijeet: done
[07:21:39] Melos: around?
[07:21:40] kart_, thanks, will do another check
[07:22:03] kart_: I'm here
[07:22:29] OK. Let's deploy your patch. Will you be able to test your patch once deployed on the testservers?
[07:23:01] Yes, I can
[07:23:08] cool. Going ahead.
[07:23:22] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1076661 (https://phabricator.wikimedia.org/T375979) (owner: Melos)
[07:24:04] (PS1) Hashar: deployment server: ease replying to jobs emails [puppet] - https://gerrit.wikimedia.org/r/1076904
[07:24:06] (Merged) jenkins-bot: Add namespace aliases for scn.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1076661 (https://phabricator.wikimedia.org/T375979) (owner: Melos)
[07:24:21] (PS2) KartikMistry: Section Translation: Add mos, kde and rsk Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1076559 (https://phabricator.wikimedia.org/T375017)
[07:24:23] (CR) CI reject: [V:-1] deployment server: ease replying to jobs emails [puppet] - https://gerrit.wikimedia.org/r/1076904 (owner: Hashar)
[07:24:25] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1076904 (owner: Hashar)
[07:24:31] NO WAY
[07:24:33] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1076661|Add namespace aliases for scn.wikipedia (T375979)]]
[07:24:39] T375979: Namespace aliases for scn.wikipedia - https://phabricator.wikimedia.org/T375979
[07:25:34] (PS2) Hashar: deployment server: ease replying to jobs emails [puppet] - https://gerrit.wikimedia.org/r/1076904
[07:25:44] i swear that at some point I was using leading commas instead of trailing ones
[07:25:50] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 03 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1076559 (https://phabricator.wikimedia.org/T375017) (owner: KartikMistry)
[07:26:13] maybe that was during Turbo Pascal era
[07:26:23] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1076904 (owner: Hashar)
[07:26:47] !log kartik@deploy2002 kartik, melos: Backport for [[gerrit:1076661|Add namespace aliases for scn.wikipedia (T375979)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:27:08] hashar: are we rewriting mediawiki in turbo pascal?
[07:27:36] Melos: you can test your change on mwdebug servers. Let me know if everything is OK.
[07:28:00] Checking...
[07:28:47] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 140 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:29:36] (PS1) Muehlenhoff: profile::envoy: When adding rules based on nftables check for empty ferm rules [puppet] - https://gerrit.wikimedia.org/r/1076905
[07:29:48] kart_: everything is fine
[07:29:58] Awesome. Deploying!
[07:30:03] !log kartik@deploy2002 kartik, melos: Continuing with sync
[07:30:30] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1076905 (owner: Muehlenhoff)
[07:34:38] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1076661|Add namespace aliases for scn.wikipedia (T375979)]] (duration: 10m 05s)
[07:34:44] T375979: Namespace aliases for scn.wikipedia - https://phabricator.wikimedia.org/T375979
[07:35:55] Melos: done.
[07:36:15] SRE, Infrastructure-Foundations, netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10190518 (ayounsi)
[07:36:21] kart_: Thank you. It works fine also in production
[07:36:21] (PS3) Hashar: deployment server: ease replying to jobs emails [puppet] - https://gerrit.wikimedia.org/r/1076904
[07:36:21] (PS1) Hashar: systemd::timer::job: relax send_mail_from parameter [puppet] - https://gerrit.wikimedia.org/r/1076910
[07:39:24] (CR) CI reject: [V:-1] systemd::timer::job: relax send_mail_from parameter [puppet] - https://gerrit.wikimedia.org/r/1076910 (owner: Hashar)
[07:39:33] (CR) Hashar: "Ben you have introduced the `send_mail_from` as an Email and that prevents me from setting it to `MediaWiki Train
!log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1246.eqiad.wmnet with reason: T374215
[07:40:03] T374215: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215
[07:40:26] (CR) Hashar: Allow systemd::timer::job to send from a custom address (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1007577 (https://phabricator.wikimedia.org/T358675) (owner: Btullis)
[07:43:23] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1246.eqiad.wmnet with reason: T374215
[07:43:56] (PS1) Urbanecm: DatabaseMentorStore: Cast user IDs to integers before looking them up [extensions/GrowthExperiments] (wmf/1.43.0-wmf.24) - https://gerrit.wikimedia.org/r/1076969 (https://phabricator.wikimedia.org/T375784)
[07:44:30] hi kart_! are you done with deploying?
[07:44:54] (actually... looks like it's train in 10 mins, might not be the best time to do a deployment...)
[07:46:31] (PS4) Muehlenhoff: Remove irc1001/irc2001 from mediawiki-config and add irc1003 [mediawiki-config] - https://gerrit.wikimedia.org/r/1050261 (https://phabricator.wikimedia.org/T331702)
[07:51:16] urbanecm: go ahead :)
[07:51:19] I can delay the train
[07:51:22] jouncebot: nowandnext
[07:51:22] For the next 0 hour(s) and 8 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241001T0700)
[07:51:23] In 0 hour(s) and 8 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241001T0800)
[07:51:31] (CR) Urbanecm: [C:+2] DatabaseMentorStore: Cast user IDs to integers before looking them up [extensions/GrowthExperiments] (wmf/1.43.0-wmf.24) - https://gerrit.wikimedia.org/r/1076969 (https://phabricator.wikimedia.org/T375784) (owner: Urbanecm)
[07:51:42] Thanks hashar.
[07:52:13] ah
[07:52:21] well that one is going to take 30+ minutes
[07:52:22] :/
[07:52:22] (CR) Michael Große: [C:+1] DatabaseMentorStore: Cast user IDs to integers before looking them up [extensions/GrowthExperiments] (wmf/1.43.0-wmf.24) - https://gerrit.wikimedia.org/r/1076969 (https://phabricator.wikimedia.org/T375784) (owner: Urbanecm)
[07:52:39] hashar: i can still cancel the +2 and let you go
[07:52:48] na it is ok
[07:53:41] really, up2you
[07:53:55] ...for some reasons i thought gate for extensions takes less
[07:53:56] I ll just delay the train it is fine
[07:54:09] CI has some slowness issues this morning, though I haven't investigated
[07:54:39] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on pc1013.eqiad.wmnet with reason: T375382
[07:54:45] T375382: Post pc1013 crash - https://phabricator.wikimedia.org/T375382
[07:55:13] SRE, DBA, Sustainability (Incident Followup), Wikimedia-production-error: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10190545 (jcrespo) Resolved→In progress a:Jclark-ctr→None Let's not close it yet, pc1013 is still not working as intended.
[07:55:42] 00:07:34.873 You should really speed up these slow tests (>100ms)...
[07:55:42] 00:07:34.873 1. 2755ms to run MediaWiki\\Extension\\DiscussionTools\\Tests\\ContentThreadItemTest::testGetHTML with data set #2
[07:55:42] 00:07:34.873 2. 2687ms to run MediaWiki\\Extension\\DiscussionTools\\Tests\\ContentThreadItemTest::testGetText with data set #2
[07:55:42] 00:07:34.873 3. 1633ms to run MediaWiki\\Extension\\DiscussionTools\\Tests\\CommentModifierTest::testAddListItem with data set #4
[07:55:44] * hashar whistles
[07:57:42] (CR) Jelto: [C:+2] gerrit: remove bad_browser IPs added >=10 years ago [puppet] - https://gerrit.wikimedia.org/r/1076788 (owner: Dzahn)
[07:58:23] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on pc1013.eqiad.wmnet with reason: T375382
[08:00:04] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241001T0800)
[08:00:43] not my tests
[08:00:52] but nearly 3s is quite a lot, i agree
[08:02:38] (PS2) Hashar: systemd::timer::job: relax send_mail_from parameter [puppet] - https://gerrit.wikimedia.org/r/1076910
[08:05:39] the model is a bit broken to be fair
[08:05:47] blindly running every single tests does not scale :]
[08:06:38] true, but we've seen weird breakages in the past too
[08:06:50] but maybe running on timer and not on commit makes more sense?
[08:07:31] that causes other issues
[08:07:37] the tests would always be broken
[08:07:50] and there is no incentive for devs to make them pass
[08:08:04] but surely we should run a subset of tests instead of the naive approach of running everything
[08:08:30] i'm really curious how would we make sure the tests we don't run pass
[08:08:48] have we reverted the change to run phpunit tests in parallel again, or is wmf-gate-and-submit the one where we did not enable it yet in the first place?
[08:08:51] PROBLEM - Hadoop NodeManager on an-worker1177 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:10:34] urbanecm: that is the million dollars question :D But there are surely tests that do not have to be run since we can know ahead of time they are not affected by a given patch
[08:10:47] > i'm really curious how would we make sure the tests we don't run pass -- we could rotate the subset of tests that we run, say based on which quarter of the minute the test is started
[08:11:13] then it is not really predictable and you have flappy tests :]
[08:11:21] hashar: heh, i kind of thought your answer would start the way it did
[08:11:35] hehe
[08:12:58] I mean, if there is a test failure, then we need a good revert strategy. Alternatively, we could, in addition to that, _require_ that there was a Verified +2 before we start gate-and-submit
[08:12:58] we could make mediawiki properly modular, but...that's realistically only going to be a wish
[08:13:07] but I agree that there are tricky tradeoffs here
[08:13:15] (PS14) JMeybohm: Initial commit of containerd puppet code [puppet] - https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408)
[08:13:15] there are also some oddities such as a build failing after 47 minutes only due to: 00:47:27.855 Message 'growthexperiments-homepage-impact-subheader-text' required by 'ext.growthExperiments.Homepage.NewImpact' must exist
[08:13:31] which sounds simple enough it should run first. That test is from the structure tests in Mediawiki iirc
[08:13:35] (CR) JMeybohm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) (owner: JMeybohm)
[08:13:36] (CR) CI reject: [V:-1] Initial commit of containerd puppet code [puppet] - https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) (owner: JMeybohm)
[08:13:56] looks like `ResourcesTest::testMissingMessages` now requires the database bah
[08:14:20] ah no it is a RL test ..
[08:14:20] (CR) Alexandros Kosiaris: [C:-1] "First pass as well, various inline comments. Overall approach looks good." [puppet] - https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) (owner: JMeybohm)
[08:15:39] (PS15) JMeybohm: Initial commit of containerd puppet code [puppet] - https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408)
[08:15:53] (CR) Ayounsi: "Would it be possible instead to define a new type like "Extendedemail" to help prevent typos in that new string ?" [puppet] - https://gerrit.wikimedia.org/r/1076910 (owner: Hashar)
[08:20:26] PROBLEM - Hadoop NodeManager on an-worker1176 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:22:51] RECOVERY - Hadoop NodeManager on an-worker1177 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:23:11] (CR) JMeybohm: Initial commit of containerd puppet code (13 comments) [puppet] - https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) (owner: JMeybohm)
[08:23:40] (PS16) JMeybohm: Initial commit of containerd puppet code [puppet] - https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408)
[08:23:57] (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.24) - https://gerrit.wikimedia.org/r/1076969 (https://phabricator.wikimedia.org/T375784) (owner: Urbanecm)
[08:25:32] (CR) JMeybohm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) (owner: JMeybohm)
[08:27:07] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:29:53] (CR) Jelto: [C:+2] devtools/hiera: replace legacy facts for puppet 8 compatibility [puppet] - https://gerrit.wikimedia.org/r/1074491 (owner: Dzahn)
[08:32:09] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 49 probes of 778 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:35:39] SRE-swift-storage, Commons, Traffic-Icebox: Certain SVG images on Commons are served as text instead of SVG - https://phabricator.wikimedia.org/T375324#10190627 (Aklapper)
[08:36:25] RECOVERY - Hadoop NodeManager on an-worker1176 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:37:07] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 7 probes of 778 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:39:12] (Merged) jenkins-bot: DatabaseMentorStore: Cast user IDs to integers before looking them up [extensions/GrowthExperiments] (wmf/1.43.0-wmf.24) - https://gerrit.wikimedia.org/r/1076969 (https://phabricator.wikimedia.org/T375784) (owner: Urbanecm)
[08:39:44] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1076969|DatabaseMentorStore: Cast user IDs to integers before looking them up (T375784)]]
[08:39:45] First step
[08:39:51] T375784: Getting a full list of mentors timeouts for long-term mentors (Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of {limit} seconds was exceeded) - https://phabricator.wikimedia.org/T375784
[08:40:53] (CR) FNegri: "These settings were introduced in If5bf8ea98ee73b8426e1b1dc1321d159c4d2f67d to "uncap the network", can they be useful in other projects?
" [puppet] - 10https://gerrit.wikimedia.org/r/1076785 (owner: 10Majavah) [08:41:04] (03CR) 10Hashar: "**TLDR: no if we don't know what still relies solely on RSA!**" [puppet] - 10https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [08:41:43] (03CR) 10Hashar: [C:04-1] gerrit: Remove rsa-2048 certs from apache config [puppet] - 10https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [08:46:24] (03CR) 10Hashar: "Feel free to amend this change or do it another change. My original intent in I1789c7c7cd407144bea8dd84c619865a363dfdb4 was to set the `F" [puppet] - 10https://gerrit.wikimedia.org/r/1076910 (owner: 10Hashar) [08:46:43] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1076969|DatabaseMentorStore: Cast user IDs to integers before looking them up (T375784)]] (duration: 06m 58s) [08:46:50] T375784: Getting a full list of mentors timeouts for long-term mentors (Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of {limit} seconds was exceeded) - https://phabricator.wikimedia.org/T375784 [08:47:06] hashar: deployed, thanks again for your patience [08:47:39] API no longer timeouts :) [08:49:19] awesome! 
[08:49:23] thank you for fixing the site [08:50:47] (03PS1) 10TrainBranchBot: group0 to 1.43.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076972 (https://phabricator.wikimedia.org/T375656) [08:50:48] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.43.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076972 (https://phabricator.wikimedia.org/T375656) (owner: 10TrainBranchBot) [08:51:30] (03Merged) 10jenkins-bot: group0 to 1.43.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076972 (https://phabricator.wikimedia.org/T375656) (owner: 10TrainBranchBot) [08:51:48] (03PS1) 10JMeybohm: kubernetes::worker_containerd: Add registry 'secrets' [labs/private] - 10https://gerrit.wikimedia.org/r/1076973 (https://phabricator.wikimedia.org/T362408) [08:54:37] (03CR) 10JMeybohm: [C:03+1] Revert "experiment w/ externalIPs on staging-codfw" [puppet] - 10https://gerrit.wikimedia.org/r/1076837 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [08:54:56] (03CR) 10JMeybohm: [V:03+2 C:03+2] kubernetes::worker_containerd: Add registry 'secrets' [labs/private] - 10https://gerrit.wikimedia.org/r/1076973 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [08:55:25] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [08:56:29] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:58:17] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.43.0-wmf.25 refs T375656 [08:58:25] T375656: 1.43.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T375656 [08:58:50] * hashar checks logs [09:00:41] moritzm: go ahead with 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1050261 :) [09:00:46] train looks quiet so far [09:02:55] (03CR) 10Btullis: [C:03+2] [analytics][webrequest] Extend retention for unique devices analysis [puppet] - 10https://gerrit.wikimedia.org/r/1076563 (https://phabricator.wikimedia.org/T373630) (owner: 10Aqu) [09:05:54] (03PS17) 10JMeybohm: Initial commit of containerd puppet code [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) [09:06:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [09:06:10] !log ladsgroup@cumin1002 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [09:06:35] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [09:06:43] !log ladsgroup@cumin1002 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [09:06:59] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db1177.eqiad.wmnet with reason: Maintenance [09:07:01] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db1177.eqiad.wmnet with reason: Maintenance [09:07:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T367856)', diff saved to https://phabricator.wikimedia.org/P69437 and previous config saved to /var/cache/conftool/dbconfig/20241001-090708-ladsgroup.json [09:07:14] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [09:09:41] (03PS2) 10Elukey: admin_ng: drop unused aux-k8s control plane IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076681 (https://phabricator.wikimedia.org/T344230) [09:11:52] (03PS1) 10Slyngshede: Menu: Allow users on mobile to close the menu. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1076974 (https://phabricator.wikimedia.org/T376108) [09:14:24] (03PS1) 10Elukey: [DO-NOT-MERGE] sre.hosts.provision: upload the Redfish license [cookbooks] - 10https://gerrit.wikimedia.org/r/1076975 [09:14:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jmm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050261 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [09:15:34] (03Merged) 10jenkins-bot: Remove irc1001/irc2001 from mediawiki-config and add irc1003 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050261 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [09:15:59] !log jmm@deploy2002 Started scap sync-world: Backport for [[gerrit:1050261|Remove irc1001/irc2001 from mediawiki-config and add irc1003 (T331702 T376014)]] [09:16:11] T331702: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702 [09:16:11] T376014: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014 [09:18:41] !log jmm@deploy2002 jmm: Backport for [[gerrit:1050261|Remove irc1001/irc2001 from mediawiki-config and add irc1003 (T331702 T376014)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:18:46] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on cloudvirt1063.eqiad.wmnet with reason: cloudvirt1063 needs maintenance T375223 [09:18:53] T375223: 2024-09-21 NodeDown cloudvirt1063 - https://phabricator.wikimedia.org/T375223 [09:19:00] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on cloudvirt1063.eqiad.wmnet with reason: cloudvirt1063 needs maintenance T375223 [09:19:35] !log jmm@deploy2002 jmm: Continuing with sync [09:24:06] !log jmm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1050261|Remove irc1001/irc2001 from 
mediawiki-config and add irc1003 (T331702 T376014)]] (duration: 08m 07s) [09:24:13] T331702: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702 [09:24:14] T376014: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014 [09:27:05] (03PS4) 10Elukey: role::ircstream: add template for config file plus basic settings [puppet] - 10https://gerrit.wikimedia.org/r/1076776 (https://phabricator.wikimedia.org/T376014) [09:27:30] (03CR) 10JMeybohm: [C:03+1] admin_ng: drop unused aux-k8s control plane IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076681 (https://phabricator.wikimedia.org/T344230) (owner: 10Elukey) [09:27:54] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4159/console" [puppet] - 10https://gerrit.wikimedia.org/r/1076776 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [09:29:10] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4160/co" [puppet] - 10https://gerrit.wikimedia.org/r/1076776 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [09:29:56] (03CR) 10Volans: "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1076975 (owner: 10Elukey) [09:32:20] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1076776 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [09:34:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int (k8s) 1.362s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:38:20] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:38:44] (03CR) 10Elukey: [C:03+2] admin_ng: drop unused aux-k8s control plane IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076681 (https://phabricator.wikimedia.org/T344230) (owner: 10Elukey) [09:39:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int (k8s) 1.362s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:40:49] (03CR) 10Elukey: [V:03+1 C:03+2] role::ircstream: add template for config file plus basic settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1076776 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [09:42:09] RECOVERY - BGP status on cr2-eqord is OK: BGP OK - up: 196, down: 7, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:45:11] PROBLEM - Host lists1004 is DOWN: CRITICAL - Host Unreachable (208.80.154.81) [09:46:29] RECOVERY - Host lists1004 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [09:53:56] (03CR) 10Muehlenhoff: [C:03+2] Use nftables for ircstream [puppet] - 10https://gerrit.wikimedia.org/r/1076781 (owner: 10Muehlenhoff) [09:55:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074396 
(https://phabricator.wikimedia.org/T373967) (owner: 10Santiago Faci) [09:56:47] (03PS2) 10Elukey: [DO-NOT-MERGE] sre.hosts.provision: upload the Redfish license [cookbooks] - 10https://gerrit.wikimedia.org/r/1076975 [09:57:25] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host an-conf1004.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [09:59:45] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-conf1004.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241001T1000) [10:00:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc1003.wikimedia.org [10:01:53] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host an-conf1004.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:02:13] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-conf1004.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:03:08] (03PS3) 10Elukey: [DO-NOT-MERGE] sre.hosts.provision: upload the Redfish license [cookbooks] - 10https://gerrit.wikimedia.org/r/1076975 [10:04:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc1003.wikimedia.org [10:05:51] jouncebot: now [10:05:51] For the next 0 hour(s) and 54 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241001T1000) [10:05:58] (03PS4) 10Elukey: [DO-NOT-MERGE] sre.hosts.provision: upload the Redfish license [cookbooks] - 10https://gerrit.wikimedia.org/r/1076975 [10:06:02] jouncebot: next [10:06:03] In 1 hour(s) and 53 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241001T1200) [10:06:39] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host an-conf1004.mgmt.eqiad.wmnet with 
chassis set policy GRACEFUL_RESTART [10:07:10] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-conf1004.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:10:09] 06SRE, 06Infrastructure-Foundations: codfw: one VM for irc.wikimedia.org - https://phabricator.wikimedia.org/T376119 (10elukey) 03NEW [10:11:19] 06SRE, 06Infrastructure-Foundations: codfw: one VM for irc.wikimedia.org - https://phabricator.wikimedia.org/T376119#10190898 (10elukey) ` +-------+-------+-----------+----------+-----------+---------+-----------+ | Group | Nodes | Instances | MFree | MFree avg | DFree | DFree avg | +-------+-------+-----... [10:11:35] !log elukey@cumin1002 START - Cookbook sre.ganeti.makevm for new host irc2003.wikimedia.org [10:11:37] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [10:13:03] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host an-conf1005.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:13:21] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-conf1005.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:15:11] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM irc2003.wikimedia.org - elukey@cumin1002" [10:15:15] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM irc2003.wikimedia.org - elukey@cumin1002" [10:15:15] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:15:15] !log elukey@cumin1002 START - Cookbook sre.dns.wipe-cache irc2003.wikimedia.org on all recursors [10:15:18] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) irc2003.wikimedia.org on all recursors [10:15:34] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host 
an-conf1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:15:44] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM irc2003.wikimedia.org - elukey@cumin1002" [10:15:48] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM irc2003.wikimedia.org - elukey@cumin1002" [10:15:54] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-conf1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:16:23] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host irc2003.wikimedia.org with OS bookworm [10:17:11] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host dbproxy1028.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:17:24] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbproxy1028.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:19:01] jouncebot: nowandnext [10:19:01] For the next 0 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241001T1000) [10:19:01] In 1 hour(s) and 40 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241001T1200) [10:19:34] (03CR) 10CI reject: [V:04-1] [DO-NOT-MERGE] sre.hosts.provision: upload the Redfish license [cookbooks] - 10https://gerrit.wikimedia.org/r/1076975 (owner: 10Elukey) [10:21:32] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host dbproxy1029.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:21:52] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbproxy1029.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:23:04] !log elukey@cumin2002 START - 
Cookbook sre.hosts.provision for host dbproxy2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:23:14] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbproxy2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:24:31] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host dbproxy2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:24:42] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbproxy2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:25:15] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host dbproxy2007.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:25:25] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbproxy2007.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:26:01] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host dbproxy2008.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:26:11] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbproxy2008.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:26:21] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:26:45] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:28:17] FIRING: [2x] ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:29:06] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack: Upload 
redfish licenses to supermicro hosts - https://phabricator.wikimedia.org/T376121 (10elukey) 03NEW [10:29:35] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52776 bytes in 0.466 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:29:46] (03CR) 10Effie Mouzeli: [C:03+2] mediawiki: add wikitech to virtual hosts [puppet] - 10https://gerrit.wikimedia.org/r/1059103 (https://phabricator.wikimedia.org/T371360) (owner: 10Effie Mouzeli) [10:29:53] FIRING: [2x] ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:30:11] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8997 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:30:25] RESOLVED: [2x] ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:31:56] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2035.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:32:10] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2035.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:33:31] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2035.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:33:50] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2035.mgmt.codfw.wmnet with chassis set 
policy GRACEFUL_RESTART [10:35:34] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host krb1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:35:45] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host krb1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:36:32] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host deploy1003.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:36:49] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host deploy1003.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:38:20] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host parsoidtest1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:38:33] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parsoidtest1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:39:28] (03PS21) 10Effie Mouzeli: wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) [10:40:36] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ml-serve2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:40:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jiji@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) (owner: 10Effie Mouzeli) [10:41:03] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:41:05] (03CR) 10Kamila Součková: [C:03+1] admin: add zoe to deployment (move from ldap_only_users) [puppet] - 10https://gerrit.wikimedia.org/r/1069175 (https://phabricator.wikimedia.org/T373666) (owner: 10Ssingh) [10:41:32] (03Merged) 10jenkins-bot: wikitech: de-wikitech mediawiki-config 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) (owner: 10Effie Mouzeli) [10:41:52] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ml-serve2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:41:56] !log jiji@deploy2002 Started scap sync-world: Backport for [[gerrit:1059339|wikitech: de-wikitech mediawiki-config (T371537 T371592 T371374 T371359)]] [10:42:03] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:42:06] T371537: MVP: Privately serve wikitech via mwdebug1001 - https://phabricator.wikimedia.org/T371537 [10:42:07] T371592: LdapAuthentication: Disable extension from Wikitech - https://phabricator.wikimedia.org/T371592 [10:42:08] T371374: mediawiki-config: consolidate labswiki - https://phabricator.wikimedia.org/T371374 [10:42:09] T371359: Migrate Wikitech's Jobqueue - https://phabricator.wikimedia.org/T371359 [10:42:18] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for zoe - https://phabricator.wikimedia.org/T373666#10190999 (10kamila) [10:42:45] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ml-serve2011.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:42:58] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve2011.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:44:01] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ml-staging2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:44:11] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-staging2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:44:23] !log jiji@deploy2002 jiji: Backport for [[gerrit:1059339|wikitech: de-wikitech mediawiki-config (T371537 T371592 T371374 T371359)]] synced to the 
testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:47:50] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Seanleong-WMDE - https://phabricator.wikimedia.org/T376034#10191019 (10WMDECyn) Request Approved [10:47:54] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack: Upload redfish licenses to supermicro hosts - https://phabricator.wikimedia.org/T376121#10191021 (10elukey) [10:48:21] !log jiji@deploy2002 Sync cancelled. [10:52:52] !log jiji@deploy2002 Started scap sync-world: Backport for [[gerrit:1059339|wikitech: de-wikitech mediawiki-config (T371537 T371592 T371374 T371359)]] [10:53:00] T371537: MVP: Privately serve wikitech via mwdebug1001 - https://phabricator.wikimedia.org/T371537 [10:53:00] T371592: LdapAuthentication: Disable extension from Wikitech - https://phabricator.wikimedia.org/T371592 [10:53:01] T371374: mediawiki-config: consolidate labswiki - https://phabricator.wikimedia.org/T371374 [10:53:01] T371359: Migrate Wikitech's Jobqueue - https://phabricator.wikimedia.org/T371359 [10:55:17] !log jiji@deploy2002 jiji: Backport for [[gerrit:1059339|wikitech: de-wikitech mediawiki-config (T371537 T371592 T371374 T371359)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:56:17] !log jiji@deploy2002 jiji: Continuing with sync [11:00:20] (03Abandoned) 10Effie Mouzeli: (DNM) trafficserver: remove wikitech routing [puppet] - 10https://gerrit.wikimedia.org/r/1059118 (https://phabricator.wikimedia.org/T371358) (owner: 10Effie Mouzeli) [11:01:16] !log jiji@deploy2002 Finished scap sync-world: Backport for [[gerrit:1059339|wikitech: de-wikitech mediawiki-config (T371537 T371592 T371374 T371359)]] (duration: 08m 23s) [11:01:35] T371537: MVP: Privately serve wikitech via mwdebug1001 - https://phabricator.wikimedia.org/T371537 [11:01:36] T371592: LdapAuthentication: Disable extension from Wikitech - https://phabricator.wikimedia.org/T371592 [11:01:36] T371374: mediawiki-config: 
consolidate labswiki - https://phabricator.wikimedia.org/T371374 [11:01:37] T371359: Migrate Wikitech's Jobqueue - https://phabricator.wikimedia.org/T371359 [11:01:53] (03CR) 10Elukey: [C:03+1] "LGTM! I think it is in a good shape to be used/tested, then we'll see if anything pops up. Feel free to merge once CI likes you again :)" [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto) [11:02:15] (03PS1) 10Effie Mouzeli: trafficserver: remove wikitech routing [puppet] - 10https://gerrit.wikimedia.org/r/1076987 (https://phabricator.wikimedia.org/T371358) [11:02:36] (03CR) 10CI reject: [V:04-1] trafficserver: remove wikitech routing [puppet] - 10https://gerrit.wikimedia.org/r/1076987 (https://phabricator.wikimedia.org/T371358) (owner: 10Effie Mouzeli) [11:03:31] (03PS2) 10Effie Mouzeli: trafficserver: remove wikitech routing [puppet] - 10https://gerrit.wikimedia.org/r/1076987 (https://phabricator.wikimedia.org/T371358) [11:05:27] FIRING: [4x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:05:47] (03PS3) 10Dzahn: gitlab: replace legacy Hiera facts with newer syntax [puppet] - 10https://gerrit.wikimedia.org/r/1074493 [11:06:52] (03PS1) 10Elukey: Add config for irc2003 [puppet] - 10https://gerrit.wikimedia.org/r/1076989 (https://phabricator.wikimedia.org/T376119) [11:07:28] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4163/console" [puppet] - 10https://gerrit.wikimedia.org/r/1074493 (owner: 10Dzahn) [11:07:30] (03CR) 10Elukey: [C:03+2] Add config for irc2003 [puppet] - 10https://gerrit.wikimedia.org/r/1076989 (https://phabricator.wikimedia.org/T376119) (owner: 10Elukey) 
[11:08:51] PROBLEM - Hadoop NodeManager on an-worker1177 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:09:10] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on irc2003.wikimedia.org with reason: host reimage [11:10:54] (03PS1) 10Muehlenhoff: ircstream: Switch to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1076991 [11:11:03] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1076987 (https://phabricator.wikimedia.org/T371358) (owner: 10Effie Mouzeli) [11:11:03] (03CR) 10Vgutierrez: [C:04-1] "please remove wikitech.wm.o mentions in role/common/text.yaml as well" [puppet] - 10https://gerrit.wikimedia.org/r/1076987 (https://phabricator.wikimedia.org/T371358) (owner: 10Effie Mouzeli) [11:11:25] (03CR) 10Vgutierrez: [C:04-1] "role/common/cache/text.yaml sorry" [puppet] - 10https://gerrit.wikimedia.org/r/1076987 (https://phabricator.wikimedia.org/T371358) (owner: 10Effie Mouzeli) [11:11:27] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: replace legacy Hiera facts with newer syntax [puppet] - 10https://gerrit.wikimedia.org/r/1074493 (owner: 10Dzahn) [11:12:03] (03CR) 10Vgutierrez: [C:03+1] "forget it.. 
brain fart :)" [puppet] - 10https://gerrit.wikimedia.org/r/1076987 (https://phabricator.wikimedia.org/T371358) (owner: 10Effie Mouzeli) [11:12:43] (03PS1) 10Ladsgroup: wikitech: Soft connect wikitech to SUL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076992 (https://phabricator.wikimedia.org/T161859) [11:12:50] (03PS3) 10Effie Mouzeli: trafficserver: remove wikitech routing [puppet] - 10https://gerrit.wikimedia.org/r/1076987 (https://phabricator.wikimedia.org/T371358) [11:12:55] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on irc2003.wikimedia.org with reason: host reimage [11:14:25] (03CR) 10Ladsgroup: "The config diff: https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig/2208/console" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076992 (https://phabricator.wikimedia.org/T161859) (owner: 10Ladsgroup) [11:15:17] (03CR) 10Vgutierrez: [C:03+1] "on the other hand, setting caching: normal is a NOOP, so PS3 works and it's cleaner IMHO" [puppet] - 10https://gerrit.wikimedia.org/r/1076987 (https://phabricator.wikimedia.org/T371358) (owner: 10Effie Mouzeli) [11:16:06] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1076991 (owner: 10Muehlenhoff) [11:16:23] !log Switching wikitech to k8s - T292707 [11:16:36] (03CR) 10Effie Mouzeli: [C:03+2] trafficserver: remove wikitech routing [puppet] - 10https://gerrit.wikimedia.org/r/1076987 (https://phabricator.wikimedia.org/T371358) (owner: 10Effie Mouzeli) [11:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:39] T292707: ☂ Migrate Wikitech to Kubernetes - https://phabricator.wikimedia.org/T292707 [11:20:32] effie: 🎉 [11:21:29] PROBLEM - Hadoop NodeManager on an-worker1176 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:23:08] (03PS1) 10Ladsgroup: Drop wikitech.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076995 (https://phabricator.wikimedia.org/T371592) [11:23:53] RECOVERY - Hadoop NodeManager on an-worker1177 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:28:17] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host irc2003.wikimedia.org with OS bookworm [11:28:17] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host irc2003.wikimedia.org [11:28:46] (03CR) 10Effie Mouzeli: [C:03+1] "woohoo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076995 (https://phabricator.wikimedia.org/T371592) (owner: 10Ladsgroup) [11:35:29] RECOVERY - Hadoop NodeManager on an-worker1176 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:35:37] (03CR) 10Hnowlan: [C:03+1] shellbox: add support for service.deployment: none [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074494 (https://phabricator.wikimedia.org/T375243) (owner: 10Scott French) [11:36:07] 06SRE, 06Infrastructure-Foundations: codfw: one VM for irc.wikimedia.org - https://phabricator.wikimedia.org/T376119#10191235 (10elukey) 05Open→03Resolved a:03elukey [11:40:12] (03PS1) 10Ladsgroup: Wikitech: Connect wikitech to external storage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077000 (https://phabricator.wikimedia.org/T376129) [11:41:03] (03PS2) 10Ladsgroup: Drop wikitech.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076995 (https://phabricator.wikimedia.org/T371592) [11:41:07] (03CR) 
10Ladsgroup: [C:03+2] Drop wikitech.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076995 (https://phabricator.wikimedia.org/T371592) (owner: 10Ladsgroup) [11:41:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076995 (https://phabricator.wikimedia.org/T371592) (owner: 10Ladsgroup) [11:41:50] (03Merged) 10jenkins-bot: Drop wikitech.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076995 (https://phabricator.wikimedia.org/T371592) (owner: 10Ladsgroup) [11:42:14] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1076995|Drop wikitech.php (T371592 T371374)]] [11:42:22] T371592: LdapAuthentication: Disable extension from Wikitech - https://phabricator.wikimedia.org/T371592 [11:42:22] T371374: mediawiki-config: consolidate labswiki - https://phabricator.wikimedia.org/T371374 [11:44:24] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1076995|Drop wikitech.php (T371592 T371374)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:45:16] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [11:49:47] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1076995|Drop wikitech.php (T371592 T371374)]] (duration: 07m 32s) [11:49:50] T371592: LdapAuthentication: Disable extension from Wikitech - https://phabricator.wikimedia.org/T371592 [11:49:51] T371374: mediawiki-config: consolidate labswiki - https://phabricator.wikimedia.org/T371374 [11:51:00] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons. 
[11:51:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076992 (https://phabricator.wikimedia.org/T161859) (owner: 10Ladsgroup) [11:51:39] (03Merged) 10jenkins-bot: wikitech: Soft connect wikitech to SUL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076992 (https://phabricator.wikimedia.org/T161859) (owner: 10Ladsgroup) [11:52:05] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1076992|wikitech: Soft connect wikitech to SUL (T161859)]] [11:52:07] T161859: Make Wikitech an SUL wiki - https://phabricator.wikimedia.org/T161859 [11:54:21] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1076992|wikitech: Soft connect wikitech to SUL (T161859)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:57:22] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241001T1200) [12:00:11] (03PS1) 10Elukey: services: add irc2003 to the MW's network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077003 (https://phabricator.wikimedia.org/T376014) [12:01:58] (03CR) 10D3r1ck01: [C:03+2] "beta-only patch." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076195 (https://phabricator.wikimedia.org/T375787) (owner: 10D3r1ck01) [12:01:58] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1076992|wikitech: Soft connect wikitech to SUL (T161859)]] (duration: 09m 53s) [12:02:04] (03PS1) 10Elukey: Add irc2003 to the irc settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077004 (https://phabricator.wikimedia.org/T376014) [12:02:09] T161859: Make Wikitech an SUL wiki - https://phabricator.wikimedia.org/T161859 [12:02:43] (03Merged) 10jenkins-bot: [beta-cluster] Enable cookie-based SUL3 feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076195 (https://phabricator.wikimedia.org/T375787) (owner: 10D3r1ck01) [12:03:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077000 (https://phabricator.wikimedia.org/T376129) (owner: 10Ladsgroup) [12:06:03] (03PS2) 10Ladsgroup: Wikitech: Connect wikitech to external storage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077000 (https://phabricator.wikimedia.org/T376129) [12:06:12] (03CR) 10Ladsgroup: [C:03+2] "." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077000 (https://phabricator.wikimedia.org/T376129) (owner: 10Ladsgroup) [12:06:57] (03Merged) 10jenkins-bot: Wikitech: Connect wikitech to external storage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077000 (https://phabricator.wikimedia.org/T376129) (owner: 10Ladsgroup) [12:07:22] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1077000|Wikitech: Connect wikitech to external storage (T376129)]] [12:07:24] T376129: Database clean ups after migration of wikitech to production - https://phabricator.wikimedia.org/T376129 [12:09:38] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1077000|Wikitech: Connect wikitech to external storage (T376129)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:11:08] (03PS1) 10Ladsgroup: wikitech: Allow 'crats to rename local users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077006 (https://phabricator.wikimedia.org/T161859) [12:12:40] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [12:17:15] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077000|Wikitech: Connect wikitech to external storage (T376129)]] (duration: 09m 53s) [12:17:18] T376129: Database clean ups after migration of wikitech to production - https://phabricator.wikimedia.org/T376129 [12:17:36] (03PS2) 10Samtar: IS-labs: Enable wgUseCodexSpecialBlock on test.beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075523 (https://phabricator.wikimedia.org/T375610) [12:17:39] (03CR) 10Ladsgroup: [C:03+2] wikitech: Allow 'crats to rename local users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077006 (https://phabricator.wikimedia.org/T161859) (owner: 10Ladsgroup) [12:17:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077006 (https://phabricator.wikimedia.org/T161859) (owner: 10Ladsgroup) 
[12:19:36] jouncebot: nowandnext [12:19:37] For the next 0 hour(s) and 40 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241001T1200) [12:19:37] In 0 hour(s) and 40 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241001T1300) [12:20:03] (03Merged) 10jenkins-bot: wikitech: Allow 'crats to rename local users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077006 (https://phabricator.wikimedia.org/T161859) (owner: 10Ladsgroup) [12:20:30] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1077006|wikitech: Allow 'crats to rename local users (T161859)]] [12:20:33] T161859: Make Wikitech an SUL wiki - https://phabricator.wikimedia.org/T161859 [12:21:12] (03PS1) 10Effie Mouzeli: dsh: remove cloudweb hosts from mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/1077008 (https://phabricator.wikimedia.org/T371374) [12:21:45] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077008 (https://phabricator.wikimedia.org/T371374) (owner: 10Effie Mouzeli) [12:22:46] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1077006|wikitech: Allow 'crats to rename local users (T161859)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:23:17] (03PS2) 10Effie Mouzeli: dsh: remove cloudweb hosts from mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/1077008 (https://phabricator.wikimedia.org/T371374) [12:23:21] !log mwscript maintenance/storage/moveToExternal.php --wiki=labswiki --undo /home/ladsgroup/T376129.undo.sql DB cluster31 (T376129) [12:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:24] T376129: Database clean ups after migration of wikitech to production - https://phabricator.wikimedia.org/T376129 [12:23:51] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [12:23:55] (03CR) 10Effie 
Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077008 (https://phabricator.wikimedia.org/T371374) (owner: 10Effie Mouzeli) [12:27:11] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077003 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [12:27:22] (03CR) 10Slyngshede: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077004 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [12:27:28] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077004 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [12:28:22] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077006|wikitech: Allow 'crats to rename local users (T161859)]] (duration: 07m 51s) [12:28:25] T161859: Make Wikitech an SUL wiki - https://phabricator.wikimedia.org/T161859 [12:35:04] (03PS3) 10Effie Mouzeli: dsh: remove cloudweb hosts from mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/1077008 (https://phabricator.wikimedia.org/T371374) [12:37:01] (03CR) 10Giuseppe Lavagetto: [C:03+1] dsh: remove cloudweb hosts from mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/1077008 (https://phabricator.wikimedia.org/T371374) (owner: 10Effie Mouzeli) [12:37:12] (03CR) 10Effie Mouzeli: [C:03+2] dsh: remove cloudweb hosts from mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/1077008 (https://phabricator.wikimedia.org/T371374) (owner: 10Effie Mouzeli) [12:43:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int (k8s) 1.215s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:44:55] (03CR) 10CDanis: 
[C:03+2] Revert "experiment w/ externalIPs on staging-codfw" [puppet] - 10https://gerrit.wikimedia.org/r/1076837 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [12:48:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int (k8s) 1.215s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:55:59] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10191609 (10aborrero) I'm sorry I completely missed the pings here. I will get this done today. [12:56:29] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241001T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. 
[13:00:50] nothing to deploy :) [13:03:11] great :) [13:04:57] FIRING: [4x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:09:38] (03CR) 10Elukey: [C:03+1] Remove puppetserver1002 from active puppet servers [dns] - 10https://gerrit.wikimedia.org/r/1076899 (https://phabricator.wikimedia.org/T376058) (owner: 10Muehlenhoff) [13:13:36] (03CR) 10Elukey: [C:03+2] Add aux-k8s-{ctrl,worker}1003 to AUX K8s [puppet] - 10https://gerrit.wikimedia.org/r/1076679 (https://phabricator.wikimedia.org/T344230) (owner: 10Elukey) [13:16:56] (03PS6) 10Bking: dse-k8s: Add service configuration for airflow-analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076813 (https://phabricator.wikimedia.org/T371208) [13:19:56] (03PS3) 10Samtar: IS-labs: Enable wgUseCodexSpecialBlock on test.beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075523 (https://phabricator.wikimedia.org/T375610) [13:21:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075523 (https://phabricator.wikimedia.org/T375610) (owner: 10Samtar) [13:21:25] FIRING: [2x] SystemdUnitFailed: docker.service on aux-k8s-ctrl1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:21:51] (03Merged) 10jenkins-bot: IS-labs: Enable wgUseCodexSpecialBlock on test.beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075523 (https://phabricator.wikimedia.org/T375610) (owner: 10Samtar) [13:21:58] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop analytics cluster: Restart of 
jvm daemons. [13:22:28] (03CR) 10JMeybohm: [C:03+1] services: add irc2003 to the MW's network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077003 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [13:26:25] RESOLVED: [2x] SystemdUnitFailed: docker.service on aux-k8s-ctrl1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:27:29] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10191716 (10Papaul) 05Resolved→03Open a:05ayounsi→03Papaul I have to update netbox with the inventory and new serial number [13:27:50] FIRING: [3x] KubernetesCalicoDown: aux-k8s-ctrl1003.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:31:31] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=aux-k8s-worker1003.eqiad.wmnet [13:31:36] !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=aux-k8s-worker1003.eqiad.wmnet [13:32:13] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=aux-k8s-ctrl1003.eqiad.wmnet [13:32:31] !log elukey@puppetserver1001 conftool action : set/weight=1; selector: name=aux-k8s-ctrl1003.eqiad.wmnet [13:37:50] FIRING: [3x] KubernetesCalicoDown: aux-k8s-ctrl1003.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:41:27] new nodes --^ [13:49:36] (03CR) 10Btullis: dse-k8s: Add service configuration for airflow-analytics-test (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076813 (https://phabricator.wikimedia.org/T371208) (owner: 10Bking) [13:49:44] !log 
elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=aux-k8s-ctrl1003.eqiad.wmnet [13:49:48] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=aux-k8s-worker1003.eqiad.wmnet [13:50:22] (03PS1) 10AikoChou: ml-services: update ref-need model and increase cpu and memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077024 (https://phabricator.wikimedia.org/T371902) [13:51:07] RECOVERY - ensure kvm processes are running on cloudvirt1063 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:51:23] (03PS1) 10Ssingh: Move PyBal failing BGP sessions alert to team-sre and page [alerts] - 10https://gerrit.wikimedia.org/r/1077025 [13:57:58] (03PS1) 10AikoChou: ml-services: fix indentation in articlequality resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077027 [13:59:01] (03PS6) 10Bking: wdqs-categories: introduce VM for testing [puppet] - 10https://gerrit.wikimedia.org/r/1076841 (https://phabricator.wikimedia.org/T375687) [13:59:40] (03PS1) 10CDanis: parametrize test_hiera_lookup on fmt [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077028 [14:01:05] (03CR) 10Bking: wdqs-categories: introduce VM for testing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1076841 (https://phabricator.wikimedia.org/T375687) (owner: 10Bking) [14:04:50] (03PS1) 10Lucas Werkmeister (WMDE): DNM: Empty patch to test schedule-deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077030 [14:05:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077030 (owner: 10Lucas Werkmeister (WMDE)) [14:05:44] (03Abandoned) 10Lucas Werkmeister (WMDE): DNM: Empty patch to test schedule-deployment [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1077030 (owner: 10Lucas Werkmeister (WMDE)) [14:08:53] PROBLEM - Hadoop NodeManager on an-worker1177 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:10:31] !log installing cups security updates [14:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:41] RECOVERY - SSH on ml-serve2001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:11:43] RECOVERY - Host ml-serve2001 is UP: PING OK - Packet loss = 0%, RTA = 30.52 ms [14:12:20] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 369, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:13:11] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 287, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:14:42] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10191898 (10Jhancock.wm) [14:15:22] !log added the label node-role.kubernetes.io/control-plane='' to all k8s apiservers - T334234 [14:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:25] T334234: Migrate to node-role.kubernetes.io/control-plane label/taint - https://phabricator.wikimedia.org/T334234 [14:17:50] RESOLVED: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:18:13] (03PS4) 10Giuseppe Lavagetto: puppetserver: run conftool2git [puppet] - 
10https://gerrit.wikimedia.org/r/1075220 (https://phabricator.wikimedia.org/T374723) [14:18:48] (03CR) 10Klausman: [C:03+1] ml-services: update ref-need model and increase cpu and memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077024 (https://phabricator.wikimedia.org/T371902) (owner: 10AikoChou) [14:19:00] (03CR) 10Klausman: [C:03+1] ml-services: fix indentation in articlequality resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077027 (owner: 10AikoChou) [14:20:33] PROBLEM - Hadoop NodeManager on an-worker1176 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:20:46] (03PS7) 10Giuseppe Lavagetto: profile: add conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075040 [14:20:46] (03PS5) 10Giuseppe Lavagetto: puppetserver: run conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075220 (https://phabricator.wikimedia.org/T374723) [14:22:53] RECOVERY - Hadoop NodeManager on an-worker1177 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:24:51] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.10 point update - https://phabricator.wikimedia.org/T368288#10191940 (10MoritzMuehlenhoff) [14:25:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host logging-hd2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:25:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host logging-hd2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:29:17] (03PS1) 10JMeybohm: kubernetes: Migrate taint node-role.kubernetes.io/master [puppet] - 10https://gerrit.wikimedia.org/r/1077036 (https://phabricator.wikimedia.org/T334234) [14:29:27] 
(03PS1) 10Btullis: airflow: Use the analytics user for task executor pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077037 (https://phabricator.wikimedia.org/T375895) [14:30:06] (03PS2) 10JMeybohm: kubernetes: Migrate taint node-role.kubernetes.io/master [puppet] - 10https://gerrit.wikimedia.org/r/1077036 (https://phabricator.wikimedia.org/T334234) [14:30:15] (03PS3) 10Snwachukwu: Change New Eventschemas Git URLs [puppet] - 10https://gerrit.wikimedia.org/r/1071891 (https://phabricator.wikimedia.org/T366836) [14:30:27] (03CR) 10CI reject: [V:04-1] kubernetes: Migrate taint node-role.kubernetes.io/master [puppet] - 10https://gerrit.wikimedia.org/r/1077036 (https://phabricator.wikimedia.org/T334234) (owner: 10JMeybohm) [14:31:37] (03PS2) 10Ssingh: Move PyBal failing BGP sessions alert to team-sre and page [alerts] - 10https://gerrit.wikimedia.org/r/1077025 [14:32:02] (03PS2) 10Btullis: airflow: Use the analytics user for task executor pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077037 (https://phabricator.wikimedia.org/T375895) [14:32:14] (03CR) 10Snwachukwu: Change New Eventschemas Git URLs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071891 (https://phabricator.wikimedia.org/T366836) (owner: 10Snwachukwu) [14:32:16] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host logging-hd2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:32:19] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host logging-hd2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:33:57] (03PS7) 10Giuseppe Lavagetto: git: add replicated_local_repo define [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) [14:33:57] (03PS7) 10Giuseppe Lavagetto: conftool::client: allow setting the conftool2git address [puppet] - 10https://gerrit.wikimedia.org/r/1075039 (https://phabricator.wikimedia.org/T374723) [14:33:57] 
(03PS8) 10Giuseppe Lavagetto: profile: add conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075040 [14:33:58] (03PS6) 10Giuseppe Lavagetto: puppetserver: run conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075220 (https://phabricator.wikimedia.org/T374723) [14:33:59] (03PS1) 10FNegri: alertmanager: fix WMCS template [puppet] - 10https://gerrit.wikimedia.org/r/1077038 (https://phabricator.wikimedia.org/T375479) [14:35:33] RECOVERY - Hadoop NodeManager on an-worker1176 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:37:07] (03CR) 10CI reject: [V:04-1] git: add replicated_local_repo define [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto) [14:37:19] (03PS3) 10Btullis: airflow: Use the analytics user for task executor pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077037 (https://phabricator.wikimedia.org/T375895) [14:37:49] (03PS2) 10FNegri: alertmanager: fix WMCS template [puppet] - 10https://gerrit.wikimedia.org/r/1077038 (https://phabricator.wikimedia.org/T375479) [14:38:15] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:31] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T376094#10191982 (10phaultfinder) [14:38:54] (03PS3) 10Ssingh: Move PyBal failing BGP sessions alert to team-sre and page [alerts] - 10https://gerrit.wikimedia.org/r/1077025 [14:40:05] (03CR) 10CI reject: [V:04-1] Move PyBal failing BGP sessions alert to team-sre and page [alerts] - 10https://gerrit.wikimedia.org/r/1077025 (owner: 10Ssingh) [14:44:00] (03PS9) 
10Giuseppe Lavagetto: profile: add conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075040 [14:44:00] (03PS7) 10Giuseppe Lavagetto: puppetserver: run conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075220 (https://phabricator.wikimedia.org/T374723) [14:45:06] (03PS1) 10Majavah: hieradata: Stop monitoring Wikitech on cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/1077040 (https://phabricator.wikimedia.org/T292707) [14:45:08] (03PS1) 10Majavah: openstack: Stop running Wikitech jobs on cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/1077041 (https://phabricator.wikimedia.org/T292707) [14:45:19] !log added the taint node-role.kubernetes.io/control-plane:NoSchedule to wikikube staging apiservers - T334234 [14:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:21] T334234: Migrate to node-role.kubernetes.io/control-plane label/taint - https://phabricator.wikimedia.org/T334234 [14:46:18] (03PS4) 10Ssingh: Move PyBal failing BGP sessions alert to team-sre and page [alerts] - 10https://gerrit.wikimedia.org/r/1077025 [14:48:32] (03Abandoned) 10Ssingh: sre.dns.roll-restart-haproxy [cookbooks] - 10https://gerrit.wikimedia.org/r/1074266 (https://phabricator.wikimedia.org/T375232) (owner: 10CDobbins) [14:49:07] (03CR) 10Vgutierrez: [C:03+1] Move PyBal failing BGP sessions alert to team-sre and page [alerts] - 10https://gerrit.wikimedia.org/r/1077025 (owner: 10Ssingh) [14:50:13] (03PS36) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [14:52:12] (03PS37) 10Ssingh: sre.dns.pdns-recursor: add rolling restart script for DNS boxes [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [14:54:23] (03PS1) 10Cwhite: curator: increase timeout_override for setting replicas [puppet] - 10https://gerrit.wikimedia.org/r/1077042 
(https://phabricator.wikimedia.org/T364190) [14:56:14] (03PS38) 10Ssingh: sre.dns.roll-restart: add rolling restart script for DNS boxes [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [14:56:20] (03PS2) 10Cwhite: curator: increase timeout for setting replicas [puppet] - 10https://gerrit.wikimedia.org/r/1077042 (https://phabricator.wikimedia.org/T364190) [14:58:14] (03PS1) 10CDanis: CoreDNS chart changes to serve outside the cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077043 (https://phabricator.wikimedia.org/T344171) [14:59:30] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:05] eoghan, jelto, arnoldokoth, and mutante: OwO what's this, a deployment window?? SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241001T1500). 
nyaa~ [15:00:50] (03PS2) 10CDanis: CoreDNS chart changes to serve outside the cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077043 (https://phabricator.wikimedia.org/T344171) [15:01:29] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator/Phorge update [15:01:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator/Phorge update [15:01:46] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator/Phorge update [15:02:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator/Phorge update [15:02:25] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab.wmfusercontent.org with reason: Phabricator/Phorge update [15:02:25] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:30:00 on phab.wmfusercontent.org with reason: Phabricator/Phorge update [15:03:16] !log brennen@deploy2002 Started deploy [phabricator/deployment@33a2c8d]: test deploy phab2002 for T376149 [15:03:19] T376149: Deploy Phabricator/Phorge 2024-10-01 - https://phabricator.wikimedia.org/T376149 [15:03:38] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab.wmfusercontent.org with reason: Phabricator/Phorge update [15:03:39] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:30:00 on phab.wmfusercontent.org with reason: Phabricator/Phorge update [15:03:46] !log brennen@deploy2002 Finished deploy [phabricator/deployment@33a2c8d]: test deploy phab2002 for T376149 (duration: 00m 30s) [15:04:13] !log brennen@deploy2002 Started deploy [phabricator/deployment@33a2c8d]: deploy phab1004 for T376149 [15:04:19] (03PS3) 10CDanis: CoreDNS chart changes to serve outside the cluster [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1077043 (https://phabricator.wikimedia.org/T344171) [15:05:21] !log brennen@deploy2002 Finished deploy [phabricator/deployment@33a2c8d]: deploy phab1004 for T376149 (duration: 01m 07s) [15:06:59] (03PS1) 10Majavah: Make Wikitech behave a bit more like a SUL wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077046 (https://phabricator.wikimedia.org/T371374) [15:07:03] (03CR) 10Ssingh: [C:03+2] Move PyBal failing BGP sessions alert to team-sre and page [alerts] - 10https://gerrit.wikimedia.org/r/1077025 (owner: 10Ssingh) [15:07:10] (03PS1) 10Gmodena: dse-k8s-services: content_history: update docker image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077047 (https://phabricator.wikimedia.org/T368787) [15:07:13] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=aux-k8s-worker1003.eqiad.wmnet [15:07:17] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=aux-k8s-ctrl1003.eqiad.wmnet [15:08:27] (03PS1) 10Majavah: Allow users to sign up on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077048 (https://phabricator.wikimedia.org/T371374) [15:08:29] (03CR) 10Volans: [C:03+1] "Thanks!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1077028 (owner: 10CDanis) [15:10:27] (03CR) 10AikoChou: [C:03+2] ml-services: fix indentation in articlequality resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077027 (owner: 10AikoChou) [15:11:13] (03PS10) 10Giuseppe Lavagetto: profile: add conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075040 [15:11:13] (03PS8) 10Giuseppe Lavagetto: puppetserver: run conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075220 (https://phabricator.wikimedia.org/T374723) [15:11:31] (03Merged) 10jenkins-bot: ml-services: fix indentation in articlequality resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077027 (owner: 10AikoChou) [15:11:57] (03PS2) 10Gmodena: dse-k8s-services: content_history: update docker image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077047 (https://phabricator.wikimedia.org/T368787) [15:13:25] (03PS2) 10Majavah: Make Wikitech behave a bit more like a SUL wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077046 (https://phabricator.wikimedia.org/T371374) [15:13:25] (03PS2) 10Majavah: Allow users to sign up on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077048 (https://phabricator.wikimedia.org/T371374) [15:13:38] (03PS2) 10AikoChou: ml-services: update ref-need model and increase cpu and memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077024 (https://phabricator.wikimedia.org/T371902) [15:14:08] (03CR) 10CI reject: [V:04-1] Make Wikitech behave a bit more like a SUL wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077046 (https://phabricator.wikimedia.org/T371374) (owner: 10Majavah) [15:14:10] (03CR) 10CI reject: [V:04-1] Allow users to sign up on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077048 (https://phabricator.wikimedia.org/T371374) (owner: 10Majavah) [15:14:12] (03CR) 10Cwhite: [C:03+2] curator: increase timeout for setting replicas [puppet] - 
10https://gerrit.wikimedia.org/r/1077042 (https://phabricator.wikimedia.org/T364190) (owner: 10Cwhite) [15:14:48] (03PS3) 10Majavah: Make Wikitech behave a bit more like a SUL wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077046 (https://phabricator.wikimedia.org/T371374) [15:14:48] (03PS3) 10Majavah: Allow users to sign up on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077048 (https://phabricator.wikimedia.org/T371374) [15:14:59] (03CR) 10AikoChou: [C:03+2] ml-services: update ref-need model and increase cpu and memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077024 (https://phabricator.wikimedia.org/T371902) (owner: 10AikoChou) [15:16:20] (03CR) 10Papaul: [C:03+2] Arelion IPv6 renumbering [homer/public] - 10https://gerrit.wikimedia.org/r/1035376 (https://phabricator.wikimedia.org/T365697) (owner: 10Ayounsi) [15:16:32] (03Merged) 10jenkins-bot: ml-services: update ref-need model and increase cpu and memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077024 (https://phabricator.wikimedia.org/T371902) (owner: 10AikoChou) [15:16:54] (03Merged) 10jenkins-bot: Arelion IPv6 renumbering [homer/public] - 10https://gerrit.wikimedia.org/r/1035376 (https://phabricator.wikimedia.org/T365697) (owner: 10Ayounsi) [15:18:33] (03CR) 10Vgutierrez: [C:03+1] "Looking good:" [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [15:22:03] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10192209 (10aborrero) [15:23:51] (03CR) 10CDanis: [C:03+2] parametrize test_hiera_lookup on fmt [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077028 (owner: 10CDanis) [15:23:56] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4169/co" [puppet] - 
10https://gerrit.wikimedia.org/r/1075220 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto) [15:24:06] (03PS1) 10Arturo Borrero Gonzalez: cloudlb2004-dev: give it a puppet role [puppet] - 10https://gerrit.wikimedia.org/r/1077049 (https://phabricator.wikimedia.org/T370678) [15:24:47] 06SRE, 07SRE-Unowned, 06Infrastructure-Foundations, 13Patch-For-Review: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014#10192248 (10elukey) Second day of hackathon! * We created a new VM for codfw traffic (i... [15:25:47] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudlb2004-dev: give it a puppet role [puppet] - 10https://gerrit.wikimedia.org/r/1077049 (https://phabricator.wikimedia.org/T370678) (owner: 10Arturo Borrero Gonzalez) [15:26:59] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10192267 (10ssingh) Hi @RobH: Is this confirmed for tomorrow Oct 2? [15:27:22] (03PS4) 10CDanis: CoreDNS chart changes to serve outside the cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077043 (https://phabricator.wikimedia.org/T344171) [15:27:47] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10192269 (10aborrero) please @Jhancock.wm try now. [15:28:13] (03PS3) 10JMeybohm: kubernetes: Migrate taint node-role.kubernetes.io/master [puppet] - 10https://gerrit.wikimedia.org/r/1077036 (https://phabricator.wikimedia.org/T334234) [15:28:19] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10192278 (10RobH) Yes, they'll be showing up onsite around 09:00 CET / 00:00 Pacific. We'll want to fully depool and power down these two hosts in advance of their arrival. I figured I would just... 
[15:29:22] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077036 (https://phabricator.wikimedia.org/T334234) (owner: 10JMeybohm) [15:29:57] (03CR) 10Elukey: [C:03+1] kubernetes: Migrate taint node-role.kubernetes.io/master [puppet] - 10https://gerrit.wikimedia.org/r/1077036 (https://phabricator.wikimedia.org/T334234) (owner: 10JMeybohm) [15:33:49] (03CR) 10Btullis: [C:03+2] airflow: Use the analytics user for task executor pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077037 (https://phabricator.wikimedia.org/T375895) (owner: 10Btullis) [15:33:54] (03PS11) 10Giuseppe Lavagetto: profile: add conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075040 [15:33:54] (03PS9) 10Giuseppe Lavagetto: puppetserver: run conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075220 (https://phabricator.wikimedia.org/T374723) [15:33:54] (03PS8) 10Giuseppe Lavagetto: conftool::client: allow setting the conftool2git address [puppet] - 10https://gerrit.wikimedia.org/r/1075039 (https://phabricator.wikimedia.org/T374723) [15:34:15] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10192348 (10ssingh) Thanks @RobH, that works for us. @Vgutierrez will depool the two hosts in advance of the event and downtime. 
[15:34:21] (03Merged) 10jenkins-bot: parametrize test_hiera_lookup on fmt [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077028 (owner: 10CDanis) [15:34:50] (03Merged) 10jenkins-bot: airflow: Use the analytics user for task executor pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077037 (https://phabricator.wikimedia.org/T375895) (owner: 10Btullis) [15:36:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host logging-hd2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:36:09] (03CR) 10Bugreporter: "Tests are needed to make sure (1) new accounts created in Wikitech are connected to SUL and (2) It is not possible to create accounts that" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077048 (https://phabricator.wikimedia.org/T371374) (owner: 10Majavah) [15:36:29] (03PS1) 10Ryan Kemper: wdqs.data-transfer: log now mentions bg instance [cookbooks] - 10https://gerrit.wikimedia.org/r/1077051 (https://phabricator.wikimedia.org/T364077) [15:36:58] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] deployment_server: Print logs command when mwscript-k8s --attach fails (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1076893 (https://phabricator.wikimedia.org/T369142) (owner: 10RLazarus) [15:37:30] (03CR) 10Stevemunene: [C:03+1] wdqs.data-transfer: log now mentions bg instance [cookbooks] - 10https://gerrit.wikimedia.org/r/1077051 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [15:37:35] (03CR) 10Ryan Kemper: [V:03+2 C:03+2] wdqs.data-transfer: log now mentions bg instance [cookbooks] - 10https://gerrit.wikimedia.org/r/1077051 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [15:39:03] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [15:39:21] (03PS1) 10Dzahn: gerrit: raise throttling duration to 10 minutes [puppet] - 10https://gerrit.wikimedia.org/r/1077053 (https://phabricator.wikimedia.org/T375996) [15:39:33] 
(03PS5) 10CDanis: CoreDNS chart changes to serve outside the cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077043 (https://phabricator.wikimedia.org/T344171) [15:39:58] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1077053 (https://phabricator.wikimedia.org/T375996) (owner: 10Dzahn) [15:40:10] (03CR) 10Dzahn: [C:03+2] gerrit: raise throttling duration to 10 minutes [puppet] - 10https://gerrit.wikimedia.org/r/1077053 (https://phabricator.wikimedia.org/T375996) (owner: 10Dzahn) [15:44:49] 06SRE, 10WikimediaDebug, 10wikitech.wikimedia.org: With XWikimediaDebug enabled, wikitech.wikimedia.org gets redirected to foundation.wikimedia.org until Wikitech is on k8s - https://phabricator.wikimedia.org/T375795#10192430 (10Bugreporter) 05Open→03Declined Reopen if it is still needed. [15:44:50] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host logging-hd2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:44:56] (03CR) 10BCornwall: [C:03+2] archiva: Remove rsa-2048 certs [puppet] - 10https://gerrit.wikimedia.org/r/1075605 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [15:45:47] (03CR) 10Xcollazo: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1077047 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena) [15:46:45] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Had a quick look ahead of the window tomorrow, LGTM 👍" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076797 (https://phabricator.wikimedia.org/T373821) (owner: 10Daimona Eaytoy) [15:46:51] 06SRE, 06collaboration-services, 06Traffic, 13Patch-For-Review, 10Release-Engineering-Team (Radar): implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#10192476 (10Jelto) [15:50:47] (03PS1) 10Ryan Kemper: wdqs.data-transfer: refuse xfer on differing jnl [cookbooks] - 10https://gerrit.wikimedia.org/r/1077059 (https://phabricator.wikimedia.org/T364077) [15:51:16] (03PS3) 10RLazarus: deployment_server: Print logs command when mwscript-k8s --attach fails [puppet] - 10https://gerrit.wikimedia.org/r/1076893 (https://phabricator.wikimedia.org/T369142) [15:51:40] (03CR) 10RLazarus: deployment_server: Print logs command when mwscript-k8s --attach fails (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1076893 (https://phabricator.wikimedia.org/T369142) (owner: 10RLazarus) [15:52:33] (03CR) 10Majavah: [C:03+2] hieradata: Remove data for clouddb-services [puppet] - 10https://gerrit.wikimedia.org/r/1076785 (owner: 10Majavah) [15:52:41] jouncebot: nowandnexr [15:52:43] jouncebot: nowandnext [15:52:43] For the next 0 hour(s) and 7 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241001T1500) [15:52:43] In 0 hour(s) and 7 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241001T1600) [15:52:44] 06SRE, 06collaboration-services, 06Traffic, 13Patch-For-Review, 10Release-Engineering-Team (Radar): implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#10192543 (10Jelto) 
05Open→03Resolved Throttling is active for around one month on a... [15:52:53] (03CR) 10Ladsgroup: [C:03+2] Make Wikitech behave a bit more like a SUL wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077046 (https://phabricator.wikimedia.org/T371374) (owner: 10Majavah) [15:53:59] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T364077, this test transfer should fail) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs1023.eqiad.wmnet, repooling both afterwards [15:54:00] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T364077, this test transfer should fail) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs1023.eqiad.wmnet, repooling both afterwards [15:54:02] T364077: Adapt the wdqs data-transfer cookbook to operate with federated subgraphs - https://phabricator.wikimedia.org/T364077 [15:54:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077046 (https://phabricator.wikimedia.org/T371374) (owner: 10Majavah) [15:54:51] (03Merged) 10jenkins-bot: Make Wikitech behave a bit more like a SUL wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077046 (https://phabricator.wikimedia.org/T371374) (owner: 10Majavah) [15:55:16] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1077046|Make Wikitech behave a bit more like a SUL wiki (T371374)]] [15:55:21] T371374: mediawiki-config: consolidate labswiki - https://phabricator.wikimedia.org/T371374 [15:56:27] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T364077, this test transfer should succeed) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling both afterwards [15:56:37] (03CR) 10Ladsgroup: "the only thing to double check is that we don't want people to create account with the same username as someone in SUL and squat on it (as" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1077048 (https://phabricator.wikimedia.org/T371374) (owner: 10Majavah) [15:56:48] (03PS39) 10Ssingh: sre.dns.roll-restart: add rolling restart script for DNS boxes [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [15:57:47] (03CR) 10Ssingh: "DRY-RUN: Skipping conftool update on dry-run mode: {"dns1005.wikimedia.org": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=d" [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [15:58:13] (03CR) 10Ssingh: "DRY-RUN: Skipping conftool update on dry-run mode: {"dns1005.wikimedia.org": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=d" [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [15:58:35] !log ladsgroup@deploy2002 taavi, ladsgroup: Backport for [[gerrit:1077046|Make Wikitech behave a bit more like a SUL wiki (T371374)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:58:40] (03CR) 10Majavah: "as long as you run migratePass0 (and maybe migratePass1? not sure) it should be fine" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077048 (https://phabricator.wikimedia.org/T371374) (owner: 10Majavah) [15:59:02] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T364077, this test transfer should succeed) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling both afterwards [15:59:05] T364077: Adapt the wdqs data-transfer cookbook to operate with federated subgraphs - https://phabricator.wikimedia.org/T364077 [16:00:05] jhathaway and rzl: Your horoscope predicts another Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241001T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:08] !log ladsgroup@deploy2002 taavi, ladsgroup: Continuing with sync [16:00:27] Amir1: nothing for the puppet window, it's all yours still [16:00:35] (03PS3) 10Gmodena: dse-k8s-services: content_history: update docker image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077047 (https://phabricator.wikimedia.org/T368787) [16:00:37] thanks [16:01:11] (03PS1) 10Ryan Kemper: wdqs.data-transfer: add missing self. [cookbooks] - 10https://gerrit.wikimedia.org/r/1077066 (https://phabricator.wikimedia.org/T364077) [16:01:53] (03CR) 10Stevemunene: [C:03+1] wdqs.data-transfer: add missing self. [cookbooks] - 10https://gerrit.wikimedia.org/r/1077066 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [16:03:43] (03CR) 10Ryan Kemper: [V:03+2 C:03+2] wdqs.data-transfer: add missing self. [cookbooks] - 10https://gerrit.wikimedia.org/r/1077066 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [16:03:47] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [16:03:48] aikochou@deploy2002: Failed to log message to wiki. Somebody should check the error logs. [16:04:53] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077046|Make Wikitech behave a bit more like a SUL wiki (T371374)]] (duration: 09m 36s) [16:04:54] ladsgroup@deploy2002: Failed to log message to wiki. Somebody should check the error logs. [16:04:56] T371374: mediawiki-config: consolidate labswiki - https://phabricator.wikimedia.org/T371374 [16:05:16] (03CR) 10CI reject: [V:04-1] wdqs.data-transfer: refuse xfer on differing jnl [cookbooks] - 10https://gerrit.wikimedia.org/r/1077059 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [16:05:58] (03CR) 10Ladsgroup: "I don't know if we are going to run that now. First we need to migrate usernames." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077048 (https://phabricator.wikimedia.org/T371374) (owner: 10Majavah) [16:06:47] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [16:06:47] aikochou@deploy2002: Failed to log message to wiki. Somebody should check the error logs. [16:10:40] (03PS2) 10Ryan Kemper: wdqs.data-transfer: refuse xfer on differing jnl [cookbooks] - 10https://gerrit.wikimedia.org/r/1077059 (https://phabricator.wikimedia.org/T364077) [16:11:25] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T364077, test out new flag) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling both afterwards [16:11:27] ryankemper@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [16:11:27] T364077: Adapt the wdqs data-transfer cookbook to operate with federated subgraphs - https://phabricator.wikimedia.org/T364077 [16:12:01] (03CR) 10CI reject: [V:04-1] sre.dns.roll-restart: add rolling restart script for DNS boxes [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [16:16:05] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T364077, test out new flag) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling both afterwards [16:16:07] ryankemper@cumin2002: Failed to log message to wiki. Somebody should check the error logs. 
[16:17:44] * ryankemper wonders if stashbot needs to be restarted [16:17:57] it's the wikitech SUL migration I think [16:18:10] so it'll need a password reset [16:21:37] https://phabricator.wikimedia.org/T376176 [16:22:19] (cc bd808 ) [16:23:22] (03CR) 10CI reject: [V:04-1] wdqs.data-transfer: refuse xfer on differing jnl [cookbooks] - 10https://gerrit.wikimedia.org/r/1077059 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [16:27:12] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[39-41] into service - https://phabricator.wikimedia.org/T372814#10192815 (10dcaro) 05Open→03Resolved Done! all three upgraded, setup and joined in the cluster. [16:28:52] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375328#10192849 (10Jhancock.wm) pinging for now. gonna hold to make sure the alerts clear for at least the logging-hd servers [16:28:54] (03PS1) 10Herron: thanos-query: set OTEL_SERVICE_NAME env variable [puppet] - 10https://gerrit.wikimedia.org/r/1077068 (https://phabricator.wikimedia.org/T376179) [16:29:26] (03PS5) 10Herron: opentelemetry::collector: set default port and update template [puppet] - 10https://gerrit.wikimedia.org/r/1076006 (https://phabricator.wikimedia.org/T376179) [16:29:43] (03PS2) 10Herron: thanos-query: set OTEL_SERVICE_NAME env variable [puppet] - 10https://gerrit.wikimedia.org/r/1077068 (https://phabricator.wikimedia.org/T376179) [16:30:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2004-dev.codfw.wmnet with OS bookworm [16:30:24] jhancock@cumin2002: Failed to log message to wiki. Somebody should check the error logs. 
[16:30:34] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10192858 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudlb2004-dev.codfw.wmnet with OS bookworm [16:30:55] (03PS7) 10Bking: dse-k8s: Add service configuration for airflow-analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076813 (https://phabricator.wikimedia.org/T371208) [16:32:23] !log aikochou@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [16:32:24] aikochou@deploy2002: Failed to log message to wiki. Somebody should check the error logs. [16:32:53] (03CR) 10Herron: "this should sort out the OTLPResourceNoServiceName issue, along with the patch below it to get traces flowing from the titan hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1077068 (https://phabricator.wikimedia.org/T376179) (owner: 10Herron) [16:34:37] (03CR) 10Bking: wdqs-categories: introduce VM for testing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1076841 (https://phabricator.wikimedia.org/T375687) (owner: 10Bking) [16:41:23] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [16:41:24] ayounsi@cumin1002: Failed to log message to wiki. Somebody should check the error logs. 
[16:43:50] (03CR) 10Vgutierrez: "tests are happy, particularly text/09-analytics-cookies.vtc, see inline comment" [puppet] - 10https://gerrit.wikimedia.org/r/1068085 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [16:44:17] (03PS7) 10Ryan Kemper: wdqs-categories: introduce VM for testing [puppet] - 10https://gerrit.wikimedia.org/r/1076841 (https://phabricator.wikimedia.org/T375687) (owner: 10Bking) [16:45:56] (03CR) 10BCornwall: varnish: Give 1% of views RSA cert warnings (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [16:47:06] (03CR) 10Vgutierrez: "setting or unsetting an HTTP request header triggers a line on varnishlog output, var VMOD triggers something similar? could we use `std.l" [puppet] - 10https://gerrit.wikimedia.org/r/1068085 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [16:48:12] (03PS2) 10Dzahn: admin: add zoe to deployment (move from ldap_only_users) [puppet] - 10https://gerrit.wikimedia.org/r/1069175 (https://phabricator.wikimedia.org/T373666) (owner: 10Ssingh) [16:48:29] stashbot not writing to wikitech is known. T376176 [16:48:29] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. 
[16:48:30] T376176: stashbot logging to Wikitech fails because of SUL migration - https://phabricator.wikimedia.org/T376176 [16:48:35] (03CR) 10Dzahn: "PS2: manual rebase was needed because other users got added meanwhile" [puppet] - 10https://gerrit.wikimedia.org/r/1069175 (https://phabricator.wikimedia.org/T373666) (owner: 10Ssingh) [16:49:17] (03PS3) 10Dzahn: admin: add zoe to deployment (move from ldap_only_users) [puppet] - 10https://gerrit.wikimedia.org/r/1069175 (https://phabricator.wikimedia.org/T373666) (owner: 10Ssingh) [16:49:43] (03CR) 10Dzahn: "PS3: another fix because a user was removed from deployment since original upload" [puppet] - 10https://gerrit.wikimedia.org/r/1069175 (https://phabricator.wikimedia.org/T373666) (owner: 10Ssingh) [16:50:27] (03CR) 10Ssingh: varnish: Give 1% of views RSA cert warnings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [16:50:40] (03CR) 10Dzahn: [C:03+2] "has approval from manager, deployment group owner, +1 from clinic duty... going ahead and merging" [puppet] - 10https://gerrit.wikimedia.org/r/1069175 (https://phabricator.wikimedia.org/T373666) (owner: 10Ssingh) [16:56:29] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:59:12] (03CR) 10Scott French: [C:03+1] "Aside from the typo, LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1076893 (https://phabricator.wikimedia.org/T369142) (owner: 10RLazarus) [16:59:45] (03CR) 10Dzahn: [C:03+2] "Zoe's user has been created on deployment server deploy1003" [puppet] - 10https://gerrit.wikimedia.org/r/1069175 (https://phabricator.wikimedia.org/T373666) (owner: 10Ssingh) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241001T1700) [17:00:41] 06SRE, 06Data-Engineering-Icebox, 06Traffic, 10WMF-General-or-Unknown, and 2 others: Requests for /static get an invalid WMF-Last-Access cookie for wikipedia.org on non-Wikipedia requests - https://phabricator.wikimedia.org/T261803#10192971 (10matmarex) This is still a problem today, and it makes for a dis... [17:02:25] !log run extensions/CentralAuth/maintenance/migratePass0 on wikitech [17:02:25] taavi: Failed to log message to wiki. Somebody should check the error logs. [17:02:49] (03CR) 10BCornwall: varnish: Replace X-IS-ALT-DOMAIN with variable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1068085 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [17:03:59] !log aikochou@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . [17:04:00] aikochou@deploy2002: Failed to log message to wiki. Somebody should check the error logs. [17:05:13] FIRING: [3x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:05:13] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for zoe - https://phabricator.wikimedia.org/T373666#10193016 (10Dzahn) @zoe Welcome to deployers! Your shell user has been created now. 
Within the next 30 min or so it will be created on all servers needed for deployment. Ther... [17:05:42] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for zoe - https://phabricator.wikimedia.org/T373666#10193018 (10Dzahn) 05Open→03Resolved [17:06:17] (03CR) 10Vgutierrez: varnish: Replace X-IS-ALT-DOMAIN with variable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1068085 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [17:06:28] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [17:06:29] jhathaway@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [17:06:56] !log welcome new deployment user zoe T373666 [17:06:57] mutante: Failed to log message to wiki. Somebody should check the error logs. [17:07:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from ores-legacy.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=ulsfo&var-cluster=text&var-origin=ores-legacy.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [17:08:53] PROBLEM - Hadoop NodeManager on an-worker1177 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:14:06] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Seanleong-WMDE - https://phabricator.wikimedia.org/T376034#10193059 (10Dzahn) @KFrancis A new NDA for a WMDE staff member is needed here. Thank you as always! @seanleong-WMDE Hi! Could you please email [[ https://meta.wikimedia... 
[17:14:23] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - ms-be1077 - https://phabricator.wikimedia.org/T376094#10193074 (10Dzahn) [17:14:52] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - ms-be1077 / logging-hd1005 - https://phabricator.wikimedia.org/T376094#10193075 (10Dzahn) [17:15:33] (03CR) 10FNegri: alertmanager: fix WMCS template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1077038 (https://phabricator.wikimedia.org/T375479) (owner: 10FNegri) [17:18:41] (03CR) 10FNegri: alertmanager: fix WMCS template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1077038 (https://phabricator.wikimedia.org/T375479) (owner: 10FNegri) [17:20:33] PROBLEM - Hadoop NodeManager on an-worker1176 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:22:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from ores-legacy.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=ulsfo&var-cluster=text&var-origin=ores-legacy.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [17:22:53] RECOVERY - Hadoop NodeManager on an-worker1177 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:28:20] (03PS40) 10CDobbins: sre.dns.roll-restart: add rolling restart script for DNS boxes [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [17:30:18] (03PS4) 10BCornwall: varnish: Replace X-IS-ALT-DOMAIN with variable [puppet] - 10https://gerrit.wikimedia.org/r/1068085 
(https://phabricator.wikimedia.org/T373550) [17:30:23] (03CR) 10BCornwall: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1068085 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [17:32:21] (03PS5) 10BCornwall: varnish: Replace X-IS-ALT-DOMAIN with variable [puppet] - 10https://gerrit.wikimedia.org/r/1068085 (https://phabricator.wikimedia.org/T373550) [17:33:39] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:33:51] PROBLEM - BGP status on cr1-magru is CRITICAL: BGP CRITICAL - AS12956/IPv4: Connect - Telxius, AS12956/IPv6: Connect - Telxius https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:34:19] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:34:25] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:35:02] (03PS1) 10Ladsgroup: Allow storing of passwords for local users in wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077079 (https://phabricator.wikimedia.org/T376140) [17:35:33] RECOVERY - Hadoop NodeManager on an-worker1176 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:37:24] (03CR) 10Ladsgroup: "in mwdebug1002, confirming it fixed the issue." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077079 (https://phabricator.wikimedia.org/T376140) (owner: 10Ladsgroup) [17:39:32] (03CR) 10Majavah: "why not set `wmgLocalAuthLoginOnly` instead? 
this is the only place where this is read" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077079 (https://phabricator.wikimedia.org/T376140) (owner: 10Ladsgroup) [17:40:54] (03PS2) 10Ladsgroup: Allow storing of passwords for local users in wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077079 (https://phabricator.wikimedia.org/T376140) [17:40:59] (03CR) 10Ladsgroup: "done." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077079 (https://phabricator.wikimedia.org/T376140) (owner: 10Ladsgroup) [17:42:17] (03CR) 10CI reject: [V:04-1] sre.dns.roll-restart: add rolling restart script for DNS boxes [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [17:43:41] (03CR) 10Scott French: [C:03+2] "Thanks, Hugh!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074494 (https://phabricator.wikimedia.org/T375243) (owner: 10Scott French) [17:43:47] (03PS1) 10Aleksandar Mastilovic: Added a simple namespace for HDFS synchronizer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077080 [17:45:15] (03Merged) 10jenkins-bot: shellbox: add support for service.deployment: none [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074494 (https://phabricator.wikimedia.org/T375243) (owner: 10Scott French) [17:49:09] (03CR) 10Bking: [C:03+2] Added a simple namespace for HDFS synchronizer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077080 (owner: 10Aleksandar Mastilovic) [17:49:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077079 (https://phabricator.wikimedia.org/T376140) (owner: 10Ladsgroup) [17:50:15] (03Merged) 10jenkins-bot: Allow storing of passwords for local users in wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077079 (https://phabricator.wikimedia.org/T376140) (owner: 10Ladsgroup) [17:50:40] !log jhancock@cumin2002 END (FAIL) - Cookbook 
sre.hosts.reimage (exit_code=99) for host cloudlb2004-dev.codfw.wmnet with OS bookworm [17:50:41] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1077079|Allow storing of passwords for local users in wikitech (T376140)]] [17:50:44] T376140: Setting new password at wikitech does not work - https://phabricator.wikimedia.org/T376140 [17:50:46] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10193222 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudlb2004-dev.codfw.wmnet with OS bookworm executed... [17:52:29] (03Merged) 10jenkins-bot: Added a simple namespace for HDFS synchronizer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077080 (owner: 10Aleksandar Mastilovic) [17:53:09] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1077079|Allow storing of passwords for local users in wikitech (T376140)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:55:03] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [17:55:08] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [17:56:17] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[17:59:45] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077079|Allow storing of passwords for local users in wikitech (T376140)]] (duration: 09m 03s) [17:59:48] T376140: Setting new password at wikitech does not work - https://phabricator.wikimedia.org/T376140 [18:00:05] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241001T1800) [18:00:32] brennen: if I did it right, I promoted group0 to the new mediawiki version this "morning" :] [18:04:02] (03CR) 10Daniel Kinzler: REST: Make experimental endpoints available on beta and testwiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076058 (https://phabricator.wikimedia.org/T375512) (owner: 10BPirkle) [18:05:04] (03CR) 10Scott French: "Thanks, Ahmon!" [puppet] - 10https://gerrit.wikimedia.org/r/1075981 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [18:05:24] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: puppetserver ram upgrades - decom memory option - https://phabricator.wikimedia.org/T376058#10193282 (10VRiley-WMF) [18:05:34] (03CR) 10Scott French: [C:03+2] scap: remove stale production dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/1075981 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [18:07:35] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: puppetserver ram upgrades - decom memory option - https://phabricator.wikimedia.org/T376058#10193288 (10VRiley-WMF) @MoritzMuehlenhoff Once the server can be powered off, we will insert the RAM and when it powers back up, it should instantly recognize it. I'... 
[18:13:27] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10193306 (10Papaul) 05Open→03Resolved Add both power supplies in Netbox under inventory [18:18:21] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10193313 (10RobH) > Your appointment has been scheduled between Wed, Oct 2, 2024 8:00 AM and Wed, Oct 2, 2024 12:00 PM. Please check back here for updates. > Your technician is scheduled to arriv... [18:23:56] (03PS1) 10Ladsgroup: mariadb: Remove specific wikitech grants [puppet] - 10https://gerrit.wikimedia.org/r/1077082 (https://phabricator.wikimedia.org/T376129) [18:24:15] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:24:57] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:25:12] (03CR) 10BCornwall: [C:03+2] trafficserver: no logging on disabled monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1072295 (owner: 10BCornwall) [18:26:31] (03Abandoned) 10BCornwall: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1076871 (owner: 10Ncmonitor) [18:27:39] (03PS1) 10Ladsgroup: mariadb: Remove wikitech firewall holes [puppet] - 10https://gerrit.wikimedia.org/r/1077083 (https://phabricator.wikimedia.org/T376129) [18:28:31] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:28:57] PROBLEM - mailman archives on lists1004 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:30:21] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8997 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:30:47] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52775 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:35:11] 06SRE, 06Data-Engineering-Icebox, 06Traffic, 10WMF-General-or-Unknown, and 2 others: Requests for /static get an invalid WMF-Last-Access cookie for wikipedia.org on non-Wikipedia requests - https://phabricator.wikimedia.org/T261803#10193353 (10Tgr) Yeah, the wider issue here is that setting the cookie on c... [18:35:34] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1077083 (https://phabricator.wikimedia.org/T376129) (owner: 10Ladsgroup) [18:36:47] (03PS3) 10Scott French: shellbox-syntaxhighlight: add migration release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074495 (https://phabricator.wikimedia.org/T375243) [18:36:47] (03CR) 10Scott French: "This follows Iacc40c64d1b1376ea6c2cf3b4cdfd2f6c816f0e0 to actually turn up the first (currently staging-only) routed-via-main 8.1 deployme" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074495 (https://phabricator.wikimedia.org/T375243) (owner: 10Scott French) [18:38:43] 10ops-eqiad, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T376192 (10phaultfinder) 03NEW [18:56:52] (03PS8) 10Bking: wdqs-categories: introduce VM for testing [puppet] - 10https://gerrit.wikimedia.org/r/1076841 (https://phabricator.wikimedia.org/T375687) [18:58:49] (03CR) 10Bking: wdqs-categories: introduce VM for testing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1076841 (https://phabricator.wikimedia.org/T375687) (owner: 10Bking) [19:05:44] (03PS1) 10Bking: wdqs-categories: introduce VM for 
testing [puppet] - 10https://gerrit.wikimedia.org/r/1077088 (https://phabricator.wikimedia.org/T375687) [19:07:32] (03CR) 10CI reject: [V:04-1] wdqs-categories: introduce VM for testing [puppet] - 10https://gerrit.wikimedia.org/r/1077088 (https://phabricator.wikimedia.org/T375687) (owner: 10Bking) [19:08:14] (03PS1) 10Bking: wdqs-categories: introduce VM for testing [puppet] - 10https://gerrit.wikimedia.org/r/1077089 (https://phabricator.wikimedia.org/T375687) [19:08:34] (03Abandoned) 10Bking: wdqs-categories: introduce VM for testing [puppet] - 10https://gerrit.wikimedia.org/r/1077088 (https://phabricator.wikimedia.org/T375687) (owner: 10Bking) [19:10:05] (03CR) 10CI reject: [V:04-1] wdqs-categories: introduce VM for testing [puppet] - 10https://gerrit.wikimedia.org/r/1077089 (https://phabricator.wikimedia.org/T375687) (owner: 10Bking) [19:10:08] (03PS9) 10Bking: wdqs-categories: introduce VM for testing [puppet] - 10https://gerrit.wikimedia.org/r/1076841 (https://phabricator.wikimedia.org/T375687) [19:11:43] (03Abandoned) 10Bking: wdqs-categories: introduce VM for testing [puppet] - 10https://gerrit.wikimedia.org/r/1077089 (https://phabricator.wikimedia.org/T375687) (owner: 10Bking) [19:11:54] (03CR) 10CI reject: [V:04-1] wdqs-categories: introduce VM for testing [puppet] - 10https://gerrit.wikimedia.org/r/1076841 (https://phabricator.wikimedia.org/T375687) (owner: 10Bking) [19:13:03] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 10Data-Platform-SRE (2024.09.28 - 2024.10.18), 13Patch-For-Review: eqiad: request 1 VM for wdqs-categories - https://phabricator.wikimedia.org/T376079#10193477 (10MoritzMuehlenhoff) Our upper boundary for VMs is 16G (the current virt servers have 64G, t... 
[19:18:42] (03CR) 10Ssingh: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077066 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [19:20:21] (03CR) 10Ssingh: "Issuing a recheck on this because CI for cookbooks is broken with:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077066 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [19:24:15] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:24:51] (03CR) 10Ssingh: "Hi Ryan, Steve:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077066 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [19:24:57] (03PS4) 10BPirkle: REST: Make experimental endpoints available on beta and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076058 (https://phabricator.wikimedia.org/T375512) [19:24:57] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:25:21] (03CR) 10BPirkle: REST: Make experimental endpoints available on beta and testwiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076058 (https://phabricator.wikimedia.org/T375512) (owner: 10BPirkle) [19:27:46] (03CR) 10BCornwall: [V:03+1 C:03+2] NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [19:33:37] (03CR) 10Ryan Kemper: wdqs max lag: target specific port (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1073533 (https://phabricator.wikimedia.org/T374916) (owner: 10Ryan Kemper) [19:34:03] (03CR) 10Ryan Kemper: [C:03+2] wdqs max lag: target specific port [alerts] - 
10https://gerrit.wikimedia.org/r/1073533 (https://phabricator.wikimedia.org/T374916) (owner: 10Ryan Kemper) [19:35:15] (03Merged) 10jenkins-bot: wdqs max lag: target specific port [alerts] - 10https://gerrit.wikimedia.org/r/1073533 (https://phabricator.wikimedia.org/T374916) (owner: 10Ryan Kemper) [19:35:18] (03PS2) 10Ryan Kemper: wdqs max lag: break up extremely long line [alerts] - 10https://gerrit.wikimedia.org/r/1073534 [19:36:32] (03CR) 10Ryan Kemper: [V:03+2 C:03+2] wdqs max lag: break up extremely long line [alerts] - 10https://gerrit.wikimedia.org/r/1073534 (owner: 10Ryan Kemper) [19:37:42] (03Merged) 10jenkins-bot: wdqs max lag: break up extremely long line [alerts] - 10https://gerrit.wikimedia.org/r/1073534 (owner: 10Ryan Kemper) [19:39:09] (03CR) 10Ryan Kemper: "We can't merge this quite yet, but we will be able to merge it after this week's branch cut (#25) for the wikidata.org extension" [puppet] - 10https://gerrit.wikimedia.org/r/1073529 (https://phabricator.wikimedia.org/T374916) (owner: 10Ryan Kemper) [19:39:23] (03PS1) 10CDanis: coredns: improve debuggability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077090 [19:47:34] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [19:48:17] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [19:50:01] (03Abandoned) 10CDanis: coredns: add support for Service externalIPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075311 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [19:50:46] (03PS2) 10CDanis: coredns: improve debuggability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077090 (https://phabricator.wikimedia.org/T344171) [19:51:34] (03PS10) 10Ryan Kemper: wdqs-categories: introduce VM for testing [puppet] - 10https://gerrit.wikimedia.org/r/1076841 (https://phabricator.wikimedia.org/T375687) (owner: 10Bking) [19:52:39] FIRING: 
CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:54:37] Gerrit is super slow right now... Is it just me? [19:57:39] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:59:43] (03CR) 10Daniel Kinzler: [C:03+1] "Looks good to me!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076058 (https://phabricator.wikimedia.org/T375512) (owner: 10BPirkle) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241001T2000). [20:00:05] Nemoralis: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:17] o/ [20:00:25] I need to restart the CI Jenkins [20:00:27] that is a quick op [20:01:41] I'll do it after Nemoralis patch [20:02:48] hmm [20:02:56] !log Restarting CI Jenkins [20:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:08] Nemoralis: I am doing your patch [20:03:12] ok [20:03:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076254 (https://phabricator.wikimedia.org/T367009) (owner: 10NMW03) [20:03:48] I am not sure we need to test it on the debug servers first [20:04:11] (03Merged) 10jenkins-bot: Update wgMetaNamespace for tlywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076254 (https://phabricator.wikimedia.org/T367009) (owner: 10NMW03) [20:04:40] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1076254|Update wgMetaNamespace for tlywiki (T367009)]] [20:04:43] T367009: Change namespace aliases for Talysh Wikipedia - https://phabricator.wikimedia.org/T367009 [20:05:08] it is a namespace change, do you have to run a maintenance script? 
[20:05:55] PROBLEM - mailman3_queue_size on lists1004 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 112 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [20:05:59] yes [20:06:56] !log hashar@deploy2002 nmw03, hashar: Backport for [[gerrit:1076254|Update wgMetaNamespace for tlywiki (T367009)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:07:28] !log hashar@deploy2002 nmw03, hashar: Continuing with sync [20:07:33] lgtm by the way [20:07:35] https://tly.wikipedia.org/wiki/Xususi:NamespaceInfo [20:07:36] that is namespaceDupes [20:07:39] FIRING: [3x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:07:40] \o/ [20:08:53] PROBLEM - Hadoop NodeManager on an-worker1177 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:09:55] * a8e2490b0cb - --help ftw ! (19 years ago) [20:09:59] yeah I have touched that script :D [20:10:34] Nemoralis: I am waiting for the deployment to have completed and I will then run the script [20:10:40] sure [20:10:41] iirc it autosolves issues [20:10:48] 20:10:44 K8s deployment progress: 70% (ok: 1705; fail: 0; left: 727) \ [20:10:50] in progress! 
[20:10:55] RECOVERY - mailman3_queue_size on lists1004 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [20:12:02] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1076254|Update wgMetaNamespace for tlywiki (T367009)]] (duration: 07m 21s) [20:12:05] T367009: Change namespace aliases for Talysh Wikipedia - https://phabricator.wikimedia.org/T367009 [20:14:19] Nemoralis: the renames: https://phabricator.wikimedia.org/T367009#10193709 [20:15:48] did it rename automatically? [20:16:29] yeah should have [20:17:10] I think you have to pass "--fix" [20:17:20] You do [20:17:24] from docs: "Attempt to automatically fix errors. You must pass this option for the script to actually perform any change to the database. Otherwise, it will just print what would be done. The change instruction is a second option (e.g. --add-prefix). " [20:17:47] yes I did that after: https://phabricator.wikimedia.org/T367009#10193723 [20:18:38] nice, thanks [20:19:00] then I don't know whether that worked cause the FAQ link looks broken [20:19:23] eg https://tly.wikipedia.org/w/index.php?title=Vikipedija:FAQ&action=edit [20:21:29] https://i.imgur.com/D6R1iD3.png [20:21:33] PROBLEM - Hadoop NodeManager on an-worker1176 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:22:53] RECOVERY - Hadoop NodeManager on an-worker1177 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:23:43] 06SRE, 10wikitech.wikimedia.org: Setup LDAP account - https://phabricator.wikimedia.org/T376209 (10WRai-WMF) 03NEW [20:23:43] yeah that looks wrongish 
[20:24:13] at least I see Wikipedija when searching for all pages on https://tly.wikipedia.org/wiki/Xususi:H%C9%99mm%C9%99y_s%C9%99hifon?from=&to=&namespace=4 [20:25:07] but the old title is kept in the cache so that has to be purged [20:25:14] Nemoralis: I have purged a page and it shows up properly now https://tly.wikipedia.org/wiki/Vikipedija:AWB [20:26:09] yes, purging the page fixed it [20:26:14] thanks again [20:26:40] I am sure we had a script to list the pages from namespace [20:27:12] ah found it [20:27:30] I just go to the search bar and type Project: [20:28:28] 06SRE, 10wikitech.wikimedia.org: Setup LDAP account - https://phabricator.wikimedia.org/T376209#10193753 (10RhinosF1) 05Open→03Invalid Hi, You can create an LDAP / Developer account at https://idm.wikimedia.org/signup/ If you need further help, you may want to jump onto the #wikimedia-tech channel on... [20:28:43] !log mwscript purgeList.php --wiki=tlywiki --namespace=4 # T367009 [20:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:46] T367009: Change namespace aliases for Talysh Wikipedia - https://phabricator.wikimedia.org/T367009 [20:29:14] well that is for the CDN [20:29:23] some other cache should be purged :/ [20:30:27] Nemoralis: sorry I forgot how to purge them all :/ [20:30:45] no problem, everything looks good to me [20:33:39] Nemoralis: congratulations! [20:34:00] !log UTC late backport window completed [20:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:13] thanks! 
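Note on the "--fix" exchange above: the quoted docs describe a dry-run-by-default pattern, where the script only prints what it would do unless "--fix" is passed. The following is a minimal Python sketch of that pattern only, not the actual maintenance script. The function names and the "Old:"/"New:" prefixes are hypothetical, chosen for illustration.

```python
import argparse


def plan_renames(titles, old_prefix, new_prefix):
    """Return (old, new) pairs for titles that still carry the old prefix."""
    return [
        (title, new_prefix + title[len(old_prefix):])
        for title in titles
        if title.startswith(old_prefix)
    ]


def run(argv, titles):
    """Report planned renames; apply them only when --fix is given.

    Hypothetical stand-in for a maintenance script: "applying" here just
    records the pair, where a real script would write to the database.
    """
    parser = argparse.ArgumentParser(description="hypothetical rename tool")
    parser.add_argument("--fix", action="store_true",
                        help="actually perform the changes")
    args = parser.parse_args(argv)

    applied = []
    for old, new in plan_renames(titles, "Old:", "New:"):
        if args.fix:
            applied.append((old, new))
            print(f"renamed {old} -> {new}")
        else:
            # Default mode: describe the change without performing it.
            print(f"would rename {old} -> {new}")
    return applied
```

Calling run([], titles) only prints the plan, while run(["--fix"], titles) performs it. This matches the sequence in the log, where the first run listed the renames and a second run with "--fix" applied them.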
[20:35:33] RECOVERY - Hadoop NodeManager on an-worker1176 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:43:43] (03Abandoned) 10BCornwall: R:varnish::wikimedia_vcl: drop undefined metaparameters [puppet] - 10https://gerrit.wikimedia.org/r/735005 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [20:48:58] (03PS1) 10BryanDavis: hieradata: Bump striker-tools to 2024-10-01-204613-production [puppet] - 10https://gerrit.wikimedia.org/r/1077095 (https://phabricator.wikimedia.org/T376190) [20:50:55] (03PS1) 10Bking: dse-k8s: add kube_env config for net-new service [puppet] - 10https://gerrit.wikimedia.org/r/1077096 (https://phabricator.wikimedia.org/T371994) [20:51:01] (03CR) 10BCornwall: "This was intentional per @bblack@wikimedia.org 's suggestion. Discussion was had on the tracker task and was left unresolved: https://phab" [puppet] - 10https://gerrit.wikimedia.org/r/957680 (https://phabricator.wikimedia.org/T323723) (owner: 10Vgutierrez) [20:51:25] (03CR) 10BCornwall: "Not resolved" [puppet] - 10https://gerrit.wikimedia.org/r/957680 (https://phabricator.wikimedia.org/T323723) (owner: 10Vgutierrez) [20:53:05] (03CR) 10BCornwall: [C:03+1] Remove puppetserver1002 from active puppet servers [dns] - 10https://gerrit.wikimedia.org/r/1076899 (https://phabricator.wikimedia.org/T376058) (owner: 10Muehlenhoff) [20:53:16] (03CR) 10Ladsgroup: [C:03+2] hieradata: Bump striker-tools to 2024-10-01-204613-production [puppet] - 10https://gerrit.wikimedia.org/r/1077095 (https://phabricator.wikimedia.org/T376190) (owner: 10BryanDavis) [20:54:15] (03PS2) 10Bking: dse-k8s: add kube_env config for net-new service [puppet] - 10https://gerrit.wikimedia.org/r/1077096 (https://phabricator.wikimedia.org/T371994) [20:56:29] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 
- https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [21:23:43] (03PS1) 10EoghanGaffney: Add ignore_missing_file_errors to releases hosts [puppet] - 10https://gerrit.wikimedia.org/r/1077102 [21:25:29] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4172/co" [puppet] - 10https://gerrit.wikimedia.org/r/1077102 (owner: 10EoghanGaffney) [21:29:17] (03PS1) 10Aleksandar Mastilovic: Adding a Helm chart for HDFS Synchronizer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077106 [21:41:44] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077096 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [21:45:25] (03CR) 10Dzahn: [C:03+2] "ah, right! thanks:)" [puppet] - 10https://gerrit.wikimedia.org/r/1077102 (owner: 10EoghanGaffney) [22:11:35] (03PS1) 10BryanDavis: [DNM] Readme: wrap lines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077108 (https://phabricator.wikimedia.org/T376217) [22:11:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc100[12] - https://phabricator.wikimedia.org/T371987#10194082 (10Jclark-ctr) @jijiki if you get a chance to update site.pp preseed.yaml these can be imaged [22:13:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077108 (https://phabricator.wikimedia.org/T376217) (owner: 10BryanDavis) [22:15:27] (03Abandoned) 10BryanDavis: [DNM] Readme: wrap lines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077108 (https://phabricator.wikimedia.org/T376217) (owner: 10BryanDavis) [22:18:19] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 
AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:18:39] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:18:43] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:18:51] RECOVERY - BGP status on cr1-magru is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:24:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076058 (https://phabricator.wikimedia.org/T375512) (owner: 10BPirkle) [22:25:18] (03PS2) 10Scott French: sre.discovery.datacenter: exclude kartotherian-ssl [cookbooks] - 10https://gerrit.wikimedia.org/r/1075625 (https://phabricator.wikimedia.org/T370962) [22:25:19] (03PS1) 10Scott French: sre.discovery.datacenter: exclude swift-https [cookbooks] - 10https://gerrit.wikimedia.org/r/1077111 (https://phabricator.wikimedia.org/T370962) [22:33:51] (03PS1) 10Volans: sre.wdqs.data-transfer: fix CI [cookbooks] - 10https://gerrit.wikimedia.org/r/1077112 (https://phabricator.wikimedia.org/T364077) [22:38:20] (03CR) 10CI reject: [V:04-1] sre.discovery.datacenter: exclude kartotherian-ssl [cookbooks] - 10https://gerrit.wikimedia.org/r/1075625 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [22:38:20] (03CR) 10CI reject: [V:04-1] sre.discovery.datacenter: exclude swift-https [cookbooks] - 10https://gerrit.wikimedia.org/r/1077111 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [22:46:04] (03CR) 10Volans: [C:03+2] "Self-merging to unblock CI for all the other pending CRs." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1077112 (https://phabricator.wikimedia.org/T364077) (owner: 10Volans) [22:46:21] (03PS3) 10Scott French: sre.discovery.datacenter: exclude kartotherian-ssl [cookbooks] - 10https://gerrit.wikimedia.org/r/1075625 (https://phabricator.wikimedia.org/T370962) [22:46:35] (03PS41) 10CDobbins: sre.dns.roll-restart: add rolling restart script for DNS boxes [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [22:46:44] (03PS3) 10Ryan Kemper: wdqs.data-transfer: refuse xfer on differing jnl [cookbooks] - 10https://gerrit.wikimedia.org/r/1077059 (https://phabricator.wikimedia.org/T364077) [22:48:09] (03CR) 10Scott French: "Thanks, Riccardo!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077112 (https://phabricator.wikimedia.org/T364077) (owner: 10Volans) [22:59:24] (03CR) 10CI reject: [V:04-1] wdqs.data-transfer: refuse xfer on differing jnl [cookbooks] - 10https://gerrit.wikimedia.org/r/1077059 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [23:00:23] (03PS2) 10Scott French: sre.discovery.datacenter: exclude swift-https [cookbooks] - 10https://gerrit.wikimedia.org/r/1077111 (https://phabricator.wikimedia.org/T370962) [23:06:48] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new firewalls - https://phabricator.wikimedia.org/T374176#10194218 (10Papaul) [23:07:37] (03CR) 10C. Scott Ananian: [C:04-2] "needs a parser migration patch first; see I5d145a72aed5f993b6499f7b4e3b9ef07cb45d53" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075635 (owner: 10C. 
Scott Ananian) [23:08:53] PROBLEM - Hadoop NodeManager on an-worker1177 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:14:14] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10194252 (10Papaul) [23:21:33] PROBLEM - Hadoop NodeManager on an-worker1176 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:22:53] RECOVERY - Hadoop NodeManager on an-worker1177 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:25:13] FIRING: [3x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:35:33] RECOVERY - Hadoop NodeManager on an-worker1176 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:38:28] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1077119 [23:38:28] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1077119 (owner: 10TrainBranchBot) [23:42:50] !log zabe@mwmaint2002:~$ cat /home/zabe/s3.txt | 
xargs -I{} bash -c "echo {}; mwscript extensions/WikimediaMaintenance/migrateESRefToContentTable.php {} --skip /home/zabe/text_table_cleanup/{} --dump /home/zabe/text_table_dump/{} --sleep 1" # T183490 [23:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:53] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490