[00:03:47] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1071363 (owner: TrainBranchBot) [00:51:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [01:33:34] PROBLEM - MariaDB Replica Lag: s8 on db2200 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 198040.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:20:57] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [02:36:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:43:48] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers kubernetes2046.codfw.wmnet, parse2001.codfw.wmnet, wikikube-worker2033.codfw.wmnet, wikikube-worker2086.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2017.codfw.wmnet, wikikube-worker2036.codfw.wmnet, wikikube-worker2091.codfw.wmnet, wikikube-worker2076.codfw.wmnet, kubernetes2059.codfw.wmnet, parse2018.c [02:43:48] et, wikikube-worker2044.codfw.wmnet, mw2431.codfw.wmnet, wikikube-worker2027.codfw.wmnet, wikikube-worker2097.codfw.wmnet, wikikube-worker2090.codfw.wmnet, wikikube-worker2055.codfw.wmnet, wikikube-worker2011.codfw.wmnet, mw2449.codfw.wmnet, wikikube-worker2045.codfw.wmnet, mw2413.codfw.wmnet,
mw2356.codfw.wmnet, mw2429.codfw.wmnet, mw2419.codfw.wmnet, wikikube-worker2048.codfw.wmnet, wikikube-worker2056.codfw.wmnet, mw2301.codfw.wmnet, m [02:43:48] dfw.wmnet, wikikube-worker2049.codfw.wmnet, parse2014.codfw.wmnet, parse2008.codfw.wmnet, wikikube-worker2035.codfw.wmnet, mw2376.codfw.wmnet, wikikube-worker2024.codfw.wmnet, mw2426.co https://wikitech.wikimedia.org/wiki/PyBal [02:43:52] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers kubernetes2046.codfw.wmnet, mw2396.codfw.wmnet, parse2001.codfw.wmnet, wikikube-worker2033.codfw.wmnet, parse2006.codfw.wmnet, wikikube-worker2026.codfw.wmnet, wikikube-worker2036.codfw.wmnet, kubernetes2048.codfw.wmnet, wikikube-worker2083.codfw.wmnet, mw2351.codfw.wmnet, wikikube-worker2022.codfw.wmnet, parse2020.codf [02:43:52] wikikube-worker2027.codfw.wmnet, wikikube-worker2023.codfw.wmnet, mw2371.codfw.wmnet, kubernetes2006.codfw.wmnet, parse2013.codfw.wmnet, kubernetes2039.codfw.wmnet, mw2304.codfw.wmnet, mw2449.codfw.wmnet, mw2394.codfw.wmnet, mw2356.codfw.wmnet, wikikube-worker2059.codfw.wmnet, mw2451.codfw.wmnet, mw2399.codfw.wmnet, wikikube-worker2028.codfw.wmnet, wikikube-worker2013.codfw.wmnet, kubernetes2044.codfw.wmnet, kubernetes2051.codfw.wmnet, p [02:43:52] .codfw.wmnet, wikikube-worker2049.codfw.wmnet, mw2376.codfw.wmnet, wikikube-worker2024.codfw.wmnet, wikikube-worker2066.codfw.wmnet, wikikube-worker2003.codfw.wmnet, mw2442.codfw.wmnet, https://wikitech.wikimedia.org/wiki/PyBal [02:44:48] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:44:52] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:00:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:29:48] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers wikikube-worker2079.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2081.codfw.wmnet, mw2443.codfw.wmnet, wikikube-worker2091.codfw.wmnet, wikikube-worker2076.codfw.wmnet, kubernetes2050.codfw.wmnet, wikikube-worker2010.codfw.wmnet, mw2337.codfw.wmnet, wikikube-worker2043.codfw.wmnet, wikikube-worker2041.co [03:29:48] t, mw2359.codfw.wmnet, mw2302.codfw.wmnet, wikikube-worker2089.codfw.wmnet, kubernetes2039.codfw.wmnet, mw2449.codfw.wmnet, wikikube-worker2045.codfw.wmnet, mw2314.codfw.wmnet, mw2440.codfw.wmnet, kubernetes2036.codfw.wmnet, wikikube-worker2075.codfw.wmnet, mw2416.codfw.wmnet, mw2372.codfw.wmnet, parse2014.codfw.wmnet, parse2008.codfw.wmnet, wikikube-worker2024.codfw.wmnet, wikikube-worker2037.codfw.wmnet, mw2450.codfw.wmnet, wikikube-wor [03:29:48] codfw.wmnet, parse2007.codfw.wmnet, mw2369.codfw.wmnet, mw2445.codfw.wmnet, kubernetes2047.codfw.wmnet, mw2335.codfw.wmnet, parse2003.codfw.wmnet, kubernetes2045.codfw.wmnet, kubernetes https://wikitech.wikimedia.org/wiki/PyBal [03:29:52] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers kubernetes2046.codfw.wmnet, kubernetes2056.codfw.wmnet, wikikube-worker2063.codfw.wmnet, kubernetes2048.codfw.wmnet, wikikube-worker2083.codfw.wmnet, wikikube-worker2044.codfw.wmnet, wikikube-worker2031.codfw.wmnet, mw2351.codfw.wmnet, wikikube-worker2022.codfw.wmnet, wikikube-worker2027.codfw.wmnet, wikikube-worker2030 [03:29:52] mnet, mw2352.codfw.wmnet, wikikube-worker2043.codfw.wmnet, mw2398.codfw.wmnet, mw2313.codfw.wmnet, wikikube-worker2090.codfw.wmnet, wikikube-worker2096.codfw.wmnet, wikikube-worker2055.codfw.wmnet, wikikube-worker2062.codfw.wmnet, mw2353.codfw.wmnet, 
wikikube-worker2045.codfw.wmnet, mw2394.codfw.wmnet, mw2429.codfw.wmnet, mw2451.codfw.wmnet, mw2444.codfw.wmnet, kubernetes2049.codfw.wmnet, wikikube-worker2056.codfw.wmnet, mw2301.codfw.wmne [03:29:52] 6.codfw.wmnet, parse2008.codfw.wmnet, mw2371.codfw.wmnet, kubernetes2017.codfw.wmnet, wikikube-worker2094.codfw.wmnet, mw2450.codfw.wmnet, wikikube-worker2100.codfw.wmnet, mw2445.codfw. https://wikitech.wikimedia.org/wiki/PyBal [03:30:50] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:30:52] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:44:58] RECOVERY - Host gerrit1004 is UP: PING WARNING - Packet loss = 77%, RTA = 30.30 ms [03:51:22] PROBLEM - Host gerrit1004 is DOWN: PING CRITICAL - Packet loss = 100% [03:52:52] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers mw2424.codfw.wmnet, kubernetes2046.codfw.wmnet, mw2396.codfw.wmnet, parse2001.codfw.wmnet, kubernetes2056.codfw.wmnet, wikikube-worker2017.codfw.wmnet, wikikube-worker2008.codfw.wmnet, wikikube-worker2026.codfw.wmnet, kubernetes2024.codfw.wmnet, mw2447.codfw.wmnet, kubernetes2052.codfw.wmnet, mw2443.codfw.wmnet, kuberne [03:52:52] codfw.wmnet, wikikube-worker2091.codfw.wmnet, wikikube-worker2040.codfw.wmnet, wikikube-worker2083.codfw.wmnet, mw2315.codfw.wmnet, wikikube-worker2044.codfw.wmnet, parse2020.codfw.wmnet, wikikube-worker2027.codfw.wmnet, kubernetes2006.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikikube-worker2060.codfw.wmnet, wikikube-worker2023.codfw.wmnet, mw2359.codfw.wmnet, wikikube-worker2002.codfw.wmnet, mw2302.codfw.wmnet, wikikube-worker2055.c [03:52:52] et, wikikube-worker2089.codfw.wmnet, wikikube-worker2062.codfw.wmnet, mw2449.codfw.wmnet, mw2394.codfw.wmnet, mw2356.codfw.wmnet, wikikube-worker2059.codfw.wmnet, mw2429.codfw.wmnet, ku 
https://wikitech.wikimedia.org/wiki/PyBal [03:52:54] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers mw2396.codfw.wmnet, parse2001.codfw.wmnet, parse2017.codfw.wmnet, kubernetes2056.codfw.wmnet, wikikube-worker2063.codfw.wmnet, parse2006.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2099.codfw.wmnet, mw2443.codfw.wmnet, parse2003.codfw.wmnet, kubernetes2059.codfw.wmnet, parse2018.codfw.wmnet, mw2431.codf [03:52:54] wikikube-worker2022.codfw.wmnet, mw2425.codfw.wmnet, wikikube-worker2030.codfw.wmnet, kubernetes2006.codfw.wmnet, wikikube-worker2097.codfw.wmnet, wikikube-worker2023.codfw.wmnet, mw2398.codfw.wmnet, wikikube-worker2041.codfw.wmnet, mw2313.codfw.wmnet, wikikube-worker2090.codfw.wmnet, mw2302.codfw.wmnet, wikikube-worker2089.codfw.wmnet, kubernetes2039.codfw.wmnet, mw2449.codfw.wmnet, mw2394.codfw.wmnet, mw2356.codfw.wmnet, mw2314.codfw.w [03:52:54] 2440.codfw.wmnet, mw2419.codfw.wmnet, wikikube-worker2098.codfw.wmnet, mw2451.codfw.wmnet, kubernetes2013.codfw.wmnet, mw2304.codfw.wmnet, kubernetes2036.codfw.wmnet, wikikube-worker204 https://wikitech.wikimedia.org/wiki/PyBal [03:53:54] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:54:52] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:09:54] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers kubernetes2046.codfw.wmnet, wikikube-worker2063.codfw.wmnet, wikikube-worker2026.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2338.codfw.wmnet, parse2009.codfw.wmnet, kubernetes2048.codfw.wmnet, wikikube-worker2091.codfw.wmnet, kubernetes2059.codfw.wmnet, parse2018.codfw.wmnet, mw2351.codfw.wmnet, mw2427.codfw.wmnet, [04:09:54] e-worker2043.codfw.wmnet, wikikube-worker2096.codfw.wmnet, 
wikikube-worker2023.codfw.wmnet, mw2359.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2039.codfw.wmnet, mw2353.codfw.wmnet, mw2449.codfw.wmnet, kubernetes2042.codfw.wmnet, mw2451.codfw.wmnet, parse2012.codfw.wmnet, wikikube-worker2075.codfw.wmnet, wikikube-worker2048.codfw.wmnet, wikikube-worker2087.codfw.wmnet, wikikube-worker2028.codfw.wmnet, kubernetes2044.codfw.wmnet, wik [04:09:54] rker2056.codfw.wmnet, mw2301.codfw.wmnet, mw2417.codfw.wmnet, parse2014.codfw.wmnet, mw2395.codfw.wmnet, wikikube-worker2035.codfw.wmnet, wikikube-worker2024.codfw.wmnet, wikikube-worke https://wikitech.wikimedia.org/wiki/PyBal [04:09:54] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers kubernetes2046.codfw.wmnet, mw2424.codfw.wmnet, parse2001.codfw.wmnet, parse2017.codfw.wmnet, parse2006.codfw.wmnet, wikikube-worker2046.codfw.wmnet, mw2375.codfw.wmnet, mw2443.codfw.wmnet, parse2003.codfw.wmnet, wikikube-worker2077.codfw.wmnet, wikikube-worker2040.codfw.wmnet, kubernetes2050.codfw.wmnet, wikikube-worke [04:09:54] dfw.wmnet, mw2351.codfw.wmnet, mw2427.codfw.wmnet, wikikube-worker2027.codfw.wmnet, wikikube-worker2030.codfw.wmnet, wikikube-worker2096.codfw.wmnet, wikikube-worker2060.codfw.wmnet, mw2398.codfw.wmnet, wikikube-worker2041.codfw.wmnet, wikikube-worker2090.codfw.wmnet, mw2302.codfw.wmnet, kubernetes2039.codfw.wmnet, kubernetes2016.codfw.wmnet, mw2353.codfw.wmnet, wikikube-worker2045.codfw.wmnet, mw2394.codfw.wmnet, mw2314.codfw.wmnet, wiki [04:09:54] ker2059.codfw.wmnet, mw2440.codfw.wmnet, mw2399.codfw.wmnet, mw2444.codfw.wmnet, wikikube-worker2075.codfw.wmnet, kubernetes2049.codfw.wmnet, wikikube-worker2013.codfw.wmnet, kubernetes https://wikitech.wikimedia.org/wiki/PyBal [04:10:54] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:10:54] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All 
pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:39:42] RECOVERY - MariaDB Replica Lag: s8 on db2200 is OK: OK slave_sql_lag Replication lag: 0.50 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:51:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [05:29:58] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers mw2424.codfw.wmnet, mw2375.codfw.wmnet, wikikube-worker2036.codfw.wmnet, kubernetes2052.codfw.wmnet, wikikube-worker2076.codfw.wmnet, wikikube-worker2040.codfw.wmnet, wikikube-worker2071.codfw.wmnet, kubernetes2050.codfw.wmnet, wikikube-worker2022.codfw.wmnet, mw2352.codfw.wmnet, wikikube-worker2096.codfw.wmnet, wikikub [05:29:58] 2060.codfw.wmnet, wikikube-worker2023.codfw.wmnet, mw2398.codfw.wmnet, wikikube-worker2089.codfw.wmnet, wikikube-worker2045.codfw.wmnet, mw2419.codfw.wmnet, wikikube-worker2028.codfw.wmnet, kubernetes2051.codfw.wmnet, mw2301.codfw.wmnet, mw2416.codfw.wmnet, wikikube-worker2035.codfw.wmnet, wikikube-worker2031.codfw.wmnet, mw2442.codfw.wmnet, wikikube-worker2037.codfw.wmnet, parse2002.codfw.wmnet, wikikube-worker2085.codfw.wmnet, wikikube- [05:29:58] 08.codfw.wmnet, kubernetes2005.codfw.wmnet, mw2305.codfw.wmnet, mw2366.codfw.wmnet, kubernetes2045.codfw.wmnet, wikikube-worker2019.codfw.wmnet, wikikube-worker2051.codfw.wmnet, mw2418. 
https://wikitech.wikimedia.org/wiki/PyBal [05:29:58] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers mw2424.codfw.wmnet, mw2396.codfw.wmnet, wikikube-worker2077.codfw.wmnet, wikikube-worker2091.codfw.wmnet, wikikube-worker2040.codfw.wmnet, parse2018.codfw.wmnet, wikikube-worker2071.codfw.wmnet, wikikube-worker2027.codfw.wmnet, wikikube-worker2043.codfw.wmnet, mw2302.codfw.wmnet, parse2013.codfw.wmnet, kubernetes2039.co [05:29:58] t, mw2368.codfw.wmnet, parse2012.codfw.wmnet, wikikube-worker2028.codfw.wmnet, wikikube-worker2056.codfw.wmnet, mw2301.codfw.wmnet, parse2008.codfw.wmnet, wikikube-worker2024.codfw.wmnet, wikikube-worker2037.codfw.wmnet, wikikube-worker2012.codfw.wmnet, parse2007.codfw.wmnet, mw2374.codfw.wmnet, mw2445.codfw.wmnet, kubernetes2047.codfw.wmnet, mw2335.codfw.wmnet, mw2337.codfw.wmnet, wikikube-worker2039.codfw.wmnet, kubernetes2060.codfw.wmn [05:29:58] kube-worker2019.codfw.wmnet, mw2282.codfw.wmnet, wikikube-worker2064.codfw.wmnet, kubernetes2052.codfw.wmnet, kubernetes2043.codfw.wmnet, parse2015.codfw.wmnet, kubernetes2038.codfw.wmn https://wikitech.wikimedia.org/wiki/PyBal [05:30:58] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:30:58] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:57:45] (Abandoned) Stang: arwiki: Remove entries from wgSemiprotectedRestrictionLevels [mediawiki-config] - https://gerrit.wikimedia.org/r/1048867 (https://phabricator.wikimedia.org/T368207) (owner: Stang) [06:03:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers kubernetes2046.codfw.wmnet, wikikube-worker2021.codfw.wmnet, wikikube-worker2033.codfw.wmnet, parse2017.codfw.wmnet, wikikube-worker2102.codfw.wmnet,
wikikube-worker2017.codfw.wmnet, mw2375.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2338.codfw.wmnet, mw2447.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2048.co [06:03:00] t, parse2003.codfw.wmnet, wikikube-worker2076.codfw.wmnet, wikikube-worker2044.codfw.wmnet, wikikube-worker2022.codfw.wmnet, parse2020.codfw.wmnet, mw2425.codfw.wmnet, wikikube-worker2030.codfw.wmnet, kubernetes2042.codfw.wmnet, kubernetes2006.codfw.wmnet, wikikube-worker2097.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikikube-worker2041.codfw.wmnet, mw2359.codfw.wmnet, wikikube-worker2002.codfw.wmnet, mw2302.codfw.wmnet, parse2013.cod [06:03:00] , wikikube-worker2062.codfw.wmnet, mw2353.codfw.wmnet, mw2449.codfw.wmnet, wikikube-worker2045.codfw.wmnet, mw2413.codfw.wmnet, mw2314.codfw.wmnet, wikikube-worker2059.codfw.wmnet, mw24 https://wikitech.wikimedia.org/wiki/PyBal [06:04:00] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers wikikube-worker2021.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2338.codfw.wmnet, mw2370.codfw.wmnet, kubernetes2014.codfw.wmnet, mw2443.codfw.wmnet, wikikube-worker2076.codfw.wmnet, kubernetes2059.codfw.wmnet, mw2315.codfw.wmnet, wikikube-worker2010. 
[06:04:00] net, wikikube-worker2030.codfw.wmnet, mw2419.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikikube-worker2060.codfw.wmnet, mw2398.codfw.wmnet, mw2352.codfw.wmnet, wikikube-worker2088.codfw.wmnet, wikikube-worker2002.codfw.wmnet, wikikube-worker2090.codfw.wmnet, wikikube-worker2014.codfw.wmnet, parse2013.codfw.wmnet, mw2353.codfw.wmnet, mw2413.codfw.wmnet, mw2356.codfw.wmnet, mw2314.codfw.wmnet, wikikube-worker2059.codfw.wmnet, kubernetes [06:04:00] fw.wmnet, mw2355.codfw.wmnet, parse2012.codfw.wmnet, kubernetes2036.codfw.wmnet, wikikube-worker2101.codfw.wmnet, kubernetes2049.codfw.wmnet, wikikube-worker2013.codfw.wmnet, mw2416.cod https://wikitech.wikimedia.org/wiki/PyBal [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:06:00] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:06:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:20:57] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [06:30:58] PROBLEM - PyBal backends 
health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers kubernetes2046.codfw.wmnet, mw2424.codfw.wmnet, mw2396.codfw.wmnet, parse2001.codfw.wmnet, wikikube-worker2033.codfw.wmnet, parse2017.codfw.wmnet, wikikube-worker2086.codfw.wmnet, wikikube-worker2063.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2026.codfw.wmnet, kubernetes2024.codfw.wmnet, mw2447.codfw.w [06:30:58] bernetes2052.codfw.wmnet, kubernetes2014.codfw.wmnet, mw2443.codfw.wmnet, kubernetes2048.codfw.wmnet, parse2003.codfw.wmnet, wikikube-worker2076.codfw.wmnet, kubernetes2059.codfw.wmnet, wikikube-worker2040.codfw.wmnet, wikikube-worker2083.codfw.wmnet, mw2315.codfw.wmnet, wikikube-worker2071.codfw.wmnet, kubernetes2050.codfw.wmnet, wikikube-worker2010.codfw.wmnet, mw2431.codfw.wmnet, parse2020.codfw.wmnet, wikikube-worker2027.codfw.wmnet, [06:30:58] -worker2030.codfw.wmnet, kubernetes2006.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikikube-worker2008.codfw.wmnet, wikikube-worker2023.codfw.wmnet, wikikube-worker2041.codfw.wmnet, https://wikitech.wikimedia.org/wiki/PyBal [06:30:58] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers mw2424.codfw.wmnet, wikikube-worker2021.codfw.wmnet, wikikube-worker2033.codfw.wmnet, parse2017.codfw.wmnet, wikikube-worker2086.codfw.wmnet, wikikube-worker2063.codfw.wmnet, wikikube-worker2081.codfw.wmnet, mw2375.codfw.wmnet, mw2427.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2338.codfw.wmnet, parse2009.codfw.wmne [06:30:58] 0.codfw.wmnet, wikikube-worker2084.codfw.wmnet, mw2368.codfw.wmnet, kubernetes2014.codfw.wmnet, mw2443.codfw.wmnet, kubernetes2048.codfw.wmnet, parse2003.codfw.wmnet, wikikube-worker2076.codfw.wmnet, wikikube-worker2040.codfw.wmnet, mw2315.codfw.wmnet, wikikube-worker2071.codfw.wmnet, parse2004.codfw.wmnet, wikikube-worker2044.codfw.wmnet, wikikube-worker2010.codfw.wmnet, mw2351.codfw.wmnet, kubernetes2056.codfw.wmnet, 
wikikube-worker2007 [06:30:58] mnet, wikikube-worker2039.codfw.wmnet, wikikube-worker2027.codfw.wmnet, wikikube-worker2030.codfw.wmnet, wikikube-worker2096.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikikube-worke https://wikitech.wikimedia.org/wiki/PyBal [06:32:00] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:32:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:54:02] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers mw2396.codfw.wmnet, wikikube-worker2063.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2017.codfw.wmnet, wikikube-worker2026.codfw.wmnet, kubernetes2024.codfw.wmnet, mw2338.codfw.wmnet, mw2447.codfw.wmnet, mw2370.codfw.wmnet, wikikube-worker2099.codfw.wmnet, kubernetes2014.codfw.wmnet, mw2443.codfw.wmnet, [06:54:02] 3.codfw.wmnet, kubernetes2059.codfw.wmnet, wikikube-worker2083.codfw.wmnet, wikikube-worker2010.codfw.wmnet, mw2431.codfw.wmnet, mw2351.codfw.wmnet, parse2020.codfw.wmnet, mw2352.codfw.wmnet, kubernetes2006.codfw.wmnet, wikikube-worker2065.codfw.wmnet, mw2398.codfw.wmnet, wikikube-worker2041.codfw.wmnet, mw2359.codfw.wmnet, wikikube-worker2002.codfw.wmnet, mw2313.codfw.wmnet, mw2302.codfw.wmnet, wikikube-worker2096.codfw.wmnet, wikikube-w [06:54:02] 5.codfw.wmnet, wikikube-worker2089.codfw.wmnet, kubernetes2039.codfw.wmnet, mw2449.codfw.wmnet, mw2356.codfw.wmnet, mw2314.codfw.wmnet, wikikube-worker2059.codfw.wmnet, mw2429.codfw.wmn https://wikitech.wikimedia.org/wiki/PyBal [06:56:02] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:56:20] !log installing aom security updates [06:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:04] Amir1 and Urbanecm: 
#bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240909T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:00:50] (PS1) Slyngshede: P:idp Limit groups sent from CAS to Turnilo. [puppet] - https://gerrit.wikimedia.org/r/1071470 (https://phabricator.wikimedia.org/T369205) [07:03:04] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1071470 (https://phabricator.wikimedia.org/T369205) (owner: Slyngshede) [07:03:29] (CR) Slyngshede: [C:+2] P:idp Limit groups sent from CAS to Turnilo. [puppet] - https://gerrit.wikimedia.org/r/1071470 (https://phabricator.wikimedia.org/T369205) (owner: Slyngshede) [07:06:27] !log roll out debmonitor-client 0.4.0-2+deb11u1 on bullseye hosts [07:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:02] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:16:10] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:17:40] !log installing Linux 5.10.223 on bullseye hosts [07:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:07] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage2001.codfw.wmnet [07:33:09] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage2001.codfw.wmnet [07:33:55] !log jayme@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host kubestage2002.codfw.wmnet [07:33:57] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubestage2002.codfw.wmnet [07:34:04] SRE, serviceops, Patch-For-Review: Re-IP
wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10128731 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by jayme@cumin1002 Renumbering for host kubestage2002.codfw.wmnet [07:34:04] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:36:54] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubestage2002.codfw.wmnet [07:37:14] (PS25) Slyngshede: P:mirrors::debian Export mirror age to textfile exporter [puppet] - https://gerrit.wikimedia.org/r/1003442 [07:37:18] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestage2002.codfw.wmnet with OS bookworm [07:37:28] SRE, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10128732 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestage2002.codfw.wmnet with OS bookworm [07:37:38] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubestage2002.codfw.wmnet with OS bookworm [07:37:38] !log jayme@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host kubestage2002.codfw.wmnet [07:37:49] SRE, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10128733 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestage2002.codfw.wmnet with OS bookworm executed...
[07:37:50] SRE, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10128734 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jayme@cumin1002 Renumbering for host kubestage2002.codfw.wmnet completed:... [07:39:20] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestage2002.codfw.wmnet with OS bookworm [07:42:03] SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure: Requesting access to `contint-admins`, `contint-docker` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10128736 (ArthurTaylor) @WMDE-leszek I guess Kara is away right now. Can you approve on her behalf? [07:42:07] (PS26) Slyngshede: P:mirrors::debian Export mirror age to textfile exporter [puppet] - https://gerrit.wikimedia.org/r/1003442 [07:42:15] (PS1) Elukey: sre.hosts.provision: enable virtualization for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1071553 (https://phabricator.wikimedia.org/T365372) [07:43:02] (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3908/co" [puppet] - https://gerrit.wikimedia.org/r/1003442 (owner: Slyngshede) [07:51:08] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 29 hosts with reason: Primary switchover s7 T373175 [07:51:11] T373175: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T373175 [07:51:32] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 29 hosts with reason: Primary switchover s7 T373175 [07:51:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2218 with weight 0 T373175', diff saved to https://phabricator.wikimedia.org/P68733 and previous config saved to /var/cache/conftool/dbconfig/20240909-075145-arnaudb.json [07:52:58] !log arnaudb@cumin1002
dbctl commit (dc=all): 'Remove db2218 from API/vslow/dump T373175', diff saved to https://phabricator.wikimedia.org/P68734 and previous config saved to /var/cache/conftool/dbconfig/20240909-075258-arnaudb.json [07:58:42] !log installing openssl security updates [07:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:14] (CR) Arnaudb: [C:+2] mariadb: Promote db2218 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1065134 (https://phabricator.wikimedia.org/T373175) (owner: Gerrit maintenance bot) [07:59:48] (PS2) Brouberol: dse-k8s-eqiad: Enforce the `restricted` PSS for all namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1069944 (https://phabricator.wikimedia.org/T369492) [08:00:09] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2002.codfw.wmnet with reason: host reimage [08:00:29] !log Starting s7 codfw failover from db2220 to db2218 - T373175 [08:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:32] T373175: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T373175 [08:01:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2218 to s7 primary T373175', diff saved to https://phabricator.wikimedia.org/P68735 and previous config saved to /var/cache/conftool/dbconfig/20240909-080108-arnaudb.json [08:02:27] SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure: Requesting access to `contint-admins`, `contint-docker`, LDAP `ciadmin` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10128815 (hashar) [08:04:15] (CR) Brouberol: [C:+2] dse-k8s-eqiad: Enforce the `restricted` PSS for all namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1069944 (https://phabricator.wikimedia.org/T369492) (owner: Brouberol) [08:04:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 T373175', diff saved to https://phabricator.wikimedia.org/P68736 and
previous config saved to /var/cache/conftool/dbconfig/20240909-080422-arnaudb.json [08:04:24] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2002.codfw.wmnet with reason: host reimage [08:05:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 T373175', diff saved to https://phabricator.wikimedia.org/P68737 and previous config saved to /var/cache/conftool/dbconfig/20240909-080558-arnaudb.json [08:06:01] T373175: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T373175 [08:07:54] (CR) David Caro: [V:+1] spicerack: allow running by non-ops (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1067301 (owner: David Caro) [08:08:49] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 32 hosts with reason: Primary switchover s4 T373330 [08:08:52] T373330: Switchover s4 master (db2140 -> db2179) - https://phabricator.wikimedia.org/T373330 [08:09:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s4 T373330 [08:09:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2179 with weight 0 T373330', diff saved to https://phabricator.wikimedia.org/P68738 and previous config saved to /var/cache/conftool/dbconfig/20240909-080935-arnaudb.json [08:09:45] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:09:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Remove db2179 from API/vslow/dump T373330', diff saved to https://phabricator.wikimedia.org/P68739 and previous config saved to /var/cache/conftool/dbconfig/20240909-080956-arnaudb.json [08:09:59] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[08:12:23] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure: Requesting access to `contint-admins`, `contint-docker`, LDAP `ciadmin` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10128850 (10WMDE-leszek) I approve on WMDE's behalf. Thank you [08:12:26] (03PS1) 10Slyngshede: data.yaml: Extend NDA for ncreasy [puppet] - 10https://gerrit.wikimedia.org/r/1071558 [08:15:24] jouncebot: next [08:15:24] In 1 hour(s) and 44 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240909T1000) [08:15:53] (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db2179 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1066753 (https://phabricator.wikimedia.org/T373330) (owner: 10Gerrit maintenance bot) [08:16:45] (03CR) 10Muehlenhoff: data.yaml: Extend NDA for ncreasy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071558 (owner: 10Slyngshede) [08:17:12] !log Starting s4 codfw failover from db2140 to db2179 - T373330 [08:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:15] T373330: Switchover s4 master (db2140 -> db2179) - https://phabricator.wikimedia.org/T373330 [08:17:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2179 to s4 primary T373330', diff saved to https://phabricator.wikimedia.org/P68740 and previous config saved to /var/cache/conftool/dbconfig/20240909-081750-arnaudb.json [08:20:01] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host dragonfly-supernode2001.codfw.wmnet with OS bookworm [08:20:09] 06SRE, 06serviceops: Migrate dragonfly-supernodes to Bookworm - https://phabricator.wikimedia.org/T332011#10128871 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin2002 for host dragonfly-supernode2001.codfw.wmnet with OS bookworm [08:20:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Reconfig db2140 T373330', diff saved to https://phabricator.wikimedia.org/P68741 
and previous config saved to /var/cache/conftool/dbconfig/20240909-082053-arnaudb.json [08:22:51] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage2002.codfw.wmnet with OS bookworm [08:23:32] (03PS1) 10JMeybohm: kafka/roll-restart-reboot-brokers: Add exclude and no-election options [cookbooks] - 10https://gerrit.wikimedia.org/r/1071559 (https://phabricator.wikimedia.org/T373189) [08:23:53] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage2002.codfw.wmnet [08:23:55] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage2002.codfw.wmnet [08:25:03] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2320.codfw.wmnet [08:25:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2320.codfw.wmnet [08:25:10] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2321.codfw.wmnet [08:25:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2321.codfw.wmnet [08:25:18] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2322.codfw.wmnet [08:25:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2322.codfw.wmnet [08:25:20] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10128883 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2320.codfw.wmnet [08:25:21] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10128884 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2321.codfw.wmnet [08:25:27] (03CR) 10JMeybohm: [C:03+1] dse-k8s-eqiad: 
Disable PSP [puppet] - 10https://gerrit.wikimedia.org/r/1069945 (https://phabricator.wikimedia.org/T369492) (owner: 10Brouberol) [08:25:28] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10128885 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2322.codfw.wmnet [08:25:34] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2332.codfw.wmnet [08:25:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2332.codfw.wmnet [08:25:44] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10128886 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2332.codfw.wmnet [08:25:57] (03CR) 10JMeybohm: [C:03+1] deployment_server: add wikidata-query-gui service [puppet] - 10https://gerrit.wikimedia.org/r/1071075 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [08:26:12] (03CR) 10JMeybohm: [C:03+1] sre.k8s.renumber-node: Refactor logging and error handling [cookbooks] - 10https://gerrit.wikimedia.org/r/1070904 (owner: 10Clément Goubert) [08:26:13] FIRING: JobUnavailable: Reduced availability for job dragonfly_supernode in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:26:44] (03CR) 10JMeybohm: [C:03+1] sre.k8s.renumber-node: Run puppet on registry (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1070922 (owner: 10Clément Goubert) [08:26:46] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: kubernetes2031.codfw.wmnet [08:26:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts 
(exit_code=0) for 1 hosts: kubernetes2031.codfw.wmnet [08:26:59] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10128888 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: kubernetes2031.codfw.wmnet [08:27:09] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: kubernetes2034.codfw.wmnet [08:27:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: kubernetes2034.codfw.wmnet [08:27:21] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10128889 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: kubernetes2034.codfw.wmnet [08:27:52] (03PS1) 10Elukey: Update the Debian changelog to build on Bookworm [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/1071561 (https://phabricator.wikimedia.org/T331969) [08:29:39] (03CR) 10CI reject: [V:04-1] Update the Debian changelog to build on Bookworm [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/1071561 (https://phabricator.wikimedia.org/T331969) (owner: 10Elukey) [08:30:01] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10128909 (10MoritzMuehlenhoff) Something went wrong with the 2430 rename, it's still showing up in Puppetboard: https://puppetboard.wikimedia.org/node/mw2430... 
[08:31:07] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: kubernetes2008.codfw.wmnet [08:31:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: kubernetes2008.codfw.wmnet [08:31:19] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10128913 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: kubernetes2008.codfw.wmnet [08:31:29] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: kubernetes2010.codfw.wmnet [08:31:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: kubernetes2010.codfw.wmnet [08:31:41] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10128914 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: kubernetes2010.codfw.wmnet [08:31:49] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: kubernetes2018.codfw.wmnet [08:31:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: kubernetes2018.codfw.wmnet [08:32:02] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10128915 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: kubernetes2018.codfw.wmnet [08:32:11] (03CR) 10Elukey: "The helm dependency is not present on Bookworm, but helm3 is present in bullseye (so probably it can be copied over in case). 
Not sure if " [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/1071561 (https://phabricator.wikimedia.org/T331969) (owner: 10Elukey) [08:33:57] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: kubernetes2025.codfw.wmnet [08:33:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: kubernetes2025.codfw.wmnet [08:34:14] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10128919 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: kubernetes2025.codfw.wmnet [08:34:14] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: kubernetes2027.codfw.wmnet [08:34:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: kubernetes2027.codfw.wmnet [08:34:24] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10128925 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: kubernetes2027.codfw.wmnet [08:34:31] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: kubernetes2028.codfw.wmnet [08:34:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: kubernetes2028.codfw.wmnet [08:34:37] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: kubernetes2029.codfw.wmnet [08:34:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: kubernetes2029.codfw.wmnet [08:34:46] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10128926 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: 
kubernetes2028.codfw.wmnet [08:34:49] 06SRE, 06serviceops, 13Patch-For-Review: Migrate chartmuseum to bullseye - https://phabricator.wikimedia.org/T331969#10128924 (10elukey) Tried to file a patch but I realized that we don't have the `helm` package for Bookworm/Bullseye, so the build fails. I am wondering if the current version of chartmuseum r... [08:34:50] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10128927 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: kubernetes2029.codfw.wmnet [08:35:06] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: kubernetes2033.codfw.wmnet [08:35:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: kubernetes2033.codfw.wmnet [08:35:14] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: kubernetes2035.codfw.wmnet [08:35:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: kubernetes2035.codfw.wmnet [08:35:16] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10128929 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: kubernetes2033.codfw.wmnet [08:35:25] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10128930 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: kubernetes2035.codfw.wmnet [08:35:37] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host lists2001.wikimedia.org [08:35:40] (03PS2) 10JMeybohm: kafka/roll-restart-reboot-brokers: Add exclude and no-election options [cookbooks] - 10https://gerrit.wikimedia.org/r/1071559 
(https://phabricator.wikimedia.org/T373189) [08:35:44] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: kubernetes2055.codfw.wmnet [08:35:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: kubernetes2055.codfw.wmnet [08:35:51] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: kubernetes2057.codfw.wmnet [08:35:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: kubernetes2057.codfw.wmnet [08:36:03] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10128931 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: kubernetes2055.codfw.wmnet [08:36:06] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10128933 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: kubernetes2057.codfw.wmnet [08:36:41] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: kubernetes2055.codfw.wmnet [08:36:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: kubernetes2055.codfw.wmnet [08:36:51] !log elukey@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dragonfly-supernode2001.codfw.wmnet with reason: host reimage [08:36:56] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10128936 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: kubernetes2055.codfw.wmnet [08:37:00] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: kubernetes2054.codfw.wmnet [08:37:01] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: kubernetes2054.codfw.wmnet [08:37:15] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10128944 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: kubernetes2054.codfw.wmnet [08:38:04] (03CR) 10AikoChou: [C:03+1] ml-services: re-deploy prod articlequality and update staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071232 (https://phabricator.wikimedia.org/T360455) (owner: 10Ilias Sarantopoulos) [08:38:24] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Primary switchover s5 T374086 [08:38:27] T374086: Switchover s5 master (db2123 -> db2213) - https://phabricator.wikimedia.org/T374086 [08:38:45] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s5 T374086 [08:39:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Remove db2213 from API/vslow/dump T374086', diff saved to https://phabricator.wikimedia.org/P68742 and previous config saved to /var/cache/conftool/dbconfig/20240909-083910-arnaudb.json [08:40:06] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dragonfly-supernode2001.codfw.wmnet with reason: host reimage [08:41:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lists2001.wikimedia.org [08:45:25] (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db2213 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1070870 (https://phabricator.wikimedia.org/T374086) (owner: 10Gerrit maintenance bot) [08:46:44] 06SRE, 06serviceops, 13Patch-For-Review: Migrate chartmuseum to bullseye - https://phabricator.wikimedia.org/T331969#10129014 (10JMeybohm) helm3 should be fine. 
It might as well be that the build-dependency is not required as we're doing a full vendor anyways. I don't recall why the it is there, sorry [08:47:15] !log Starting s5 codfw failover from db2123 to db2213 - T374086 [08:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:18] T374086: Switchover s5 master (db2123 -> db2213) - https://phabricator.wikimedia.org/T374086 [08:48:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2213 to s5 primary T374086', diff saved to https://phabricator.wikimedia.org/P68743 and previous config saved to /var/cache/conftool/dbconfig/20240909-084810-arnaudb.json [08:49:34] (03CR) 10CI reject: [V:04-1] kafka/roll-restart-reboot-brokers: Add exclude and no-election options [cookbooks] - 10https://gerrit.wikimedia.org/r/1071559 (https://phabricator.wikimedia.org/T373189) (owner: 10JMeybohm) [08:49:37] (03CR) 10Muehlenhoff: [C:03+2] Extend MX Cumin aliases for new postfix roles [puppet] - 10https://gerrit.wikimedia.org/r/1070967 (owner: 10Muehlenhoff) [08:50:26] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: re-deploy prod articlequality and update staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071232 (https://phabricator.wikimedia.org/T360455) (owner: 10Ilias Sarantopoulos) [08:50:43] RESOLVED: JobUnavailable: Reduced availability for job dragonfly_supernode in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:51:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'API/vslow/dump T374086', diff saved to https://phabricator.wikimedia.org/P68744 and previous config saved to /var/cache/conftool/dbconfig/20240909-085122-arnaudb.json [08:51:37] (03Merged) 10jenkins-bot: ml-services: re-deploy prod articlequality and update staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071232 
(https://phabricator.wikimedia.org/T360455) (owner: 10Ilias Sarantopoulos) [08:51:57] (03PS3) 10JMeybohm: afka/roll-restart-reboot-brokers: Add exclude and no-election options [cookbooks] - 10https://gerrit.wikimedia.org/r/1071559 (https://phabricator.wikimedia.org/T373189) [08:51:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:54:30] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dragonfly-supernode2001.codfw.wmnet with OS bookworm [08:54:36] 06SRE, 06serviceops: Migrate dragonfly-supernodes to Bookworm - https://phabricator.wikimedia.org/T332011#10129047 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin2002 for host dragonfly-supernode2001.codfw.wmnet with OS bookworm completed: - dragonfly-supernode2001 (**PASS**... [08:56:16] 06SRE, 06serviceops: Migrate dragonfly-supernodes to Bookworm - https://phabricator.wikimedia.org/T332011#10129052 (10elukey) First node reimaged! Everything looks good afaics. Next steps: * Wait some MW deployments to make sure that nothing unexpected pops up. * Reimage the eqiad VM. 
[08:56:56] (03PS2) 10Slyngshede: data.yaml: Extend NDA for ncreasy [puppet] - 10https://gerrit.wikimedia.org/r/1071558 [08:57:38] (03PS3) 10Slyngshede: data.yaml: Extend NDA for ncreasy [puppet] - 10https://gerrit.wikimedia.org/r/1071558 [08:57:50] !log restarting postfix on mx-in/mx-out to pick up openssl updates [08:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:09] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1071558 (owner: 10Slyngshede) [08:58:26] (03CR) 10Slyngshede: data.yaml: Extend NDA for ncreasy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071558 (owner: 10Slyngshede) [08:58:52] (03CR) 10Slyngshede: [C:03+2] data.yaml: Extend NDA for ncreasy [puppet] - 10https://gerrit.wikimedia.org/r/1071558 (owner: 10Slyngshede) [09:00:53] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10129063 (10Jelto) I depooled `gitlab-runner2003` for tomorrows maintenance [09:01:18] (03CR) 10Hnowlan: [C:03+1] kubernetes: rename mw2431 to wikikube-worker2104 [puppet] - 10https://gerrit.wikimedia.org/r/1071246 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková) [09:01:22] (03CR) 10Slyngshede: [C:03+2] P:idm: Add ecdsa-sha2-nistp256 to allowed key types. 
[puppet] - 10https://gerrit.wikimedia.org/r/1071123 (https://phabricator.wikimedia.org/T371956) (owner: 10Slyngshede) [09:01:40] (03PS2) 10Hnowlan: Enable Copyupload-allowed-domains on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070948 (https://phabricator.wikimedia.org/T356241) [09:01:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070948 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [09:03:21] (03PS1) 10Muehlenhoff: mx: Enable profile::auto_restarts::service for rspamd [puppet] - 10https://gerrit.wikimedia.org/r/1071564 (https://phabricator.wikimedia.org/T135991) [09:04:46] (03CR) 10CI reject: [V:04-1] afka/roll-restart-reboot-brokers: Add exclude and no-election options [cookbooks] - 10https://gerrit.wikimedia.org/r/1071559 (https://phabricator.wikimedia.org/T373189) (owner: 10JMeybohm) [09:06:47] (03PS6) 10Stevemunene: Configure prometheus metrics on the cephosd cluster [puppet] - 10https://gerrit.wikimedia.org/r/1070142 (https://phabricator.wikimedia.org/T369583) [09:07:04] !log [urbanecm@mwmaint1002 ~]$ mwscript namespaceDupes.php --wiki=mnwiki --add-prefix=BROKEN --fix # T366271 [09:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:07] T366271: Change Wikipedia: and Wikipedia_talk: namespaces for Mongolian (for Mongolian Wikipedia) - https://phabricator.wikimedia.org/T366271 [09:10:34] (03CR) 10Btullis: [C:03+1] Configure prometheus metrics on the cephosd cluster [puppet] - 10https://gerrit.wikimedia.org/r/1070142 (https://phabricator.wikimedia.org/T369583) (owner: 10Stevemunene) [09:15:00] FIRING: [2x] ProbeDown: Service idm2001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:16:41] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:17:41] (03PS2) 10Slyngshede: P:idp Prometheus blackbox monitoring for IDP. [puppet] - 10https://gerrit.wikimedia.org/r/1071131 (https://phabricator.wikimedia.org/T3670655) [09:18:36] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet - https://phabricator.wikimedia.org/T372817#10129107 (10MoritzMuehlenhoff) @Dzahn gerrit1004 is still in puppetdb: https://puppetboard.wikimedia.org/catalog/gerrit1004.wikimedia.org [09:18:39] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: gerrit1004.wikimedia.org [09:18:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: gerrit1004.wikimedia.org [09:18:43] (03CR) 10Slyngshede: P:idp Prometheus blackbox monitoring for IDP. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071131 (https://phabricator.wikimedia.org/T3670655) (owner: 10Slyngshede) [09:18:43] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet - https://phabricator.wikimedia.org/T372817#10129108 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: gerrit1004.wikimedia.org [09:19:17] (03CR) 10Arnaudb: [C:03+2] mariadb: productionize db2227 [puppet] - 10https://gerrit.wikimedia.org/r/1070946 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [09:19:56] (03PS4) 10JMeybohm: afka/roll-restart-reboot-brokers: Add exclude and no-election options [cookbooks] - 10https://gerrit.wikimedia.org/r/1071559 (https://phabricator.wikimedia.org/T373189) [09:21:28] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: provisionning db2227.codfw.wmnet - T373579 [09:21:31] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579 [09:21:42] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: provisionning db2227.codfw.wmnet - T373579 [09:21:45] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2227.codfw.wmnet with reason: provisionning db2227.codfw.wmnet - T373579 [09:21:59] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2227.codfw.wmnet with reason: provisionning db2227.codfw.wmnet - T373579 [09:22:54] (03PS1) 10Seanleong-wmde: Revert "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071566 [09:24:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2127 in db2227 for T373579', diff saved to https://phabricator.wikimedia.org/P68745 and previous config saved to /var/cache/conftool/dbconfig/20240909-092404-arnaudb.json [09:25:00] FIRING: [4x] ProbeDown: Service idm1001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:25:25] FIRING: SystemdUnitFailed: rq-bitu.service on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:25:34] (03CR) 10David Caro: [V:03+1 C:03+2] prometheus::cloud: add maintaindbusers target [puppet] - 10https://gerrit.wikimedia.org/r/1070206 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [09:25:36] !log removing libssl1.1 from prometheus hosts which were dist-upgraded from bullseye to bookworm [09:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:27] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2127.codfw.wmnet onto db2227.codfw.wmnet [09:27:30] (03PS5) 10JMeybohm: afka/roll-restart-reboot-brokers: Add exclude and no-election options [cookbooks] - 10https://gerrit.wikimedia.org/r/1071559 (https://phabricator.wikimedia.org/T373189) [09:28:58] (03CR) 10Jelto: [C:03+2] deployment_server: add wikidata-query-gui service [puppet] - 10https://gerrit.wikimedia.org/r/1071075 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [09:29:55] (03PS1) 10Muehlenhoff: standard_packages: Remove more obsolete packages after bullseye->bookworm update [puppet] - 10https://gerrit.wikimedia.org/r/1071569 [09:30:25] FIRING: [2x] 
SystemdUnitFailed: rq-bitu.service on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:32:25] (03CR) 10Filippo Giunchedi: [C:03+1] standard_packages: Remove more obsolete packages after bullseye->bookworm update [puppet] - 10https://gerrit.wikimedia.org/r/1071569 (owner: 10Muehlenhoff) [09:32:26] (03PS2) 10Seanleong-wmde: Revert "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071566 [09:33:31] (03PS1) 10Arnaudb: mariadb: productionize db2223 [puppet] - 10https://gerrit.wikimedia.org/r/1071570 (https://phabricator.wikimedia.org/T373579) [09:33:50] (03PS1) 10Brouberol: cloudnative-pg: upgrade operator to v1.24.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071571 (https://phabricator.wikimedia.org/T369492) [09:34:45] (03CR) 10Btullis: [C:03+1] cloudnative-pg: upgrade operator to v1.24.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071571 (https://phabricator.wikimedia.org/T369492) (owner: 10Brouberol) [09:35:07] (03CR) 10Hnowlan: changeprop: Enable PCS pregeneration without restbase (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) (owner: 10Jgiannelos) [09:35:59] (03PS1) 10Slyngshede: P:idm missing comma in settings. [puppet] - 10https://gerrit.wikimedia.org/r/1071572 [09:36:49] (03CR) 10Slyngshede: [C:03+2] P:idm missing comma in settings. [puppet] - 10https://gerrit.wikimedia.org/r/1071572 (owner: 10Slyngshede) [09:37:35] (03CR) 10Brouberol: [C:03+2] cloudnative-pg: upgrade operator to v1.24.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071571 (https://phabricator.wikimedia.org/T369492) (owner: 10Brouberol) [09:38:32] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[09:38:33] (03CR) 10Filippo Giunchedi: [C:04-1] "See inline" [puppet] - 10https://gerrit.wikimedia.org/r/1070142 (https://phabricator.wikimedia.org/T369583) (owner: 10Stevemunene) [09:38:52] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:39:44] (03CR) 10Jelto: [C:03+1] lists: Mask mailman3 service on non-active host [puppet] - 10https://gerrit.wikimedia.org/r/1071237 (owner: 10EoghanGaffney) [09:40:00] FIRING: [4x] ProbeDown: Service idm1001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:40:25] FIRING: [2x] SystemdUnitFailed: rq-bitu.service on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:40:35] (03PS1) 10David Caro: prometheus: Add missing maintain_dbusers_primary key [puppet] - 10https://gerrit.wikimedia.org/r/1071573 [09:41:17] (03PS2) 10EoghanGaffney: lists: Mask mailman3 service on non-active host [puppet] - 10https://gerrit.wikimedia.org/r/1071237 [09:42:07] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3909/console" [puppet] - 10https://gerrit.wikimedia.org/r/1071573 (owner: 10David Caro) [09:42:54] !log isaranto@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
[09:42:59] (03CR) 10David Caro: [V:03+1] prometheus: Add missing maintain_dbusers_primary key (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071573 (owner: 10David Caro) [09:43:10] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3910/co" [puppet] - 10https://gerrit.wikimedia.org/r/1071237 (owner: 10EoghanGaffney) [09:43:18] (03PS2) 10David Caro: prometheus: Add missing maintain_dbusers_primary key [puppet] - 10https://gerrit.wikimedia.org/r/1071573 [09:43:23] (03CR) 10David Caro: prometheus: Add missing maintain_dbusers_primary key (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071573 (owner: 10David Caro) [09:43:55] (03PS3) 10David Caro: prometheus: Add missing maintain_dbusers_primary key [puppet] - 10https://gerrit.wikimedia.org/r/1071573 [09:44:02] (03CR) 10David Caro: "Now really ready, sorry" [puppet] - 10https://gerrit.wikimedia.org/r/1071573 (owner: 10David Caro) [09:44:08] (03PS1) 10Brouberol: cloudnative-pg: include the restricted security context in the test pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071574 (https://phabricator.wikimedia.org/T369492) [09:44:21] !log isaranto@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [09:44:27] (03PS4) 10David Caro: prometheus: Add missing maintain_dbusers_primary key [puppet] - 10https://gerrit.wikimedia.org/r/1071573 [09:44:38] (03CR) 10CI reject: [V:04-1] afka/roll-restart-reboot-brokers: Add exclude and no-election options [cookbooks] - 10https://gerrit.wikimedia.org/r/1071559 (https://phabricator.wikimedia.org/T373189) (owner: 10JMeybohm) [09:44:46] !log isaranto@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . 
[09:45:00] RESOLVED: [4x] ProbeDown: Service idm1001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:45:07] (03PS6) 10JMeybohm: kafka/roll-restart-reboot-brokers: Add exclude and no-election options [cookbooks] - 10https://gerrit.wikimedia.org/r/1071559 (https://phabricator.wikimedia.org/T373189) [09:45:08] (03CR) 10EoghanGaffney: [V:03+1 C:03+2] lists: Mask mailman3 service on non-active host [puppet] - 10https://gerrit.wikimedia.org/r/1071237 (owner: 10EoghanGaffney) [09:45:27] (03CR) 10Muehlenhoff: [C:03+2] standard_packages: Remove more obsolete packages after bullseye->bookworm update [puppet] - 10https://gerrit.wikimedia.org/r/1071569 (owner: 10Muehlenhoff) [09:45:31] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers kubernetes1025.eqiad.wmnet, mw1367.eqiad.wmnet, mw1434.eqiad.wmnet, mw1386.eqiad.wmnet, mw1430.eqiad.wmnet, mw1388.eqiad.wmnet, mw1480.eqiad.wmnet, mw1484.eqiad.wmnet, kubernetes1030.eqiad.wmnet, mw1454.eqiad.wmnet, parse1010.eqiad.wmnet, parse1005.eqiad.wmnet, mw1408.eqiad.wmnet, mw1370.eqiad.wmnet, mw1389.eqiad.wmnet, [09:45:31] tes1017.eqiad.wmnet, mw1465.eqiad.wmnet, kubernetes1033.eqiad.wmnet, mw1466.eqiad.wmnet, kubernetes1018.eqiad.wmnet, mw1419.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1058.eqiad.wmnet, mw1360.eqiad.wmnet, mw1356.eqiad.wmnet, mw1483.eqiad.wmnet, mw1371.eqiad.wmnet, parse1012.eqiad.wmnet, mw1453.eqiad.wmnet, kubernetes1028.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1031.eqiad.wmnet, parse1019.eqia [09:45:31] mw1381.eqiad.wmnet, parse1021.eqiad.wmnet, parse1003.eqiad.wmnet, mw1441.eqiad.wmnet, wikikube-worker1028.eqiad.wmnet, mw1472.eqiad.wmnet, wikikube-worker1031.eqiad.wmnet, wikikube-wor 
https://wikitech.wikimedia.org/wiki/PyBal [09:45:54] eoghan: I'll merge your mailman patch along [09:46:09] (03CR) 10Btullis: [C:03+1] cloudnative-pg: include the restricted security context in the test pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071574 (https://phabricator.wikimedia.org/T369492) (owner: 10Brouberol) [09:46:22] (03PS7) 10JMeybohm: kafka/roll-restart-reboot-brokers: Add exclude and no-election options [cookbooks] - 10https://gerrit.wikimedia.org/r/1071559 (https://phabricator.wikimedia.org/T373189) [09:47:29] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:47:51] (03CR) 10Brouberol: [C:03+2] cloudnative-pg: include the restricted security context in the test pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071574 (https://phabricator.wikimedia.org/T369492) (owner: 10Brouberol) [09:48:44] moritzm: Thanks! [09:48:47] (03CR) 10Jelto: [C:03+2] admin_ng: add wikidata-query-gui service namespace to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069953 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [09:49:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071566 (owner: 10Seanleong-wmde) [09:49:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071566 (owner: 10Seanleong-wmde) [09:50:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071566 (owner: 10Seanleong-wmde) [09:52:20] 
(03PS5) 10David Caro: prometheus: Add missing maintain_dbusers_primary key [puppet] - 10https://gerrit.wikimedia.org/r/1071573 [09:52:34] (03Merged) 10jenkins-bot: admin_ng: add wikidata-query-gui service namespace to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069953 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [09:54:10] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1071028 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [09:56:16] (03PS5) 10Muehlenhoff: vrts: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055495 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [09:56:17] (03CR) 10Muehlenhoff: "Looks good, all remaining firewall definitions are compatible with nftables" [puppet] - 10https://gerrit.wikimedia.org/r/1055495 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [09:56:37] (03PS4) 10Ilias Sarantopoulos: amd-pytorch: change image ownership to ml team [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1071191 [09:56:45] (03CR) 10Clément Goubert: [C:03+2] sre.k8s.renumber-node: Run puppet on deploy servers [cookbooks] - 10https://gerrit.wikimedia.org/r/1070903 (owner: 10Clément Goubert) [09:56:52] (03CR) 10Clément Goubert: [C:03+2] sre.k8s.renumber-node: Refactor logging and error handling [cookbooks] - 10https://gerrit.wikimedia.org/r/1070904 (owner: 10Clément Goubert) [09:56:58] (03CR) 10Clément Goubert: [C:03+2] sre.k8s.renumber-node: Run puppet on registry [cookbooks] - 10https://gerrit.wikimedia.org/r/1070922 (owner: 10Clément Goubert) [09:57:07] (03PS6) 10David Caro: prometheus: Add missing maintain_dbusers_primary key [puppet] - 10https://gerrit.wikimedia.org/r/1071573 [09:57:23] (03CR) 10Clément Goubert: [C:03+1] renumber-node: Allow the cookbook to run for kubestage nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1071071 (owner: 10JMeybohm) [09:58:50] (03CR) 10Filippo Giunchedi: 
[C:03+1] prometheus: Add missing maintain_dbusers_primary key [puppet] - 10https://gerrit.wikimedia.org/r/1071573 (owner: 10David Caro) [09:59:23] (03CR) 10David Caro: [C:03+2] prometheus: Add missing maintain_dbusers_primary key [puppet] - 10https://gerrit.wikimedia.org/r/1071573 (owner: 10David Caro) [09:59:29] (03PS1) 10Hnowlan: rest-gateway: remove knowledge-gap configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071575 (https://phabricator.wikimedia.org/T342213) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240909T1000) [10:10:28] (03Merged) 10jenkins-bot: sre.k8s.renumber-node: Run puppet on deploy servers [cookbooks] - 10https://gerrit.wikimedia.org/r/1070903 (owner: 10Clément Goubert) [10:10:28] (03Merged) 10jenkins-bot: sre.k8s.renumber-node: Refactor logging and error handling [cookbooks] - 10https://gerrit.wikimedia.org/r/1070904 (owner: 10Clément Goubert) [10:10:45] (03PS5) 10Ilias Sarantopoulos: amd-pytorch: change image ownership to ml team [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1071191 (https://phabricator.wikimedia.org/T374233) [10:11:00] (03Merged) 10jenkins-bot: sre.k8s.renumber-node: Run puppet on registry [cookbooks] - 10https://gerrit.wikimedia.org/r/1070922 (owner: 10Clément Goubert) [10:12:09] (03CR) 10Elukey: [C:03+1] "LGTM, left a comment to add extra safety, but if you don't like it you can proceed :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1071559 (https://phabricator.wikimedia.org/T373189) (owner: 10JMeybohm) [10:12:49] (03PS1) 10David Caro: prometheus: skip maintain_dbusers jobs if empty [puppet] - 10https://gerrit.wikimedia.org/r/1071577 [10:13:14] (03CR) 10CI reject: [V:04-1] prometheus: skip maintain_dbusers jobs if empty [puppet] - 10https://gerrit.wikimedia.org/r/1071577 (owner: 10David Caro) [10:16:13] (03PS1) 10Muehlenhoff: Add cloudidm* to cloud-codfw1dev Cumin alias [puppet] - 
10https://gerrit.wikimedia.org/r/1071578 [10:18:00] !log jelto@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [10:18:19] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [10:19:47] (03CR) 10Muehlenhoff: P:idp Prometheus blackbox monitoring for IDP. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071131 (https://phabricator.wikimedia.org/T3670655) (owner: 10Slyngshede) [10:19:59] !log jelto@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [10:20:57] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [10:23:14] (03PS1) 10David Caro: Revert "prometheus::cloud: add maintaindbusers target" [puppet] - 10https://gerrit.wikimedia.org/r/1071579 [10:27:45] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3914/console" [puppet] - 10https://gerrit.wikimedia.org/r/1071579 (owner: 10David Caro) [10:27:49] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: remove knowledge-gap configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071575 (https://phabricator.wikimedia.org/T342213) (owner: 10Hnowlan) [10:28:10] (03CR) 10David Caro: [V:03+1 C:03+2] Revert "prometheus::cloud: add maintaindbusers target" [puppet] - 10https://gerrit.wikimedia.org/r/1071579 (owner: 10David Caro) [10:29:54] (03CR) 10Btullis: [C:03+1] "Looks good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070977 (https://phabricator.wikimedia.org/T361346) (owner: 10Santiago Faci) [10:30:40] !log jelto@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
[10:31:42] !log jelto@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [10:32:03] !log jelto@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [10:32:44] (03CR) 10Santiago Faci: [C:03+2] MPIC: Moving monitoring configuration from chart to helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070977 (https://phabricator.wikimedia.org/T361346) (owner: 10Santiago Faci) [10:33:36] !log jelto@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [10:33:47] !log jelto@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [10:34:24] (03Merged) 10jenkins-bot: MPIC: Moving monitoring configuration from chart to helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070977 (https://phabricator.wikimedia.org/T361346) (owner: 10Santiago Faci) [10:34:32] !log jelto@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [10:37:47] (03PS1) 10Santiago Faci: MPIC: Deploying a new release v0.1.5 to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071580 (https://phabricator.wikimedia.org/T361346) [10:39:55] (03PS2) 10David Caro: prometheus::cloud: add maintaindbusers target [puppet] - 10https://gerrit.wikimedia.org/r/1071577 [10:40:18] (03CR) 10CI reject: [V:04-1] prometheus::cloud: add maintaindbusers target [puppet] - 10https://gerrit.wikimedia.org/r/1071577 (owner: 10David Caro) [10:42:20] (03PS3) 10David Caro: prometheus::cloud: add maintaindbusers target [puppet] - 10https://gerrit.wikimedia.org/r/1071577 [10:43:31] (03PS1) 10Brouberol: cloudnative-pg: allow the test pod to reach the PG instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071581 (https://phabricator.wikimedia.org/T373503) [10:45:20] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2127.codfw.wmnet onto db2227.codfw.wmnet [10:45:51] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1071577 (owner: 10David Caro) [10:46:36] (03CR) 10David Caro: prometheus::cloud: add maintaindbusers target [puppet] - 10https://gerrit.wikimedia.org/r/1071577 (owner: 10David Caro) [10:48:06] (03PS1) 10Muehlenhoff: standard_packages: Remove more obsolete packages after bullseye->bookworm update [puppet] - 10https://gerrit.wikimedia.org/r/1071582 [10:50:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 25%: post db2227 clone repool', diff saved to https://phabricator.wikimedia.org/P68746 and previous config saved to /var/cache/conftool/dbconfig/20240909-105056-arnaudb.json [10:53:57] (03CR) 10Btullis: [C:03+1] "Nice, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071581 (https://phabricator.wikimedia.org/T373503) (owner: 10Brouberol) [10:54:16] (03CR) 10Brouberol: [C:03+2] cloudnative-pg: allow the test pod to reach the PG instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071581 (https://phabricator.wikimedia.org/T373503) (owner: 10Brouberol) [10:55:06] (03CR) 10Elukey: "Hey Ilias I think that the rebase didn't work, the changelog is not correct anymore :(" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1071191 (https://phabricator.wikimedia.org/T374233) (owner: 10Ilias Sarantopoulos) [11:01:10] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2293.codfw.wmnet [11:01:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2293.codfw.wmnet [11:01:24] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129371 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2293.codfw.wmnet [11:01:25] !log jmm@cumin2002 START - Cookbook 
sre.debmonitor.remove-hosts for 1 hosts: mw2295.codfw.wmnet [11:01:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2295.codfw.wmnet [11:01:32] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2296.codfw.wmnet [11:01:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2296.codfw.wmnet [11:01:35] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129372 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2295.codfw.wmnet [11:01:43] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129373 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2296.codfw.wmnet [11:01:53] (03CR) 10Brouberol: [C:03+2] dse-k8s-eqiad: Disable PSP [puppet] - 10https://gerrit.wikimedia.org/r/1069945 (https://phabricator.wikimedia.org/T369492) (owner: 10Brouberol) [11:02:08] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2312.codfw.wmnet [11:02:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2312.codfw.wmnet [11:02:16] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2316.codfw.wmnet [11:02:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2316.codfw.wmnet [11:02:19] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129375 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2312.codfw.wmnet [11:02:27] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube 
servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129376 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2316.codfw.wmnet [11:02:44] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2317.codfw.wmnet [11:02:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2317.codfw.wmnet [11:02:51] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2318.codfw.wmnet [11:02:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2318.codfw.wmnet [11:02:58] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129391 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2317.codfw.wmnet [11:02:58] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2319.codfw.wmnet [11:02:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2319.codfw.wmnet [11:03:02] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129393 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2318.codfw.wmnet [11:03:08] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129394 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2319.codfw.wmnet [11:03:32] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2377.codfw.wmnet [11:03:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
mw2377.codfw.wmnet [11:03:39] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2378.codfw.wmnet [11:03:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2378.codfw.wmnet [11:03:46] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129395 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2377.codfw.wmnet [11:03:48] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129396 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2378.codfw.wmnet [11:04:47] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129397 (10MoritzMuehlenhoff) mw2379 is also still in puppetboard: https://puppetboard.wikimedia.org/catalog/mw2379.codfw.wmnet [11:05:19] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2380.codfw.wmnet [11:05:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2380.codfw.wmnet [11:05:25] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:05:26] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2381.codfw.wmnet [11:05:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2381.codfw.wmnet [11:05:32] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2382.codfw.wmnet [11:05:33] !log 
jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2382.codfw.wmnet [11:05:33] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129398 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2380.codfw.wmnet [11:05:38] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129399 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2381.codfw.wmnet [11:05:38] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2383.codfw.wmnet [11:05:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2383.codfw.wmnet [11:05:41] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129400 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2382.codfw.wmnet [11:05:49] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129401 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2383.codfw.wmnet [11:06:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 50%: post db2227 clone repool', diff saved to https://phabricator.wikimedia.org/P68747 and previous config saved to /var/cache/conftool/dbconfig/20240909-110601-arnaudb.json [11:06:04] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2384.codfw.wmnet [11:06:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2384.codfw.wmnet [11:06:10] !log 
jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2385.codfw.wmnet [11:06:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2385.codfw.wmnet [11:06:15] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129404 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2384.codfw.wmnet [11:06:16] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2386.codfw.wmnet [11:06:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2386.codfw.wmnet [11:06:20] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129405 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2385.codfw.wmnet [11:06:24] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2387.codfw.wmnet [11:06:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2387.codfw.wmnet [11:06:26] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129406 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2386.codfw.wmnet [11:06:34] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129407 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2387.codfw.wmnet [11:06:48] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2388.codfw.wmnet [11:06:49] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2388.codfw.wmnet [11:06:54] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2389.codfw.wmnet [11:06:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2389.codfw.wmnet [11:07:03] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129408 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2388.codfw.wmnet [11:07:06] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129409 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2389.codfw.wmnet [11:07:23] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2402.codfw.wmnet [11:07:23] (03CR) 10Ilias Sarantopoulos: "Hey! I fixed the conflicts manually by merging the changelog entries and sorting them by date." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1071191 (https://phabricator.wikimedia.org/T374233) (owner: 10Ilias Sarantopoulos) [11:07:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2402.codfw.wmnet [11:07:29] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2406.codfw.wmnet [11:07:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2406.codfw.wmnet [11:07:35] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129410 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2402.codfw.wmnet [11:07:41] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129411 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2406.codfw.wmnet [11:07:43] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2407.codfw.wmnet [11:07:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2407.codfw.wmnet [11:07:58] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129412 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2407.codfw.wmnet [11:08:07] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2420.codfw.wmnet [11:08:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2420.codfw.wmnet [11:08:13] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2422.codfw.wmnet [11:08:14] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2422.codfw.wmnet [11:08:36] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2423.codfw.wmnet [11:08:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2423.codfw.wmnet [11:08:55] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129416 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2420.codfw.wmnet [11:09:01] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2430.codfw.wmnet [11:09:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2430.codfw.wmnet [11:09:08] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2434.codfw.wmnet [11:09:09] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129418 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2422.codfw.wmnet [11:09:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2434.codfw.wmnet [11:09:15] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2435.codfw.wmnet [11:09:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2435.codfw.wmnet [11:09:22] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129419 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2423.codfw.wmnet [11:09:25] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2379.codfw.wmnet [11:09:26] !log 
jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2379.codfw.wmnet [11:09:38] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129420 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2430.codfw.wmnet [11:09:46] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129421 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2434.codfw.wmnet [11:09:55] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129422 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2435.codfw.wmnet [11:10:01] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129423 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: mw2379.codfw.wmnet [11:11:02] (03CR) 10Hnowlan: [C:03+2] rest-gateway: remove knowledge-gap configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071575 (https://phabricator.wikimedia.org/T342213) (owner: 10Hnowlan) [11:12:09] (03Merged) 10jenkins-bot: rest-gateway: remove knowledge-gap configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071575 (https://phabricator.wikimedia.org/T342213) (owner: 10Hnowlan) [11:12:12] (03PS1) 10Clément Goubert: kubernetes: Rename two workers [puppet] - 10https://gerrit.wikimedia.org/r/1071583 (https://phabricator.wikimedia.org/T372878) [11:12:37] (03CR) 10Sg912: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071580 (https://phabricator.wikimedia.org/T361346) 
(owner: 10Santiago Faci) [11:17:27] (03CR) 10Santiago Faci: [C:03+2] MPIC: Deploying a new release v0.1.5 to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071580 (https://phabricator.wikimedia.org/T361346) (owner: 10Santiago Faci) [11:18:31] (03Merged) 10jenkins-bot: MPIC: Deploying a new release v0.1.5 to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071580 (https://phabricator.wikimedia.org/T361346) (owner: 10Santiago Faci) [11:20:34] (03Abandoned) 10Brouberol: Upgrade airflow to 2.10.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067352 (https://phabricator.wikimedia.org/T372284) (owner: 10Brouberol) [11:21:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 75%: post db2227 clone repool', diff saved to https://phabricator.wikimedia.org/P68748 and previous config saved to /var/cache/conftool/dbconfig/20240909-112107-arnaudb.json [11:21:10] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [11:22:12] (03CR) 10Jgiannelos: changeprop: Enable PCS pregeneration without restbase (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) (owner: 10Jgiannelos) [11:25:30] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [11:25:45] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [11:30:50] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2220.codfw.wmnet with reason: Maintenance [11:31:03] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2220.codfw.wmnet with reason: Maintenance [11:31:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2220 (T370903)', diff saved to https://phabricator.wikimedia.org/P68749 and previous config saved to /var/cache/conftool/dbconfig/20240909-113110-ladsgroup.json [11:31:16] T370903: 
Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [11:34:27] 06SRE, 10Dumps 2.0, 10Dumps-Generation, 13Patch-For-Review: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10129481 (10Ladsgroup) [11:36:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 100%: post db2227 clone repool', diff saved to https://phabricator.wikimedia.org/P68750 and previous config saved to /var/cache/conftool/dbconfig/20240909-113613-arnaudb.json [11:36:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:37:00] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops-radar: Race condition on puppetdb in sre.hosts.rename cookbook - https://phabricator.wikimedia.org/T374351 (10Clement_Goubert) 03NEW [11:38:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T370903)', diff saved to https://phabricator.wikimedia.org/P68751 and previous config saved to /var/cache/conftool/dbconfig/20240909-113759-ladsgroup.json [11:38:03] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [11:38:30] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2140.codfw.wmnet with reason: Maintenance [11:38:43] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2140.codfw.wmnet with reason: Maintenance [11:38:47] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:38:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2140 (T371742)', diff saved to https://phabricator.wikimedia.org/P68752 and previous config saved to /var/cache/conftool/dbconfig/20240909-113849-ladsgroup.json [11:38:53] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [11:49:15] (03PS7) 10Stevemunene: Configure prometheus metrics on the cephosd cluster [puppet] - 10https://gerrit.wikimedia.org/r/1070142 (https://phabricator.wikimedia.org/T369583) [11:49:36] (03CR) 10CI reject: [V:04-1] Configure prometheus metrics on the cephosd cluster [puppet] - 10https://gerrit.wikimedia.org/r/1070142 (https://phabricator.wikimedia.org/T369583) (owner: 10Stevemunene) [11:50:46] (03PS8) 10Stevemunene: Configure prometheus metrics on the cephosd cluster [puppet] - 10https://gerrit.wikimedia.org/r/1070142 (https://phabricator.wikimedia.org/T369583) [11:52:03] (03CR) 10Slyngshede: P:idp Prometheus blackbox monitoring for IDP. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071131 (https://phabricator.wikimedia.org/T3670655) (owner: 10Slyngshede) [11:53:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P68753 and previous config saved to /var/cache/conftool/dbconfig/20240909-115306-ladsgroup.json [11:53:56] (03CR) 10Stevemunene: Configure prometheus metrics on the cephosd cluster (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1070142 (https://phabricator.wikimedia.org/T369583) (owner: 10Stevemunene) [11:53:56] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:54:09] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:55:40] (03CR) 10Slyngshede: [V:03+1] P:mirrors::debian Export mirror age to textfile exporter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1003442 (owner: 10Slyngshede) [11:55:45] (03CR) 10Arnaudb: [C:03+1] "great!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [11:58:47] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:01:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:02:41] (03CR) 10Filippo Giunchedi: [C:03+1] prometheus::cloud: add maintaindbusers target [puppet] - 10https://gerrit.wikimedia.org/r/1071577 (owner: 10David Caro) [12:05:25] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:06:03] (03PS2) 10Dreamy Jazz: Define wgCheckUserCentralIndexRangesToExclude to exclude WMCS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071251 (https://phabricator.wikimedia.org/T373021) [12:07:52] (03CR) 10Filippo Giunchedi: [C:03+1] standard_packages: Remove more obsolete packages after bullseye->bookworm update [puppet] - 10https://gerrit.wikimedia.org/r/1071582 (owner: 10Muehlenhoff) [12:07:58] !log Restarting MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration [12:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:06] (03PS1) 10Clément Goubert: sre.hosts.rename: Disable puppet to avoid race-condition [cookbooks] - 10https://gerrit.wikimedia.org/r/1071588 (https://phabricator.wikimedia.org/T374351) [12:08:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db2220', diff saved to https://phabricator.wikimedia.org/P68754 and previous config saved to /var/cache/conftool/dbconfig/20240909-120814-ladsgroup.json [12:08:57] (03CR) 10Filippo Giunchedi: P:idp Prometheus blackbox monitoring for IDP. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071131 (https://phabricator.wikimedia.org/T3670655) (owner: 10Slyngshede) [12:10:54] (03PS3) 10Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T374191) [12:11:20] (03CR) 10EoghanGaffney: "For the rest REST runner, I don't think having multiple would cause an issue. The documentation you linked to mentions that the default nu" [puppet] - 10https://gerrit.wikimedia.org/r/1071049 (owner: 10EoghanGaffney) [12:11:36] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host dragonfly-supernode2001.codfw.wmnet [12:12:02] (03CR) 10Filippo Giunchedi: [C:03+1] Configure prometheus metrics on the cephosd cluster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1070142 (https://phabricator.wikimedia.org/T369583) (owner: 10Stevemunene) [12:13:08] (03CR) 10Muehlenhoff: [C:03+2] standard_packages: Remove more obsolete packages after bullseye->bookworm update [puppet] - 10https://gerrit.wikimedia.org/r/1071582 (owner: 10Muehlenhoff) [12:14:55] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:15:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:17:29] (03CR) 10David Caro: [C:03+2] prometheus::cloud: add maintaindbusers target [puppet] - 10https://gerrit.wikimedia.org/r/1071577 (owner: 10David Caro) [12:18:23] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:18:33] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[12:22:09] (03PS1) 10Brouberol: spark-operator: enable the definition of securitycontext.seccompProfile for spark containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071596 (https://phabricator.wikimedia.org/T369492) [12:22:56] (03CR) 10Stevemunene: [C:03+2] Configure prometheus metrics on the cephosd cluster [puppet] - 10https://gerrit.wikimedia.org/r/1070142 (https://phabricator.wikimedia.org/T369583) (owner: 10Stevemunene) [12:23:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T370903)', diff saved to https://phabricator.wikimedia.org/P68755 and previous config saved to /var/cache/conftool/dbconfig/20240909-122321-ladsgroup.json [12:23:25] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [12:24:23] (03PS1) 10Muehlenhoff: Switch dragonfly-supernode2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1071599 (https://phabricator.wikimedia.org/T349619) [12:25:38] (03CR) 10Muehlenhoff: [C:03+2] Switch dragonfly-supernode2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1071599 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:28:31] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2431.codfw.wmnet [12:28:34] (03CR) 10Kamila Součková: [C:03+2] kubernetes: rename mw2431 to wikikube-worker2104 [puppet] - 10https://gerrit.wikimedia.org/r/1071246 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková) [12:29:05] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2431.codfw.wmnet [12:29:38] kamila_: I'll merge your patch along [12:29:58] moritzm: thanks! 
[12:33:05] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw2431 to wikikube-worker2104 [12:33:11] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [12:33:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host dragonfly-supernode2001.codfw.wmnet [12:34:25] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:35:07] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:36:39] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2431 to wikikube-worker2104 - kamila@cumin1002" [12:37:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2431 to wikikube-worker2104 - kamila@cumin1002" [12:37:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:37:07] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2104 [12:37:52] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2104 [12:38:31] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2431 to wikikube-worker2104 [12:39:30] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129651 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by kamila@cumin1002 from mw2431 to wikikube-worker2104 completed: - mw2431 (**PAS... 
[12:39:49] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:40:02] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:41:13] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2104.codfw.wmnet on all recursors [12:41:16] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2104.codfw.wmnet on all recursors [12:43:16] !log kamila@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2104.codfw.wmnet [12:43:27] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129663 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by kamila@cumin1002 Renumbering for host wikikube-worker2104.codfw.wmnet [12:43:37] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2104.codfw.wmnet with OS bullseye [12:43:47] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2104 [12:43:49] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-worker2104.codfw.wmnet with OS bullseye [12:43:58] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [12:48:01] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2104 - kamila@cumin1002" [12:48:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2104 - kamila@cumin1002" [12:48:06] !log kamila@cumin1002 END (PASS) - Cookbook 
sre.dns.netbox (exit_code=0) [12:48:06] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2104.codfw.wmnet 61.16.192.10.in-addr.arpa 1.6.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:48:09] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2104.codfw.wmnet 61.16.192.10.in-addr.arpa 1.6.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:48:10] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2104 [12:48:23] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2104 [12:48:23] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2104 [12:51:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [12:54:49] jouncebot: nowandnext [12:54:50] No deployments scheduled for the next 0 hour(s) and 5 minute(s) [12:54:50] In 0 hour(s) and 5 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240909T1300) [12:55:49] (03CR) 10Dreamy Jazz: Define wgCheckUserCentralIndexRangesToExclude to exclude WMCS (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071251 (https://phabricator.wikimedia.org/T373021) (owner: 10Dreamy Jazz) [12:55:49] (03CR) 10Muehlenhoff: [C:03+2] Add cloudidm* to cloud-codfw1dev Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1071578 (owner: 10Muehlenhoff) [12:56:04] Going to start early if that's okay. 
[12:56:40] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudgw: introduce support for multiple flat networks [puppet] - 10https://gerrit.wikimedia.org/r/1071189 (https://phabricator.wikimedia.org/T374020) (owner: 10Arturo Borrero Gonzalez) [12:58:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071251 (https://phabricator.wikimedia.org/T373021) (owner: 10Dreamy Jazz) [12:58:18] jouncebot: refresh [12:58:18] I refreshed my knowledge about deployments. [12:58:42] (03Merged) 10jenkins-bot: Define wgCheckUserCentralIndexRangesToExclude to exclude WMCS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071251 (https://phabricator.wikimedia.org/T373021) (owner: 10Dreamy Jazz) [12:59:08] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1071251|Define wgCheckUserCentralIndexRangesToExclude to exclude WMCS (T373021)]] [12:59:12] T373021: Write to cuci_user table when CheckUser actions occur - https://phabricator.wikimedia.org/T373021 [12:59:16] seanleong-wmde: it looks like your change is listed 3 times on https://wikitech.wikimedia.org/wiki/Deployments - was it supposed to be just one, or three different changes? [12:59:30] (03PS1) 10Muehlenhoff: Add Cumin alias for ncmonitor [puppet] - 10https://gerrit.wikimedia.org/r/1071606 [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240909T1300). [13:00:05] Dreamy_Jazz, James_F, jan_drewniak, hnowlan, and seanleong-wmde: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:10] \o [13:00:17] I don’t think I can deploy, sorry [13:01:12] Hey [13:01:22] \o [13:01:58] I'll deploy. [13:02:09] Currently deploying my config change btw [13:02:23] Oh, OK, I'll let Dreamy_Jazz deploy. [13:02:36] I am supposed to be in a meeting shortly, so don't mind either way [13:02:38] But at least I'll start the CI ball rolling. [13:02:40] (03CR) 10Jforrester: [C:03+2] tests: Disable all Beta Cluster CI testing, all failing [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071253 (https://phabricator.wikimedia.org/T374242) (owner: 10Jforrester) [13:02:41] (03CR) 10Jforrester: [C:03+2] Don't pass empty type/returnType to zobject lookup when undefined [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071254 (https://phabricator.wikimedia.org/T374199) (owner: 10Jforrester) [13:02:43] (03CR) 10Jforrester: [C:03+2] Use default width/height on gallery to avoid parser instance [extensions/UploadWizard] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071265 (https://phabricator.wikimedia.org/T374146) (owner: 10Jforrester) [13:03:40] (03PS8) 10JMeybohm: kafka/roll-restart-reboot-brokers: Add exclude and no-election options [cookbooks] - 10https://gerrit.wikimedia.org/r/1071559 (https://phabricator.wikimedia.org/T373189) [13:03:51] PROBLEM - SSH on prometheus1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:03:52] MatmaRex Hi, it's supposed to be only 1, I accidentally double clicked when it was loading, sorry [13:04:24] (03CR) 10JMeybohm: kafka/roll-restart-reboot-brokers: Add exclude and no-election options (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1071559 (https://phabricator.wikimedia.org/T373189) (owner: 10JMeybohm) [13:04:34] seanleong-wmde: no problem, just clarifying [13:04:45] I'll trim. [13:04:54] Thanks!
[13:05:22] FIRING: [2x] ProbeDown: Service prometheus1006:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#prometheus1006:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:06:21] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2104.codfw.wmnet with reason: host reimage [13:06:23] Got some errors when doing docker_pull_k8s [13:06:37] What kind of errors? [13:07:28] (03PS1) 10Muehlenhoff: Add Cumin aliases for moss Ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/1071609 [13:07:37] 502 / unauthorised / EOF [13:07:55] 18 k8s nodes failed to pull the multiversion image [13:08:31] Will re-try [13:08:43] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1071251|Define wgCheckUserCentralIndexRangesToExclude to exclude WMCS (T373021)]] [13:08:47] T373021: Write to cuci_user table when CheckUser actions occur - https://phabricator.wikimedia.org/T373021 [13:09:31] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2104.codfw.wmnet with reason: host reimage [13:10:43] FIRING: JobUnavailable: Reduced availability for job thanos-sidecar in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:11:24] (03Merged) 10jenkins-bot: tests: Disable all Beta Cluster CI testing, all failing [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071253 (https://phabricator.wikimedia.org/T374242) (owner: 10Jforrester) [13:11:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:11:43] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10129757 (10MoritzMuehlenhoff) [13:11:46] Running a second time worked. [13:12:59] (03Merged) 10jenkins-bot: Don't pass empty type/returnType to zobject lookup when undefined [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071254 (https://phabricator.wikimedia.org/T374199) (owner: 10Jforrester) [13:13:35] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1071251|Define wgCheckUserCentralIndexRangesToExclude to exclude WMCS (T373021)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:13:37] (03PS1) 10JMeybohm: kafka-main: Replace kafka-main2002 with kafka-main2007 [puppet] - 10https://gerrit.wikimedia.org/r/1071610 (https://phabricator.wikimedia.org/T363210) [13:13:59] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [13:15:27] (03PS16) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) [13:16:23] Seeing some more errors with deployment [13:16:24] (03Merged) 10jenkins-bot: Use default width/height on gallery to avoid parser instance [extensions/UploadWizard] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071265 (https://phabricator.wikimedia.org/T374146) (owner: 10Jforrester) [13:17:42] (03PS9) 10JMeybohm: kafka/roll-restart-reboot-brokers: Add exclude and no-election options [cookbooks] - 10https://gerrit.wikimedia.org/r/1071559 (https://phabricator.wikimedia.org/T373189) [13:18:21] Deployment failed, it's rolling back [13:18:34] Oh dear. [13:19:04] (03CR) 10Elukey: "Exactly yes!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1071191 (https://phabricator.wikimedia.org/T374233) (owner: 10Ilias Sarantopoulos) [13:19:05] Talking about the kubernetes configuration being group readable [13:19:57] Hmm. That should be OK? [13:20:17] Possibly there's a new k8s version being rolled out? [13:20:29] Also: [13:20:33] ```Error: UPGRADE FAILED: release canary failed, and has been rolled back due to atomic being set: cannot patch "mediawiki-canary-tls-proxy-certs" with kind Certificate: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.192.72.77:443: i/o timeout``` [13:21:04] That's a network miss? [13:21:25] the configuration being readable error can be ignored. The cert-manager error is a bit more concerning, trying to look at it now [13:21:27] I can't really inspect the output well because it dumped a lot of text when it errored [13:21:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:21:59] ^^ certificaterequests sounds like it's relevant. [13:22:11] Thanks hnowlan. [13:22:23] that's another namespace, hopefully unrelated [13:22:26] odd timing [13:22:29] so maybe not [13:22:33] Ack. [13:23:27] (03CR) 10Btullis: [C:03+1] "Cool, thanks." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1071596 (https://phabricator.wikimedia.org/T369492) (owner: 10Brouberol) [13:23:50] RECOVERY - SSH on prometheus1006 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:25:22] RESOLVED: [2x] ProbeDown: Service prometheus1006:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#prometheus1006:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:25:37] (03CR) 10Brouberol: [C:03+2] spark-operator: enable the definition of securitycontext.seccompProfile for spark containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071596 (https://phabricator.wikimedia.org/T369492) (owner: 10Brouberol) [13:25:43] RESOLVED: JobUnavailable: Reduced availability for job thanos-sidecar in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:25:53] Dreamy_Jazz: Want me to take over? I presume there's not really a validation step for the CU config change? [13:26:02] (03PS2) 10Clément Goubert: kubernetes: Rename two workers [puppet] - 10https://gerrit.wikimedia.org/r/1071583 (https://phabricator.wikimedia.org/T372878) [13:26:04] My meeting has finished, but don't mind. [13:26:10] There are no validation steps for the CU change [13:26:14] * James_F nods. [13:26:27] I've got two UBNs to deploy, so I have more skin in the game. :-) [13:26:36] Yeah...
[13:26:41] (03PS1) 10Máté Szabó: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071613 [13:29:32] (03PS6) 10Ilias Sarantopoulos: amd-pytorch: change image ownership to ml team [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1071191 (https://phabricator.wikimedia.org/T374233) [13:29:38] (03CR) 10STran: [C:03+1] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071613 (owner: 10Máté Szabó) [13:30:09] jouncebot: nowandnext [13:30:09] For the next 0 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240909T1300) [13:30:09] In 1 hour(s) and 59 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240909T1530) [13:30:22] The window could probably be extended if necessary [13:30:25] (03PS7) 10Ilias Sarantopoulos: amd-pytorch: change image ownership to ml team [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1071191 (https://phabricator.wikimedia.org/T374233) [13:30:27] (03PS1) 10Ssingh: ntp: standardize the use of ntpsec in the configuration as well [puppet] - 10https://gerrit.wikimedia.org/r/1071616 [13:30:36] Yup. [13:31:08] Whilst we wait, I can at least do the Beta-only one. [13:31:18] Yeah. [13:31:23] (03CR) 10Ilias Sarantopoulos: "Done!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1071191 (https://phabricator.wikimedia.org/T374233) (owner: 10Ilias Sarantopoulos) [13:31:37] (03PS5) 10Jforrester: [BETA CLUSTER] Add Web search experiment quickSurvey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071241 (https://phabricator.wikimedia.org/T373039) (owner: 10Jdrewniak) [13:31:41] (03PS6) 10Jforrester: [BETA CLUSTER] Add Web search experiment quickSurvey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071241 (https://phabricator.wikimedia.org/T373039) (owner: 10Jdrewniak) [13:31:43] (03CR) 10Jforrester: [C:03+2] [BETA CLUSTER] Add Web search experiment quickSurvey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071241 (https://phabricator.wikimedia.org/T373039) (owner: 10Jdrewniak) [13:31:46] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3916/co" [puppet] - 10https://gerrit.wikimedia.org/r/1071616 (owner: 10Ssingh) [13:32:14] (03PS1) 10Cathal Mooney: Manually define BGP neighbors for cephosd1*** Anycast BGP [puppet] - 10https://gerrit.wikimedia.org/r/1071617 (https://phabricator.wikimedia.org/T330153) [13:32:29] !log milimetric@deploy1003 Started deploy [airflow-dags/platform_eng@574f0de]: (no justification provided) [13:32:29] (03Merged) 10jenkins-bot: [BETA CLUSTER] Add Web search experiment quickSurvey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071241 (https://phabricator.wikimedia.org/T373039) (owner: 10Jdrewniak) [13:32:39] (03CR) 10Máté Szabó: [C:03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071613 (owner: 10Máté Szabó) [13:32:55] !log milimetric@deploy1003 Finished deploy [airflow-dags/platform_eng@574f0de]: (no justification provided) (duration: 00m 26s) [13:33:36] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071613 (owner: 10Máté Szabó) 
[13:34:03] hnowlan: Any success in looking? Should I try to deploy again? [13:34:50] (03PS3) 10Jforrester: Revert "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071566 (https://phabricator.wikimedia.org/T66315) (owner: 10Seanleong-wmde) [13:34:53] !log sudo cumin "A:dnsbox" 'disable-puppet "merging CR 1071616"' [13:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:55] (03PS4) 10Jforrester: Revert "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071566 (https://phabricator.wikimedia.org/T66315) (owner: 10Seanleong-wmde) [13:36:02] (03CR) 10Ssingh: [V:03+1 C:03+2] ntp: standardize the use of ntpsec in the configuration as well [puppet] - 10https://gerrit.wikimedia.org/r/1071616 (owner: 10Ssingh) [13:36:25] James_F: no joy so far, but worth retrying [13:36:29] Ack. [13:36:38] I'll re-try [13:36:53] Oh, sorry, already pulled the trigger. [13:36:54] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1071251|Define wgCheckUserCentralIndexRangesToExclude to exclude WMCS (T373021)]], [[gerrit:1071253|tests: Disable all Beta Cluster CI testing, all failing (T374242)]], [[gerrit:1071254|Don't pass empty type/returnType to zobject lookup when undefined (T374199)]], [[gerrit:1071265|Use default width/height on gallery to avoid parser instance (T374146 [13:36:54] )]] [13:37:03] Np. 
[13:37:03] T373021: Write to cuci_user table when CheckUser actions occur - https://phabricator.wikimedia.org/T373021 [13:37:04] T374242: Beta Cluster orchestrator / evaluator broken, blocking WikiLambda CI (and use of Beta Cluster) - https://phabricator.wikimedia.org/T374242 [13:37:04] T374199: Identities and Types are missing from the Object selector results so References to them cannot be used in new tests or implementations - https://phabricator.wikimedia.org/T374199 [13:37:05] T374146: PHP Deprecated: Use of ImageGalleryBase::setHeights without parser was deprecated in MediaWiki 1.43. [Called from MediaWiki\Extension\UploadWizard\CampaignPageFormatter::generateReadHtml] - https://phabricator.wikimedia.org/T374146 [13:37:27] Probably best to try them all at once anyway for time [13:37:35] !log mszabo@deploy1003 helmfile [staging] START helmfile.d/services/ipoid: apply [13:37:46] scap will do all of them at once regardless of the command. [13:37:49] So… yes. :-) [13:38:20] !log mszabo@deploy1003 helmfile [staging] DONE helmfile.d/services/ipoid: apply [13:39:26] !log mszabo@deploy1003 helmfile [eqiad] START helmfile.d/services/ipoid: apply [13:39:54] !log mszabo@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [13:40:26] !log jforrester@deploy1003 dreamyjazz, jforrester: Backport for [[gerrit:1071251|Define wgCheckUserCentralIndexRangesToExclude to exclude WMCS (T373021)]], [[gerrit:1071253|tests: Disable all Beta Cluster CI testing, all failing (T374242)]], [[gerrit:1071254|Don't pass empty type/returnType to zobject lookup when undefined (T374199)]], [[gerrit:1071265|Use default width/height on gallery to avoid parser instance (T374146) [13:40:27] ]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:40:39] !log mszabo@deploy1003 helmfile [codfw] START helmfile.d/services/ipoid: apply [13:41:26] Well, damn. ssh disconnect and I wasn't in a screen. 
[13:41:56] !log mszabo@deploy1003 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [13:42:26] Uh oh [13:42:31] shouldn't that be no longer possible with https://phabricator.wikimedia.org/T361724 ? [13:42:43] I've done that before and it's not fun... [13:42:53] (03PS1) 10Ssingh: ntp: update path for driftfile [puppet] - 10https://gerrit.wikimedia.org/r/1071618 [13:43:55] (03CR) 10Btullis: [C:03+1] "Looks good to me, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1071617 (https://phabricator.wikimedia.org/T330153) (owner: 10Cathal Mooney) [13:43:58] jayme: And yet`/var/lib/scap/scap/bin/python3 /usr/bin/scap sync-world` is still running. [13:44:05] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3917/co" [puppet] - 10https://gerrit.wikimedia.org/r/1071618 (owner: 10Ssingh) [13:44:24] I could kill the process and re-start? It'll be waiting for me to manually approve after the test-servers anyway. [13:44:26] (03CR) 10Ssingh: "(Do run PCC once to see the change in bird config :))" [puppet] - 10https://gerrit.wikimedia.org/r/1071617 (https://phabricator.wikimedia.org/T330153) (owner: 10Cathal Mooney) [13:44:58] (03CR) 10Ssingh: [V:03+1 C:03+2] ntp: update path for driftfile [puppet] - 10https://gerrit.wikimedia.org/r/1071618 (owner: 10Ssingh) [13:45:07] James_F: can't tell what happens when killing scap, sorry 🤷 [13:45:21] I'll do it, as there's nothing else I can do. [13:45:44] !log kill 4135240 # scap thread with no attached screen [13:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:48] * James_F sighs. [13:45:54] OK, let's go again, this time inside a screen. 
[13:45:59] but if it did not warn you about missing screen/tmux, maybe reopen the task :/ [13:46:09] (03PS2) 10Cathal Mooney: Manually define BGP neighbors for cephosd1*** Anycast BGP [puppet] - 10https://gerrit.wikimedia.org/r/1071617 (https://phabricator.wikimedia.org/T330153) [13:46:31] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1071251|Define wgCheckUserCentralIndexRangesToExclude to exclude WMCS (T373021)]], [[gerrit:1071253|tests: Disable all Beta Cluster CI testing, all failing (T374242)]], [[gerrit:1071254|Don't pass empty type/returnType to zobject lookup when undefined (T374199)]], [[gerrit:1071265|Use default width/height on gallery to avoid parser instance (T374146 [13:46:31] )]] [13:46:38] T373021: Write to cuci_user table when CheckUser actions occur - https://phabricator.wikimedia.org/T373021 [13:46:38] T374242: Beta Cluster orchestrator / evaluator broken, blocking WikiLambda CI (and use of Beta Cluster) - https://phabricator.wikimedia.org/T374242 [13:46:39] T374199: Identities and Types are missing from the Object selector results so References to them cannot be used in new tests or implementations - https://phabricator.wikimedia.org/T374199 [13:46:39] T374146: PHP Deprecated: Use of ImageGalleryBase::setHeights without parser was deprecated in MediaWiki 1.43. 
[Called from MediaWiki\Extension\UploadWizard\CampaignPageFormatter::generateReadHtml] - https://phabricator.wikimedia.org/T374146 [13:46:41] !log mszabo@deploy1003 helmfile [eqiad] START helmfile.d/services/ipoid: sync [13:46:50] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough and A:wikidough [13:47:04] !log mszabo@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ipoid: sync [13:47:12] (03PS1) 10Brouberol: airflow: add missing configuration allowing it to read connnections from disk [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071619 (https://phabricator.wikimedia.org/T373026) [13:47:47] the mux requirement was merged but not enabled it seems :( https://gerrit.wikimedia.org/r/c/operations/puppet/+/1066821 [13:47:49] (03PS2) 10Brouberol: airflow: add missing configuration allowing it to read connnections from disk [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071619 (https://phabricator.wikimedia.org/T373026) [13:48:10] (03CR) 10Btullis: [C:03+1] airflow: add missing configuration allowing it to read connnections from disk [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071619 (https://phabricator.wikimedia.org/T373026) (owner: 10Brouberol) [13:48:21] (03PS3) 10Cathal Mooney: Manually define BGP neighbors for cephosd1*** Anycast BGP [puppet] - 10https://gerrit.wikimedia.org/r/1071617 (https://phabricator.wikimedia.org/T330153) [13:48:31] jayme: Commented; it's not yet enforced, per the task. 
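Editorial sketch, not part of the channel log: the interrupted sync above came from running a long command outside screen/tmux. Below is a minimal pre-flight check, assuming only that GNU screen exports `STY` and tmux exports `TMUX` to the processes they start. The function name is hypothetical, and this is not a description of how scap's own (not yet enforced) check works.

```python
import os


def in_terminal_multiplexer(environ=None):
    """Return True if the environment looks like a screen or tmux session.

    Assumption: GNU screen sets STY and tmux sets TMUX for child processes.
    Illustrative helper only; not scap's actual implementation.
    """
    env = os.environ if environ is None else environ
    return bool(env.get("STY") or env.get("TMUX"))


if __name__ == "__main__":
    if in_terminal_multiplexer():
        print("multiplexer detected; a dropped ssh session should not kill the job")
    else:
        print("no screen/tmux detected; consider starting one before a long sync")
```

A wrapper could call this before a long-running deploy and either warn or refuse to continue; which of the two is appropriate is a policy choice the log leaves open.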
[13:48:45] oof...okay, thanks [13:49:35] jouncebot: now and next [13:49:36] For the next 0 hour(s) and 10 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240909T1300) [13:49:44] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow [13:49:55] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow (duration: 00m 11s) [13:50:23] godog: We'll likely go long with the backport window. [13:50:31] !log jforrester@deploy1003 dreamyjazz, jforrester: Backport for [[gerrit:1071251|Define wgCheckUserCentralIndexRangesToExclude to exclude WMCS (T373021)]], [[gerrit:1071253|tests: Disable all Beta Cluster CI testing, all failing (T374242)]], [[gerrit:1071254|Don't pass empty type/returnType to zobject lookup when undefined (T374199)]], [[gerrit:1071265|Use default width/height on gallery to avoid parser instance (T374146) [13:50:31] ]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:50:35] (03PS3) 10Brouberol: airflow: broaden collected metrics and tag them correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071213 (https://phabricator.wikimedia.org/T369098) [13:50:37] James_F: ack thanks, would you mind pinging me once done? [13:50:38] (03CR) 10Brouberol: [C:03+2] airflow: add missing configuration allowing it to read connnections from disk [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071619 (https://phabricator.wikimedia.org/T373026) (owner: 10Brouberol) [13:50:43] godog: Will do! 
[13:50:47] cheers [13:52:07] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071617 (https://phabricator.wikimedia.org/T330153) (owner: 10Cathal Mooney) [13:52:16] !log jforrester@deploy1003 dreamyjazz, jforrester: Continuing with sync [13:53:12] (03PS4) 10Brouberol: airflow: broaden collected metrics and tag them correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071213 (https://phabricator.wikimedia.org/T369098) [13:54:26] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2104.codfw.wmnet with OS bullseye [13:54:41] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10129966 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-worker2104.codfw.wmnet with OS bullseye co... [13:55:04] !log homer cr*codfw* commit 'T372878' [13:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:07] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [13:55:21] !log homer lsw1-b6-codfw* commit 'T372878' [13:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:58] OK, we've made it past the canaries. 
[13:56:23] (03CR) 10Btullis: [C:03+1] airflow: broaden collected metrics and tag them correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071213 (https://phabricator.wikimedia.org/T369098) (owner: 10Brouberol) [13:56:45] (03CR) 10Brouberol: [C:03+2] airflow: broaden collected metrics and tag them correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071213 (https://phabricator.wikimedia.org/T369098) (owner: 10Brouberol) [13:58:34] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1071251|Define wgCheckUserCentralIndexRangesToExclude to exclude WMCS (T373021)]], [[gerrit:1071253|tests: Disable all Beta Cluster CI testing, all failing (T374242)]], [[gerrit:1071254|Don't pass empty type/returnType to zobject lookup when undefined (T374199)]], [[gerrit:1071265|Use default width/height on gallery to avoid parser instance (T37414 [13:58:34] 6)]] (duration: 12m 02s) [13:58:42] T373021: Write to cuci_user table when CheckUser actions occur - https://phabricator.wikimedia.org/T373021 [13:58:42] T374242: Beta Cluster orchestrator / evaluator broken, blocking WikiLambda CI (and use of Beta Cluster) - https://phabricator.wikimedia.org/T374242 [13:58:42] T374199: Identities and Types are missing from the Object selector results so References to them cannot be used in new tests or implementations - https://phabricator.wikimedia.org/T374199 [13:58:43] T37414: Debug parameters don't work as expected - https://phabricator.wikimedia.org/T37414 [13:59:33] (03PS1) 10Elukey: role::puppetserver: add profile::configmaster [puppet] - 10https://gerrit.wikimedia.org/r/1071620 (https://phabricator.wikimedia.org/T366355) [14:00:10] (03PS1) 10Btullis: Reduce airflow-analytics log retention from 90 to 60 days [puppet] - 10https://gerrit.wikimedia.org/r/1071621 (https://phabricator.wikimedia.org/T370437) [14:00:13] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on 
A:wikidough and A:wikidough [14:00:19] OK! [14:00:25] Now for the quick config ones. [14:00:28] (03CR) 10Jforrester: [C:03+2] Revert "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071566 (https://phabricator.wikimedia.org/T66315) (owner: 10Seanleong-wmde) [14:00:31] (03PS3) 10Hnowlan: Enable Copyupload-allowed-domains on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070948 (https://phabricator.wikimedia.org/T356241) [14:00:33] (03CR) 10Jforrester: [C:03+2] Enable Copyupload-allowed-domains on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070948 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [14:01:11] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3920/co" [puppet] - 10https://gerrit.wikimedia.org/r/1071621 (https://phabricator.wikimedia.org/T370437) (owner: 10Btullis) [14:01:12] (03Merged) 10jenkins-bot: Revert "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071566 (https://phabricator.wikimedia.org/T66315) (owner: 10Seanleong-wmde) [14:01:47] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2104.codfw.wmnet [14:01:48] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 333, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:01:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2104.codfw.wmnet [14:01:50] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2104.codfw.wmnet [14:02:05] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:02:09] Thanks for deploying my change [14:02:18] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10130008 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by kamila@cumin1002 Renumbering for host wikikube-worker2104.codfw.wmnet com... [14:02:22] Dreamy_Jazz: Happy to help! [14:03:03] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:03:09] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T374249#10130013 (10kamila) [14:03:18] Hi, may I ask whether my changes can be deployed today? Thanks! [14:04:03] seanleong-wmde: Do you mean the patch you scheduled for the current window? I'm deploying it now. [14:04:15] yup, okay, thanks! 
[14:04:38] (03PS4) 10Hnowlan: Enable Copyupload-allowed-domains on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070948 (https://phabricator.wikimedia.org/T356241) [14:04:41] (03CR) 10Jforrester: Enable Copyupload-allowed-domains on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070948 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [14:05:05] FF-only is such a pain sometimes. [14:05:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070948 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [14:05:22] (03Merged) 10jenkins-bot: Enable Copyupload-allowed-domains on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070948 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [14:05:33] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1071566|Revert "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." 
(T66315)]], [[gerrit:1070948|Enable Copyupload-allowed-domains on test2wiki (T356241)]] [14:05:38] T66315: Move "Data item" link outside of sidebar toolbox - https://phabricator.wikimedia.org/T66315 [14:05:38] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241 [14:06:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140 (T371742)', diff saved to https://phabricator.wikimedia.org/P68756 and previous config saved to /var/cache/conftool/dbconfig/20240909-140623-ladsgroup.json [14:06:27] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [14:07:40] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-ntp rolling restart_daemons on A:dnsbox and not P{dns1004* or dns1005*} and A:dnsbox [14:08:18] !log jforrester@deploy1003 seanleong-wmde, jforrester, hnowlan: Backport for [[gerrit:1071566|Revert "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." (T66315)]], [[gerrit:1070948|Enable Copyupload-allowed-domains on test2wiki (T356241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:08:58] seanleong-wmde: Can you confirm on test servers that this fixes the issue? [14:09:14] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 415, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:09:37] Wikitech seems down? [14:09:56] WFM [14:10:05] I get the placeholder site. [14:10:08] WFM [14:10:10] what are you seeing, network down, 500? [14:10:13] Hmm. [14:10:33] It's resolving to 208.80.154.224 locally, which seems right. [14:10:40] I see the placeholder too if I have the mwdebug extension turned on [14:10:44] but it's working otherwise [14:10:51] Oh! Yeah, that's what I'm doing wrongly. [14:10:56] False alarm, sorry all. [14:12:16] I think my change looks good on debug [14:12:22] hnowlan: Thanks! 
[14:13:20] seanleong-wmde: Or I can verify for you? It indeed moves the Wikidata item link back from the "other projects" section on fawiki etc. [14:13:24] !log jforrester@deploy1003 seanleong-wmde, jforrester, hnowlan: Continuing with sync [14:14:39] James_F do you mind verifying it for me? [14:14:42] Oh thank you so much [14:14:56] I can't find the test servers [14:15:53] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/1071617/3922/cephosd1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1071617 (https://phabricator.wikimedia.org/T330153) (owner: 10Cathal Mooney) [14:16:04] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:16:23] seanleong-wmde: Are you using https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_usage ? [14:16:47] (03CR) 10Ssingh: [C:03+1] "multihop => False and related changes in PCC, looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1071617 (https://phabricator.wikimedia.org/T330153) (owner: 10Cathal Mooney) [14:17:45] claime no, but I will try to set it up [14:17:50] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1071566|Revert "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." (T66315)]], [[gerrit:1070948|Enable Copyupload-allowed-domains on test2wiki (T356241)]] (duration: 12m 16s) [14:17:56] T66315: Move "Data item" link outside of sidebar toolbox - https://phabricator.wikimedia.org/T66315 [14:17:56] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241 [14:17:57] OK, we're finally done! [14:18:04] godog: All yours. 
[14:18:15] James_F: tyvm [14:18:39] (03CR) 10Filippo Giunchedi: [C:03+2] mediawiki: port login failures alert from icinga/statsd [alerts] - 10https://gerrit.wikimedia.org/r/1071161 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi) [14:18:44] James_F Thank you so much! [14:18:47] (03CR) 10Filippo Giunchedi: [C:03+2] mediawiki: port account creation failures alert from icinga/statsd [alerts] - 10https://gerrit.wikimedia.org/r/1071165 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi) [14:18:51] James_F: thanks for picking up where WMDE dropped the training ball :) [14:19:18] seanleong-wmde: that's what you need to reach the testservers, once installed in your browser, you can go to any wikimedia project, turn it on, then select k8s-mwdebug as a backend, and it will route your browser's requests to the mw-debug deployment of mediawiki [14:19:26] tarrow: No worries! [14:19:42] (03PS1) 10Arnaudb: mariadb: wipe pc1017 pc2017 [puppet] - 10https://gerrit.wikimedia.org/r/1071623 (https://phabricator.wikimedia.org/T374355) [14:20:56] PROBLEM - people.wikimedia.org requires authentication on people1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:20:57] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:20:59] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2428.codfw.wmnet [14:21:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140', diff saved to https://phabricator.wikimedia.org/P68757 and previous config saved to /var/cache/conftool/dbconfig/20240909-142131-ladsgroup.json [14:21:34] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host 
mw2428.codfw.wmnet [14:21:44] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2429.codfw.wmnet [14:21:46] RECOVERY - people.wikimedia.org requires authentication on people1004 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 586 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:22:06] FIRING: ProbeDown: Service people1004:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:22:23] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2429.codfw.wmnet [14:22:38] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Unified pattern for RemoteHosts accessors in Spicerack - https://phabricator.wikimedia.org/T374073#10130120 (10elukey) p:05Triage→03Medium [14:22:41] claime Ahh, I'll try it out. Thanks! 
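Editorial sketch, not part of the channel log: what claime describes is routing a request to the debug deployment by header. The snippet assumes the WikimediaDebug browser extension works by adding an `X-Wikimedia-Debug` request header whose value has the form `backend=<name>`. Treat both the header name and the value format as assumptions to check against the WikimediaDebug page linked above. The helper name and the example URL are illustrative, and nothing is sent over the network.

```python
from urllib.request import Request


def debug_request(url, backend="k8s-mwdebug"):
    """Build (but do not send) a request carrying the debug routing header.

    Assumption: the header is named X-Wikimedia-Debug and takes a
    "backend=<name>" value; "k8s-mwdebug" is the backend named in the log.
    """
    return Request(url, headers={"X-Wikimedia-Debug": f"backend={backend}"})


req = debug_request("https://test2.wikipedia.org/wiki/Special:Version")
# urllib stores header names via str.capitalize(), hence the lookup key below.
print(req.get_header("X-wikimedia-debug"))
```

Passing the resulting object to `urllib.request.urlopen` would perform the actual request; it is left out here so the sketch stays side-effect free.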
[14:22:51] (03CR) 10Clément Goubert: [C:03+2] kubernetes: Rename two workers [puppet] - 10https://gerrit.wikimedia.org/r/1071583 (https://phabricator.wikimedia.org/T372878) (owner: 10Clément Goubert) [14:23:02] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:24:46] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2428 to wikikube-worker2105 [14:25:05] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [14:26:14] PROBLEM - Juniper virtual chassis ports on asw2-d-eqiad is CRITICAL: CRIT: Down: 2 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [14:27:06] RESOLVED: ProbeDown: Service people1004:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:27:16] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:27:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070347 (https://phabricator.wikimedia.org/T367009) (owner: 10NMW03) [14:27:46] (03PS2) 10NMW03: Update wgSitename for tlywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070347 (https://phabricator.wikimedia.org/T367009) [14:27:48] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, 
AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:28:18] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2428 to wikikube-worker2105 - cgoubert@cumin1002" [14:28:38] (03PS1) 10Ssingh: P:ntp: update check for configuration file changed [puppet] - 10https://gerrit.wikimedia.org/r/1071625 [14:29:14] RECOVERY - Juniper virtual chassis ports on asw2-d-eqiad is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [14:29:47] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3923/co" [puppet] - 10https://gerrit.wikimedia.org/r/1071625 (owner: 10Ssingh) [14:30:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [14:30:21] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2428 to wikikube-worker2105 - cgoubert@cumin1002" [14:30:21] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:30:21] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2105 [14:30:33] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2105 [14:31:13] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2428 to wikikube-worker2105 [14:31:23] 06SRE, 06serviceops: 
Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10130159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2428 to wikikube-worker2105 completed: - mw2428 (**PASS**) - ✔️ Downtime... [14:34:13] (03CR) 10EoghanGaffney: "I tested this by setting `workers: 4` and verified that the number of rest workers increased, so I don't think this is caused by one of th" [puppet] - 10https://gerrit.wikimedia.org/r/1071049 (owner: 10EoghanGaffney) [14:34:33] (03CR) 10Ssingh: [V:03+1 C:03+2] P:ntp: update check for configuration file changed [puppet] - 10https://gerrit.wikimedia.org/r/1071625 (owner: 10Ssingh) [14:34:40] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2429 to wikikube-worker2106 [14:34:59] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [14:35:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [14:36:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140', diff saved to https://phabricator.wikimedia.org/P68758 and previous config saved to /var/cache/conftool/dbconfig/20240909-143638-ladsgroup.json [14:36:55] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new firewalls - https://phabricator.wikimedia.org/T374176#10130193 (10joanna_borun) p:05Triage→03Medium [14:36:58] (03CR) 10Elukey: [V:03+2 
C:03+2] amd-pytorch: change image ownership to ml team [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1071191 (https://phabricator.wikimedia.org/T374233) (owner: 10Ilias Sarantopoulos) [14:37:18] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops-radar, 13Patch-For-Review: Race condition on puppetdb in sre.hosts.rename cookbook - https://phabricator.wikimedia.org/T374351#10130196 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:38:54] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2429 to wikikube-worker2106 - cgoubert@cumin1002" [14:39:16] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2429 to wikikube-worker2106 - cgoubert@cumin1002" [14:39:16] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:39:17] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2106 [14:39:33] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2106 [14:40:12] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2429 to wikikube-worker2106 [14:40:17] 06SRE, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10130215 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2429 to wikikube-worker2106 completed: - mw2429 (**PASS**) - ✔️ Downtime... 
[14:41:25] (03PS1) 10Hnowlan: Enable async uploads on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071628 (https://phabricator.wikimedia.org/T356241) [14:41:38] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops-radar, 13Patch-For-Review: Race condition on puppetdb in sre.hosts.rename cookbook - https://phabricator.wikimedia.org/T374351#10130225 (10Clement_Goubert) Tested via `test-cookbook` on `mw2428` and `mw2429` and they seem to have been correctly remove... [14:42:02] (03CR) 10Cathal Mooney: [C:03+2] Manually define BGP neighbors for cephosd1*** Anycast BGP [puppet] - 10https://gerrit.wikimedia.org/r/1071617 (https://phabricator.wikimedia.org/T330153) (owner: 10Cathal Mooney) [14:43:28] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2105.codfw.wmnet wikikube-worker2106.codfw.wmnet on all recursors [14:43:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2105.codfw.wmnet wikikube-worker2106.codfw.wmnet on all recursors [14:44:12] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.dns.roll-restart-ntp (exit_code=97) rolling restart_daemons on A:dnsbox and not P{dns1004* or dns1005*} and A:dnsbox [14:44:49] !log cgoubert@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2105.codfw.wmnet [14:45:12] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2105.codfw.wmnet with OS bullseye [14:45:18] 06SRE, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10130257 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by cgoubert@cumin1002 Renumbering for host wikikube-worker2105.codfw.wmnet [14:45:22] (03PS1) 10Ilias Sarantopoulos: knative: change images ownership to ml team [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1071630 (https://phabricator.wikimedia.org/T374233) [14:45:22] !log 
cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2105 [14:45:24] 06SRE, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10130258 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2105.codfw.wmnet with OS bullseye [14:45:28] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [14:47:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: asw2-d2-eqid <-> asw2-d4-eqiad vcp link flapping - https://phabricator.wikimedia.org/T374272#10130269 (10CDanis) >>! In T374272#10127785, @cmooney wrote: > Thanks @cdanis and @Southparkfan for the task! > > Logs relate to [[ https://n... [14:49:05] !log sudo cumin -b1 -s300 'A:dnsbox and A:edges' 'systemctl restart ntpsec.service' [14:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:31] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2105 - cgoubert@cumin1002" [14:49:35] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2105 - cgoubert@cumin1002" [14:49:35] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:49:36] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2105.codfw.wmnet 56.16.192.10.in-addr.arpa 6.5.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:49:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2105.codfw.wmnet 56.16.192.10.in-addr.arpa 6.5.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:49:39] (03CR) 10Muehlenhoff: "We already have 
the config-master nodes for this, though?" [puppet] - 10https://gerrit.wikimedia.org/r/1071620 (https://phabricator.wikimedia.org/T366355) (owner: 10Elukey) [14:49:39] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2105 [14:49:49] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2105 [14:49:49] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2105 [14:49:54] !log cgoubert@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2106.codfw.wmnet [14:50:01] 06SRE, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10130275 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by cgoubert@cumin1002 Renumbering for host wikikube-worker2106.codfw.wmnet [14:50:05] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2106.codfw.wmnet with OS bullseye [14:50:12] 06SRE, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10130276 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2106.codfw.wmnet with OS bullseye [14:50:15] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2106 [14:50:27] (03PS2) 10Ilias Sarantopoulos: knative: change images ownership to ml [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1071630 (https://phabricator.wikimedia.org/T374233) [14:51:17] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [14:51:40] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [14:51:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140 (T371742)', diff saved to 
https://phabricator.wikimedia.org/P68759 and previous config saved to /var/cache/conftool/dbconfig/20240909-145145-ladsgroup.json [14:51:49] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [14:51:50] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [14:52:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: asw2-d2-eqid <-> asw2-d4-eqiad vcp link flapping - https://phabricator.wikimedia.org/T374272#10130289 (10cmooney) >>! In T374272#10130269, @CDanis wrote: > The timestamps in the description come from LibreNMS's logs viewer for asw2-d-e... [14:53:17] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [14:53:30] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [14:53:50] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [14:53:58] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [14:54:17] (03CR) 10Cathal Mooney: [C:03+1] "LGTM, outline of what was going wrong makes sense and fix seems logical. Thanks!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1071588 (https://phabricator.wikimedia.org/T374351) (owner: 10Clément Goubert) [14:54:38] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2106 - cgoubert@cumin1002" [14:54:42] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2106 - cgoubert@cumin1002" [14:54:42] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:54:42] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2106.codfw.wmnet 57.16.192.10.in-addr.arpa 7.5.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:54:45] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2106.codfw.wmnet 57.16.192.10.in-addr.arpa 7.5.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:54:46] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2106 [14:54:59] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops-radar, 13Patch-For-Review: Race condition on puppetdb in sre.hosts.rename cookbook - https://phabricator.wikimedia.org/T374351#10130304 (10Clement_Goubert) Correction, it worked for `puppetdb`, but they got added back to `debmonitor`. Will investigate... 
[14:55:03] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2106 [14:55:03] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2106 [14:57:46] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1070592 (https://phabricator.wikimedia.org/T330153) (owner: 10Cathal Mooney) [14:59:45] (03PS1) 10Cathal Mooney: Add definition for cephosd hosts to map to Anycast BGP group [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1071633 (https://phabricator.wikimedia.org/T330153) [15:01:13] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:29] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops-radar, 13Patch-For-Review: Race condition on puppetdb in sre.hosts.rename cookbook - https://phabricator.wikimedia.org/T374351#10130342 (10MoritzMuehlenhoff) >>! In T374351#10130304, @Clement_Goubert wrote: > Correction, it worked for `puppetdb`, but... [15:03:33] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Comm Error: backplane 0 when reimaging wikikube-worker2095 - https://phabricator.wikimedia.org/T374258#10130345 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm @hnowlan since it was unpingable anyway, I power cycled the server and reseated the cable b... [15:04:43] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2198 - https://phabricator.wikimedia.org/T374095#10130363 (10Jhancock.wm) confirmed drive was issued. should arrive today or tomorrow. 
[15:06:06] (03CR) 10Cathal Mooney: [C:03+2] Add definition for cephosd hosts to map to Anycast BGP group [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1071633 (https://phabricator.wikimedia.org/T330153) (owner: 10Cathal Mooney) [15:06:49] FIRING: HelmReleaseBadStatus: Helm release airflow-test-k8s/production on k8s-dse@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-test-k8s - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:08:58] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2095.codfw.wmnet with OS bullseye [15:09:05] 06SRE, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10130387 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-worker2095.codfw.wmnet with OS bullseye [15:09:16] !log installing imagemagick security updates [15:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:47] (03CR) 10Arnaudb: "Given where we are on this cookbook, I think this would be easier to do a second iteration. 
If you're ok with that, I'll create a separate" [cookbooks] - 10https://gerrit.wikimedia.org/r/1063167 (https://phabricator.wikimedia.org/T363665) (owner: 10Arnaudb) [15:11:49] RESOLVED: HelmReleaseBadStatus: Helm release airflow-test-k8s/production on k8s-dse@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-test-k8s - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:13:23] (03PS2) 10Elukey: role::puppetserver: add profile::configmaster [puppet] - 10https://gerrit.wikimedia.org/r/1071620 (https://phabricator.wikimedia.org/T366355) [15:13:23] (03PS1) 10Elukey: role::puppetmaster::frontend: add magru to the config-master aliases [puppet] - 10https://gerrit.wikimedia.org/r/1071637 [15:14:27] (03PS1) 10Muehlenhoff: Rebuild against latest package versions in bookworm: [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1071638 [15:14:54] (03CR) 10CI reject: [V:04-1] Rebuild against latest package versions in bookworm: [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1071638 (owner: 10Muehlenhoff) [15:14:56] (03CR) 10Hnowlan: [C:03+1] Rebuild against latest package versions in bookworm: [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1071638 (owner: 10Muehlenhoff) [15:15:55] (03CR) 10Hnowlan: Rebuild against latest package versions in bookworm: (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1071638 (owner: 10Muehlenhoff) [15:17:00] (03PS3) 10Elukey: role::puppetserver: add profile::configmaster [puppet] - 10https://gerrit.wikimedia.org/r/1071620 (https://phabricator.wikimedia.org/T366355) [15:17:00] (03PS2) 10Elukey: role::puppetmaster::frontend: add magru to the config-master aliases [puppet] - 10https://gerrit.wikimedia.org/r/1071637 [15:20:38] (03PS1) 10Arnaudb: mariadb: productionize db2237 [puppet] - 
10https://gerrit.wikimedia.org/r/1071639 (https://phabricator.wikimedia.org/T373579) [15:26:44] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2095.codfw.wmnet with reason: host reimage [15:26:47] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on wikikube-worker2095.codfw.wmnet with reason: host reimage [15:27:08] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on asw2-d-eqiad with reason: repalce vcp link from d2 port 51 to d4 port 52 [15:27:19] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:27:22] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw2-d-eqiad with reason: repalce vcp link from d2 port 51 to d4 port 52 [15:27:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: asw2-d2-eqid <-> asw2-d4-eqiad vcp link flapping - https://phabricator.wikimedia.org/T374272#10130497 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=81e99a80-f593-4494-a565-ea730a19fbc7) set by cmooney@cumin1002 fo... [15:27:57] (03CR) 10Ladsgroup: [C:03+2] mariadb: productionize db2237 [puppet] - 10https://gerrit.wikimedia.org/r/1071639 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [15:28:06] (03CR) 10Ladsgroup: [C:03+1] "oopsie" [puppet] - 10https://gerrit.wikimedia.org/r/1071639 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [15:28:11] sukhe: above was durum1001 - not sure if that is expected? [15:28:18] (bgp alert that is) [15:28:23] session is back up now [15:29:32] (03CR) 10Ladsgroup: "I honestly don't know how we wipe a server, is this the right way or not. Maybe Jaime would know?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1071623 (https://phabricator.wikimedia.org/T374355) (owner: 10Arnaudb) [15:30:04] jan_drewniak: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240909T1530). [15:30:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:30:55] topranks: not expected, was probably flapping again [15:30:56] checking [15:31:11] ok [15:31:18] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:31:27] there it is again [15:31:30] yeah [15:31:37] Sep 09 15:30:58 durum1001 bird[328110]: bgp2: Received: Unknown error 6.9: 060a [15:31:58] (03PS4) 10Elukey: role::puppetserver: add profile::configmaster [puppet] - 10https://gerrit.wikimedia.org/r/1071620 (https://phabricator.wikimedia.org/T366355) [15:31:58] (03PS3) 10Elukey: role::puppetmaster::frontend: add magru to the config-master aliases [puppet] - 10https://gerrit.wikimedia.org/r/1071637 [15:32:01] hmm [15:32:03] ping works [15:32:06] cmooney@re1.cr1-eqiad> ping 10.64.16.20 source 208.80.154.196 [15:32:06] PING 10.64.16.20 (10.64.16.20): 56 data bytes [15:32:06] 64 bytes from 10.64.16.20: icmp_seq=0 ttl=64 time=2.000 ms [15:32:06] 64 bytes from 10.64.16.20: icmp_seq=1 ttl=64 time=2.853 ms [15:32:21] !log restart bird on durum1001 [15:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:41] let's see [15:33:52] 06SRE, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - 
https://phabricator.wikimedia.org/T372878#10130550 (10jijiki) [15:34:11] jouncebot: nowandnext [15:34:11] For the next 0 hour(s) and 25 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240909T1530) [15:34:11] In 1 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240909T1700) [15:34:11] In 1 hour(s) and 25 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240909T1700) [15:36:20] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:37:07] (03CR) 10JHathaway: [C:03+1] mx: Enable profile::auto_restarts::service for rspamd [puppet] - 10https://gerrit.wikimedia.org/r/1071564 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:37:19] topranks: you are already logged in to cr1-eqiad, is it durum1001? 
[15:37:46] (03CR) 10JHathaway: [C:03+1] P:mirrors::debian Export mirror age to textfile exporter [puppet] - 10https://gerrit.wikimedia.org/r/1003442 (owner: 10Slyngshede) [15:37:57] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2105.codfw.wmnet with reason: host reimage [15:38:18] (03CR) 10JHathaway: [C:03+2] vrts_aliases: add retry logic [puppet] - 10https://gerrit.wikimedia.org/r/1070671 (https://phabricator.wikimedia.org/T368257) (owner: 10JHathaway) [15:38:53] doh1002 hmm [15:39:06] Sep 09 14:24:12 doh1002 bird[1386067]: bfd1: Bad packet from 2620:0:861:ffff::1 - unknown session id (2784991672) [15:39:12] it flapped but still [15:39:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: asw2-d2-eqid <-> asw2-d4-eqiad vcp link flapping - https://phabricator.wikimedia.org/T374272#10130595 (10cmooney) Ok link was replaced: ` Sep 9 15:36:56 asw2-d-eqiad vccpd[2257]: VCCPD_PROTOCOL_INTF_STATE_CHANGED: Member 4, interface... [15:40:03] !log cmooney@cumin1002 START - Cookbook sre.hosts.remove-downtime for asw2-d-eqiad [15:40:03] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for asw2-d-eqiad [15:41:04] PROBLEM - Host wikikube-worker2095 is DOWN: PING CRITICAL - Packet loss = 100% [15:41:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: asw2-d2-eqid <-> asw2-d4-eqiad vcp link flapping - https://phabricator.wikimedia.org/T374272#10130602 (10VRiley-WMF) Thank you! I appreciate it. Will be relabeling the new cable as 0325. Feel free to reach out if anything else happens. 
[15:41:13] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [15:41:19] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2106.codfw.wmnet with reason: host reimage [15:41:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2105.codfw.wmnet with reason: host reimage [15:42:44] RECOVERY - Host wikikube-worker2095 is UP: PING OK - Packet loss = 0%, RTA = 30.41 ms [15:44:08] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2106.codfw.wmnet with reason: host reimage [15:44:45] (03CR) 10Elukey: "We do yes, but the following two links are mod-proxy/proxied to puppetmaster1001 atm:" [puppet] - 10https://gerrit.wikimedia.org/r/1071620 (https://phabricator.wikimedia.org/T366355) (owner: 10Elukey) [15:46:00] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2095.codfw.wmnet with OS bullseye [15:46:14] 06SRE, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10130609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker2095.codfw.wmnet with OS bullseye completed: - wikikube-w... 
[15:47:11] !log cmooney@cumin1002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Release v0.7.0 plugin update for cephosd bgp - cmooney@cumin1002 [15:48:18] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mw2260.codfw.wmnet, mw2267.codfw.wmnet - https://phabricator.wikimedia.org/T374018#10130610 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:48:26] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2095.codfw.wmnet [15:48:46] !log hnowlan@cumin1002 END (ERROR) - Cookbook sre.k8s.pool-depool-node (exit_code=97) pool for host wikikube-worker2095.codfw.wmnet [15:49:18] (03PS5) 10Elukey: role::puppetserver: add profile::configmaster [puppet] - 10https://gerrit.wikimedia.org/r/1071620 (https://phabricator.wikimedia.org/T366355) [15:49:18] (03PS4) 10Elukey: role::puppetmaster::frontend: add magru to the config-master aliases [puppet] - 10https://gerrit.wikimedia.org/r/1071637 [15:49:33] (03CR) 10JHathaway: [C:03+2] puppet8: avoid relying on g10k::config_file being defined [puppet] - 10https://gerrit.wikimedia.org/r/1071038 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [15:50:17] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3928/co" [puppet] - 10https://gerrit.wikimedia.org/r/1071620 (https://phabricator.wikimedia.org/T366355) (owner: 10Elukey) [15:51:48] (03CR) 10Elukey: [V:03+1] "And from https://puppet-compiler.wmflabs.org/output/1071620/3928/puppetserver1001.eqiad.wmnet/index.html it seems that it really deploys a" [puppet] - 10https://gerrit.wikimedia.org/r/1071620 (https://phabricator.wikimedia.org/T366355) (owner: 10Elukey) [15:53:54] !log cmooney@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Release v0.7.0 
plugin update for cephosd bgp - cmooney@cumin1002 [15:58:58] PROBLEM - BFD status on lsw1-f2-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:01:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2105.codfw.wmnet with OS bullseye [16:02:02] 06SRE, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10130710 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2105.codfw.wmnet with OS bullseye completed: - wikikube-... [16:03:38] !log homer cr*codfw* commit 'T372878' [16:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:41] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [16:04:03] !log homer lsw1-b6-codfw* commit 'T372878' [16:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:50] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow [16:05:01] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow (duration: 00m 10s) [16:05:11] (03PS1) 10Bking: flink-app/rdf-streaming-updater: add calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071648 (https://phabricator.wikimedia.org/T373195) [16:05:18] (03CR) 10CI reject: [V:04-1] flink-app/rdf-streaming-updater: add calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071648 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [16:05:21] !log homer lsw1-b5-codfw* commit [16:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:05:27] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Q1:codfw:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371434#10130724 (10Papaul) @cmooney thanks for the feedback. The discussion about not using virtual-chassis was it a final decision or just something that i... [16:05:42] (03PS2) 10Bking: flink-app/rdf-streaming-updater: add calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071648 (https://phabricator.wikimedia.org/T373195) [16:07:25] (03CR) 10Clément Goubert: [C:03+1] Enable async uploads on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071628 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [16:07:46] claime: you might catch a change for wikikube-worker2095 on your cr*codfw* run [16:07:54] hnowlan: ack [16:09:10] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2095.codfw.wmnet [16:09:12] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2095.codfw.wmnet [16:09:19] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2105.codfw.wmnet [16:09:21] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2105.codfw.wmnet [16:09:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2105.codfw.wmnet [16:10:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[16:10:29] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2106.codfw.wmnet with OS bullseye [16:11:05] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T374249#10130741 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:11:07] 06SRE, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10130746 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumbering for host wikikube-worker2105.codfw.wmnet completed: - wikikube-w... [16:11:08] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 329, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:11:10] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10LDAP-Access-Requests: Requesting access to `contint-admins`, `contint-docker`, LDAP `ciadmin` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10130747 (10Dzahn) [16:11:16] 06SRE, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10130753 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2106.codfw.wmnet with OS bullseye completed: - wikikube-... 
[16:11:22] (03CR) 10Alexandros Kosiaris: [C:03+1] Enable async uploads on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071628 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [16:11:27] hnowlan: fwiw, I didn't see either wikikube-worker2095 or kubernetes2031 in the run [16:11:40] ack - I think I might have hit it already last week [16:11:43] (at least for cr1) [16:12:29] jouncebot: nowandnext [16:12:29] No deployments scheduled for the next 0 hour(s) and 47 minute(s) [16:12:29] In 0 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240909T1700) [16:12:29] In 0 hour(s) and 47 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240909T1700) [16:13:17] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm [16:13:36] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [16:13:48] I'm going to deploy a config change affecting only test2wiki [16:14:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hnowlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071628 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [16:14:57] (03Merged) 10jenkins-bot: Enable async uploads on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071628 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [16:15:02] RECOVERY - BFD status on lsw1-f2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:15:08] !log hnowlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1071628|Enable async uploads on test2wiki (T356241)]] [16:15:13] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241 [16:16:00] PROBLEM - PyBal 
backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers mw2424.codfw.wmnet, mw2396.codfw.wmnet, wikikube-worker2033.codfw.wmnet, parse2017.codfw.wmnet, parse2009.codfw.wmnet, wikikube-worker2076.codfw.wmnet, kubernetes2059.codfw.wmnet, wikikube-worker2040.codfw.wmnet, wikikube-worker2022.codfw.wmnet, parse2020.codfw.wmnet, wikikube-worker2030.codfw.wmnet, wikikube-worker2043 [16:16:00] mnet, wikikube-worker2065.codfw.wmnet, wikikube-worker2023.codfw.wmnet, wikikube-worker2041.codfw.wmnet, wikikube-worker2088.codfw.wmnet, wikikube-worker2002.codfw.wmnet, mw2302.codfw.wmnet, mw2353.codfw.wmnet, mw2449.codfw.wmnet, wikikube-worker2045.codfw.wmnet, mw2314.codfw.wmnet, kubernetes2022.codfw.wmnet, wikikube-worker2098.codfw.wmnet, wikikube-worker2105.codfw.wmnet, kubernetes2013.codfw.wmnet, parse2012.codfw.wmnet, kubernetes203 [16:16:00] wmnet, wikikube-worker2075.codfw.wmnet, wikikube-worker2018.codfw.wmnet, wikikube-worker2048.codfw.wmnet, mw2412.codfw.wmnet, mw2426.codfw.wmnet, mw2371.codfw.wmnet, wikikube-worker2100 https://wikitech.wikimedia.org/wiki/PyBal [16:16:30] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Strict mode enabled by default - https://phabricator.wikimedia.org/T372664#10130776 (10jhathaway) [16:16:32] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Strict mode enabled by default - https://phabricator.wikimedia.org/T372664#10130780 (10jhathaway) [16:16:44] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 411, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:17:00] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:17:35] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm [16:17:50] !log cgoubert@cumin1002 START - Cookbook 
sre.k8s.pool-depool-node pool for host wikikube-worker2106.codfw.wmnet [16:17:52] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2106.codfw.wmnet [16:17:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2106.codfw.wmnet [16:18:15] (03CR) 10Jcrespo: "I see no obvious mistake here, but I am not familiar with the usual wmf db procedures. For example, I know this doesn't pool the host auto" [puppet] - 10https://gerrit.wikimedia.org/r/1071639 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [16:19:11] !log hnowlan@deploy1003 hnowlan: Backport for [[gerrit:1071628|Enable async uploads on test2wiki (T356241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:19:12] 06SRE, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10130803 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumbering for host wikikube-worker2106.codfw.wmnet completed: - wikikube-w... 
[16:20:11] !log cgoubert@cumin1002 START - Cookbook sre.debmonitor.remove-hosts for 2 hosts: mw[2428-2429].codfw.wmnet [16:20:12] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 2 hosts: mw[2428-2429].codfw.wmnet [16:20:24] 06SRE, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10130807 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by cgoubert: for 2 hosts: mw[2428-2429].codfw.wmnet [16:20:33] (03CR) 10Ssingh: puppet8: remove ssl_keystore_location, always set ssl_key_password (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1065283 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [16:20:37] !log hnowlan@deploy1003 hnowlan: Continuing with sync [16:20:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071202 (https://phabricator.wikimedia.org/T374180) (owner: 10Jforrester) [16:22:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070354 (https://phabricator.wikimedia.org/T371020) (owner: 10Jdlrobson) [16:24:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: (2) new singlemode fiber patches from dmarc to routers for IX ports - https://phabricator.wikimedia.org/T373376#10130820 (10RobH) IRC Update: All DC Ops related items are complete and Cathal is currently working with EQ to schedule a... 
[16:24:39] (03CR) 10Jcrespo: "Again, I may not know everything about the expected practices, but as long as this is just a function change, but it doesn't change hands " [puppet] - 10https://gerrit.wikimedia.org/r/1071623 (https://phabricator.wikimedia.org/T374355) (owner: 10Arnaudb) [16:25:13] (03CR) 10JHathaway: puppet8: remove ssl_keystore_location, always set ssl_key_password (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1065283 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [16:25:14] PROBLEM - BFD status on lsw1-e1-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:25:15] (03CR) 10Jcrespo: [C:03+1] mariadb: wipe pc1017 pc2017 [puppet] - 10https://gerrit.wikimedia.org/r/1071623 (https://phabricator.wikimedia.org/T374355) (owner: 10Arnaudb) [16:26:20] !log hnowlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1071628|Enable async uploads on test2wiki (T356241)]] (duration: 11m 11s) [16:26:24] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241 [16:31:42] (03PS2) 10Clément Goubert: sre.hosts.rename: Disable puppet and debmonitor [cookbooks] - 10https://gerrit.wikimedia.org/r/1071588 (https://phabricator.wikimedia.org/T374351) [16:32:47] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Strict mode enabled by default - https://phabricator.wikimedia.org/T372664#10130924 (10jhathaway) [16:33:42] PROBLEM - Host mw2431 is DOWN: PING CRITICAL - Packet loss = 100% [16:34:21] (03CR) 10Eevans: "I'm kind of confused here. To be clear: I'm (at least) as confused by the *before* state, as I am the *after*." 
[puppet] - 10https://gerrit.wikimedia.org/r/1071020 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [16:34:32] 06SRE, 06Infrastructure-Foundations, 10netops: BFD won't esablish between QFX in VRF and host from IPv6 link-local - https://phabricator.wikimedia.org/T374379 (10cmooney) 03NEW p:05Triage→03Low [16:35:03] (03PS1) 10Cathal Mooney: Change config for cephosd's in eqiad to peer with switch global addr [puppet] - 10https://gerrit.wikimedia.org/r/1071652 (https://phabricator.wikimedia.org/T374379) [16:37:38] (03CR) 10Brouberol: "The changes look good, but you're missing the module name and version from package.json and package.lock. You need https://gitlab.wikimedi" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071648 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [16:38:12] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T374380 (10Clement_Goubert) 03NEW [16:39:04] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers mw2424.codfw.wmnet, wikikube-worker2021.codfw.wmnet, mw2396.codfw.wmnet, parse2001.codfw.wmnet, wikikube-worker2033.codfw.wmnet, parse2017.codfw.wmnet, wikikube-worker2086.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2017.codfw.wmnet, mw2375.codfw.wmnet, mw2338.codfw.wmnet, mw2447.codfw.wmnet, mw2370.cod [16:39:04] , wikikube-worker2084.codfw.wmnet, wikikube-worker2099.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2048.codfw.wmnet, parse2003.codfw.wmnet, wikikube-worker2077.codfw.wmnet, wikikube-worker2076.codfw.wmnet, kubernetes2059.codfw.wmnet, parse2018.codfw.wmnet, mw2315.codfw.wmnet, wikikube-worker2071.codfw.wmnet, parse2004.codfw.wmnet, wikikube-worker2010.codfw.wmnet, mw2337.codfw.wmnet, mw2351.codfw.wmnet, wikikube-worker2022.codfw.wmn [16:39:04] 27.codfw.wmnet, parse2020.codfw.wmnet, wikikube-worker2039.codfw.wmnet, 
wikikube-worker2027.codfw.wmnet, kubernetes2038.codfw.wmnet, wikikube-worker2030.codfw.wmnet, mw2419.codfw.wmnet, https://wikitech.wikimedia.org/wiki/PyBal [16:39:20] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers kubernetes2046.codfw.wmnet, wikikube-worker2079.codfw.wmnet, parse2001.codfw.wmnet, wikikube-worker2033.codfw.wmnet, parse2017.codfw.wmnet, wikikube-worker2086.codfw.wmnet, parse2006.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2017.codfw.wmnet, mw2375.codfw.wmnet, wikiku [16:39:20] r2036.codfw.wmnet, mw2447.codfw.wmnet, wikikube-worker2084.codfw.wmnet, wikikube-worker2099.codfw.wmnet, kubernetes2014.codfw.wmnet, mw2443.codfw.wmnet, kubernetes2048.codfw.wmnet, parse2003.codfw.wmnet, wikikube-worker2040.codfw.wmnet, parse2018.codfw.wmnet, wikikube-worker2083.codfw.wmnet, mw2315.codfw.wmnet, parse2004.codfw.wmnet, wikikube-worker2044.codfw.wmnet, mw2351.codfw.wmnet, mw2427.codfw.wmnet, parse2020.codfw.wmnet, wikikube-w [16:39:20] 7.codfw.wmnet, wikikube-worker2082.codfw.wmnet, wikikube-worker2030.codfw.wmnet, mw2419.codfw.wmnet, wikikube-worker2043.codfw.wmnet, wikikube-worker2096.codfw.wmnet, wikikube-worker209 https://wikitech.wikimedia.org/wiki/PyBal [16:39:27] (03PS3) 10Clément Goubert: sre.hosts.rename: Disable puppet and debmonitor [cookbooks] - 10https://gerrit.wikimedia.org/r/1071588 (https://phabricator.wikimedia.org/T374351) [16:39:36] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for Philippe Saade - https://phabricator.wikimedia.org/T374008#10130945 (10Ladsgroup) a:05andrea.denisse→03Ladsgroup This week's clinic duty taking over. Waiting for NDA confirmation. 
[16:40:13] (03CR) 10Cathal Mooney: [C:03+2] Change config for cephosd's in eqiad to peer with switch global addr [puppet] - 10https://gerrit.wikimedia.org/r/1071652 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [16:40:39] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10LDAP-Access-Requests: Requesting access to `contint-admins`, `contint-docker`, LDAP `ciadmin` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10130953 (10Ladsgroup) a:05andrea.denisse→03Ladsgroup This week's clinic... [16:41:04] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:41:20] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:42:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [16:42:37] (03PS1) 10Ebernhardson: cirrus: Fix cloudelastic saneitizer, and enable private wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071654 [16:46:00] (03CR) 10Ssingh: puppet8: remove ssl_keystore_location, always set ssl_key_password (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1065283 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [16:46:14] RECOVERY - BFD status on lsw1-e1-eqiad.mgmt is OK: UP: 3 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:46:16] (03CR) 10DCausse: "should the chart version be updated?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1071648 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [16:47:06] PROBLEM - BFD status on lsw1-f2-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:47:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [16:48:27] (03CR) 10DCausse: [C:03+1] cirrus: Fix cloudelastic saneitizer, and enable private wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071654 (owner: 10Ebernhardson) [16:49:14] (03PS2) 10Stoyofuku-wmf: Release donate link to pilot wikis (French Wikipedia and Wikifunctions) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071281 (https://phabricator.wikimedia.org/T373585) [16:49:14] PROBLEM - BFD status on lsw1-e1-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:51:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:52:20] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Manuel Merz (WMDE) out of all services on: 677 hosts [16:52:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Manuel Merz (WMDE) out of all services on: 677 hosts [16:53:03] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Manuel Merz (WMDE) out of all services on: 1552 hosts [16:53:41] (03CR) 10RLazarus: [C:03+2] sre.switchdc.mediawiki: Wait for k8s maintenance jobs to stop [cookbooks] - 10https://gerrit.wikimedia.org/r/1070673 
(https://phabricator.wikimedia.org/T359130) (owner: 10RLazarus) [16:54:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Manuel Merz (WMDE) out of all services on: 1552 hosts [16:55:54] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new firewalls - https://phabricator.wikimedia.org/T374176#10131007 (10Papaul) [16:56:57] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373591#10131008 (10Jhancock.wm) mw2379 is causing an alert in netbox https://netbox.wikimedia.org/extras/scripts/results/89809/ [16:57:38] (03CR) 10Ebernhardson: [C:03+2] cirrus: Fix cloudelastic saneitizer, and enable private wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071654 (owner: 10Ebernhardson) [16:58:01] (03PS2) 10Muehlenhoff: Rebuild against latest package versions in bookworm: [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1071638 [16:58:28] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T374249#10131011 (10Jhancock.wm) mw2431 is causing an alert in netbox https://netbox.wikimedia.org/extras/scripts/results/89809/ [16:58:33] (03Merged) 10jenkins-bot: cirrus: Fix cloudelastic saneitizer, and enable private wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071654 (owner: 10Ebernhardson) [16:58:37] (03CR) 10Muehlenhoff: Rebuild against latest package versions in bookworm: (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1071638 (owner: 10Muehlenhoff) [16:59:49] (03PS3) 10Bking: flink-app/rdf-streaming-updater: add calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071648 (https://phabricator.wikimedia.org/T373195) [16:59:57] (03CR) 10CI reject: [V:04-1] flink-app/rdf-streaming-updater: add calico network policies [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1071648 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [17:00:04] PROBLEM - Uncommitted DNS changes in Netbox on netbox1003 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240909T1700) [17:00:05] ryankemper: Your horoscope predicts another Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240909T1700). [17:00:13] (03PS4) 10Bking: flink-app/rdf-streaming-updater: add calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071648 (https://phabricator.wikimedia.org/T373195) [17:01:22] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:02:03] !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:02:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [17:02:07] !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:03:08] RECOVERY - BFD status on lsw1-f2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:03:59] (03PS1) 10Cathal Mooney: Revert "Manually define BGP neighbors for cephosd1*** Anycast BGP" [puppet] - 10https://gerrit.wikimedia.org/r/1071656 [17:04:10] (03CR) 10CI reject: [V:04-1] Revert "Manually define BGP neighbors for cephosd1*** Anycast BGP" [puppet] - 10https://gerrit.wikimedia.org/r/1071656 (owner: 10Cathal Mooney) [17:04:15] (03Abandoned) 10Cathal 
Mooney: Revert "Manually define BGP neighbors for cephosd1*** Anycast BGP" [puppet] - 10https://gerrit.wikimedia.org/r/1071656 (owner: 10Cathal Mooney) [17:05:35] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add nes frack firewalls - pt1979@cumin2002" [17:05:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add nes frack firewalls - pt1979@cumin2002" [17:05:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:07:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [17:07:57] !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:08:01] !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:08:05] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: Wait for k8s maintenance jobs to stop [cookbooks] - 10https://gerrit.wikimedia.org/r/1070673 (https://phabricator.wikimedia.org/T359130) (owner: 10RLazarus) [17:08:15] (03PS5) 10Bking: flink-app/rdf-streaming-updater: add calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071648 (https://phabricator.wikimedia.org/T373195) [17:09:22] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers mw2424.codfw.wmnet, mw2396.codfw.wmnet, parse2017.codfw.wmnet, wikikube-worker2086.codfw.wmnet, wikikube-worker2063.codfw.wmnet, kubernetes2024.codfw.wmnet, mw2447.codfw.wmnet, kubernetes2048.codfw.wmnet, wikikube-worker2040.codfw.wmnet, 
parse2004.codfw.wmnet, wikikube-worker2022.codfw.wmnet, wikikube-worker2060.codfw.w [17:09:22] 2398.codfw.wmnet, wikikube-worker2002.codfw.wmnet, wikikube-worker2055.codfw.wmnet, kubernetes2016.codfw.wmnet, mw2449.codfw.wmnet, wikikube-worker2059.codfw.wmnet, mw2451.codfw.wmnet, wikikube-worker2105.codfw.wmnet, wikikube-worker2014.codfw.wmnet, kubernetes2056.codfw.wmnet, mw2399.codfw.wmnet, wikikube-worker2101.codfw.wmnet, wikikube-worker2075.codfw.wmnet, wikikube-worker2087.codfw.wmnet, wikikube-worker2056.codfw.wmnet, wikikube-wo [17:09:22] .codfw.wmnet, parse2014.codfw.wmnet, wikikube-worker2104.codfw.wmnet, kubernetes2021.codfw.wmnet, wikikube-worker2037.codfw.wmnet, mw2450.codfw.wmnet, wikikube-worker2100.codfw.wmnet, p https://wikitech.wikimedia.org/wiki/PyBal [17:10:04] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers kubernetes2056.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2084.codfw.wmnet, wikikube-worker2091.codfw.wmnet, parse2018.codfw.wmnet, wikikube-worker2071.codfw.wmnet, wikikube-worker2010.codfw.wmnet, wikikube-worker2027.codfw.wmnet, wikikube-worker2030.codfw.wmnet, kubernetes2006.codfw.wmnet, wikikube-wo [17:10:04] .codfw.wmnet, mw2313.codfw.wmnet, wikikube-worker2090.codfw.wmnet, wikikube-worker2014.codfw.wmnet, kubernetes2039.codfw.wmnet, wikikube-worker2045.codfw.wmnet, wikikube-worker2105.codfw.wmnet, kubernetes2036.codfw.wmnet, wikikube-worker2075.codfw.wmnet, wikikube-worker2087.codfw.wmnet, kubernetes2044.codfw.wmnet, mw2301.codfw.wmnet, mw2416.codfw.wmnet, wikikube-worker2049.codfw.wmnet, parse2014.codfw.wmnet, mw2373.codfw.wmnet, wikikube-w [17:10:04] 7.codfw.wmnet, wikikube-worker2012.codfw.wmnet, mw2450.codfw.wmnet, wikikube-worker2100.codfw.wmnet, wikikube-worker2085.codfw.wmnet, wikikube-worker2080.codfw.wmnet, mw2445.codfw.wmnet https://wikitech.wikimedia.org/wiki/PyBal [17:10:24] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL 
OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:11:04] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:11:25] (03CR) 10Bking: "Oops, sorry I missed those steps. Should be fixed now." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071648 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [17:12:23] (03PS1) 10Hnowlan: Enable Copyupload-allowed-domain on testwiki, disable on test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071659 (https://phabricator.wikimedia.org/T356241) [17:15:04] RECOVERY - Uncommitted DNS changes in Netbox on netbox1003 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:15:39] !log ebernhardson@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:15:53] !log ebernhardson@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:21:32] (03PS1) 10Dwisehaupt: icinga: add monitoring for new frack hosts [puppet] - 10https://gerrit.wikimedia.org/r/1071664 (https://phabricator.wikimedia.org/T369937) [17:21:53] (03PS1) 10Cathal Mooney: Revert cephosd Bird config to peer with switch link-local IPs [puppet] - 10https://gerrit.wikimedia.org/r/1071665 (https://phabricator.wikimedia.org/T374379) [17:22:33] jouncebot: nowandnext [17:22:33] For the next 0 hour(s) and 37 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240909T1700) [17:22:33] For the next 0 hour(s) and 7 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240909T1700) [17:22:33] In 2 hour(s) and 37 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240909T2000) [17:23:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by 
jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071266 (https://phabricator.wikimedia.org/T374241) (owner: 10Jforrester) [17:23:37] (03PS6) 10Bking: flink-app/rdf-streaming-updater: add calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071648 (https://phabricator.wikimedia.org/T373195) [17:23:46] I'll slip out a logging fix. [17:24:11] (03PS2) 10Dwisehaupt: icinga: add monitoring for new frack hosts [puppet] - 10https://gerrit.wikimedia.org/r/1071664 (https://phabricator.wikimedia.org/T369937) [17:26:33] (03CR) 10Cathal Mooney: [C:03+2] Revert cephosd Bird config to peer with switch link-local IPs [puppet] - 10https://gerrit.wikimedia.org/r/1071665 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [17:28:12] (03CR) 10Dzahn: [C:03+2] icinga: add monitoring for new frack hosts [puppet] - 10https://gerrit.wikimedia.org/r/1071664 (https://phabricator.wikimedia.org/T369937) (owner: 10Dwisehaupt) [17:29:10] (03CR) 10Bking: flink-app/rdf-streaming-updater: add calico network policies (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071648 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [17:31:43] (03Merged) 10jenkins-bot: ZObjectStore::findZTesterResult: Trim our own error so we don't break logstash [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071266 (https://phabricator.wikimedia.org/T374241) (owner: 10Jforrester) [17:31:52] (03CR) 10BCornwall: [C:03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1071606 (owner: 10Muehlenhoff) [17:31:55] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1071266|ZObjectStore::findZTesterResult: Trim our own error so we don't break logstash (T374241)]] [17:31:58] T374241: wikifunctions.org failures in codfw with 414 error - https://phabricator.wikimedia.org/T374241 [17:32:04] (03CR) 10Dzahn: [C:03+2] "ran puppet on alert2001. you should see the new hosts on icinga now as PENDING" [puppet] - 10https://gerrit.wikimedia.org/r/1071664 (https://phabricator.wikimedia.org/T369937) (owner: 10Dwisehaupt) [17:32:10] PROBLEM - BFD status on lsw1-f2-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:32:16] PROBLEM - BGP status on lsw1-f2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:33:49] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1071266|ZObjectStore::findZTesterResult: Trim our own error so we don't break logstash (T374241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:36:14] (03PS1) 10Ladsgroup: tables-catalog: Add rest of core tables [puppet] - 10https://gerrit.wikimedia.org/r/1071667 (https://phabricator.wikimedia.org/T363581) [17:39:21] (03CR) 10CI reject: [V:04-1] tables-catalog: Add rest of core tables [puppet] - 10https://gerrit.wikimedia.org/r/1071667 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [17:39:26] !log jforrester@deploy1003 jforrester: Continuing with sync [17:41:30] (03PS2) 10Ladsgroup: tables-catalog: Add rest of core tables [puppet] - 10https://gerrit.wikimedia.org/r/1071667 (https://phabricator.wikimedia.org/T363581) [17:44:00] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1071266|ZObjectStore::findZTesterResult: Trim our own error so we don't break logstash (T374241)]] (duration: 12m 05s) [17:44:03] 
T374241: wikifunctions.org failures in codfw with 414 error - https://phabricator.wikimedia.org/T374241 [17:44:41] (03PS5) 10Scott French: sre.switchdc.mediawiki: migrate to the class API [cookbooks] - 10https://gerrit.wikimedia.org/r/1068896 (https://phabricator.wikimedia.org/T328908) [17:44:41] (03PS5) 10Scott French: sre.switchdc.mediawiki: add --task-id argument [cookbooks] - 10https://gerrit.wikimedia.org/r/1068897 (https://phabricator.wikimedia.org/T330273) [17:44:41] (03PS5) 10Scott French: sre.switchdc.mediawiki: use admin reason in puppet disable [cookbooks] - 10https://gerrit.wikimedia.org/r/1068898 (https://phabricator.wikimedia.org/T330273) [17:44:42] (03PS5) 10Scott French: sre.switchdc.mediawiki: record RO start/end in task [cookbooks] - 10https://gerrit.wikimedia.org/r/1068899 (https://phabricator.wikimedia.org/T330273) [17:45:03] (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: Add rest of core tables [puppet] - 10https://gerrit.wikimedia.org/r/1071667 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [17:46:29] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for Philippe Saade - https://phabricator.wikimedia.org/T374008#10131231 (10KFrancis) Hi all, I'm confirming the NDA has been signed. Thanks! 
[17:50:10] RECOVERY - BFD status on lsw1-f2-eqiad.mgmt is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:51:16] RECOVERY - BGP status on lsw1-f2-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:53:10] PROBLEM - BFD status on lsw1-f2-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:54:07] (03CR) 10Scott French: [C:03+1] Enable Copyupload-allowed-domain on testwiki, disable on test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071659 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [17:55:21] (03PS3) 10Jdlrobson: Enable appearance menu for all logged in users on all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070354 (https://phabricator.wikimedia.org/T371020) [17:57:02] (03CR) 10Brouberol: [C:03+1] flink-app/rdf-streaming-updater: add calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071648 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [18:15:17] (03CR) 10Jdlrobson: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071281 (https://phabricator.wikimedia.org/T373585) (owner: 10Stoyofuku-wmf) [18:16:28] (03PS1) 10Brouberol: datahub-gms: create a Service to allow inter-kube communication [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071673 (https://phabricator.wikimedia.org/T374384) [18:16:30] (03PS1) 10Brouberol: airflow: enable testing external connections from the CLI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071674 (https://phabricator.wikimedia.org/T374384) [18:16:31] (03PS1) 10Brouberol: airflow-test-k8s: integrate directly with the datahub REST API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071675 (https://phabricator.wikimedia.org/T374384) [18:20:57] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - 
https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:21:54] (03PS1) 10Cathal Mooney: Use global unicast to peer from cephosd but allow LL for BFD in [puppet] - 10https://gerrit.wikimedia.org/r/1071677 (https://phabricator.wikimedia.org/T374379) [18:25:34] (03CR) 10Dwisehaupt: "Thanks. We are seeing them come on through." [puppet] - 10https://gerrit.wikimedia.org/r/1071664 (https://phabricator.wikimedia.org/T369937) (owner: 10Dwisehaupt) [18:27:47] (03CR) 10Cathal Mooney: "diffs here: https://puppet-compiler.wmflabs.org/output/1071677/3929/" [puppet] - 10https://gerrit.wikimedia.org/r/1071677 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [18:28:48] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10131435 (10Dwisehaupt) 05Open→03Resolved Both hosts are built and set to monitored. frdb2005 awaiting space in C8 for it's final location. [18:32:46] (03CR) 10Ssingh: [C:03+1] "I won't pretend to understand the extent of the issue outlined in the commit message but otherwise looks good and restricted that I don't " [puppet] - 10https://gerrit.wikimedia.org/r/1071677 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [18:35:50] (03CR) 10JHathaway: "Yeah it is confusing, the error comes from looking up `profile::cassandra::settings` in the cassandra profile. 
In the case of ml_cache the" [puppet] - 10https://gerrit.wikimedia.org/r/1071020 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [18:35:57] (03CR) 10JHathaway: puppet8: remove ssl_keystore_location, always set ssl_key_password (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1065283 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [18:37:26] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: BFD won't esablish between QFX in VRF and host from IPv6 link-local - https://phabricator.wikimedia.org/T374379#10131477 (10cmooney) Ok through trial and error it would appear the issue is something to do with the switch not dealing well wi... [18:38:46] (03CR) 10Ssingh: [C:03+1] "Thanks for the clarification. Should be a NOOP but will roll it out tomorrow morning." [puppet] - 10https://gerrit.wikimedia.org/r/1065283 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [18:39:47] ACKNOWLEDGEMENT - BFD status on lsw1-f2-eqiad.mgmt is CRITICAL: Down: 1 Cathal Mooney Issue with BFD on link-local for cephosd - see T374379 - The acknowledgement expires at: 2024-09-10 18:39:20. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:40:09] ACKNOWLEDGEMENT - BFD status on lsw1-e1-eqiad.mgmt is CRITICAL: Down: 1 Cathal Mooney Issue with BFD on link-local for cephosd - see T374379 - The acknowledgement expires at: 2024-09-10 18:40:02. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:48:09] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T374386 (10EChukwukere-WMF) 03NEW [18:48:16] (03CR) 10Cathal Mooney: [C:03+2] Add new global IPv6 private range to base firewall defs [puppet] - 10https://gerrit.wikimedia.org/r/1070592 (https://phabricator.wikimedia.org/T330153) (owner: 10Cathal Mooney) [18:49:49] (03CR) 10Bking: [C:03+2] flink-app/rdf-streaming-updater: add calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071648 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [18:51:10] (03CR) 10Ebernhardson: [C:03+1] "This should be ready for deploy, i can't fit it into my schedule today but should be able to deploy it tomorrow." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060433 (https://phabricator.wikimedia.org/T371401) (owner: 10DCausse) [18:52:45] (03CR) 10Ssingh: [C:03+1] "I noticed this in the PCC output for I363327e15b8faef582709bf13c9b5a2167fdd384. Taking back my comments on IRC, I think this is fine to me" [puppet] - 10https://gerrit.wikimedia.org/r/1065286 (https://phabricator.wikimedia.org/T366900) (owner: 10JHathaway) [18:53:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: asw2-d2-eqid <-> asw2-d4-eqiad vcp link flapping - https://phabricator.wikimedia.org/T374272#10131535 (10cmooney) So far things seem stable with this. I will leave task open to review as the week goes on, also considering if we need t... 
[19:01:01] (03PS4) 10Jbond: P:tlsproxy::instance: Drop numa_networking global [puppet] - 10https://gerrit.wikimedia.org/r/724733 (https://phabricator.wikimedia.org/T263578) [19:01:06] (03CR) 10JHathaway: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/724733 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [19:03:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [19:03:16] (03PS3) 10Jbond: realm.pp: drop $other_site global [puppet] - 10https://gerrit.wikimedia.org/r/971461 (https://phabricator.wikimedia.org/T350008) [19:05:05] Hey ops, is there anyone able/willing to help me run a maintenance script as part of a config deploy today? [19:05:18] (03CR) 10JHathaway: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971461 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [19:06:08] (context: moswiki is about to come out of incubator, before that happens MOS needs to become a namespace on a number of wikis, that will require running namespaceDupes.php and maybe cleanupTitles.php after deploying a config patch) [19:08:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [19:08:29] cscott: we run namespaceDupes routinely and we've recently fixed cleanupTitles as well, so whoever is deploying should be able to run them for you. 
or if you're the one deploying, i can look up the commands for you [19:13:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [19:15:11] (03PS1) 10Jforrester: ZObjectStructureValidator::validate: use set_time_limit() to limit in the case of run-away JsonSchema [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071690 (https://phabricator.wikimedia.org/T374241) [19:15:58] (03CR) 10Hnowlan: [C:03+1] Rebuild against latest package versions in bookworm: [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1071638 (owner: 10Muehlenhoff) [19:16:01] jouncebot: nowandnext [19:16:01] No deployments scheduled for the next 0 hour(s) and 43 minute(s) [19:16:01] In 0 hour(s) and 43 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240909T2000) [19:16:12] I'll slip out a UBN hopeful-fix. [19:16:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071690 (https://phabricator.wikimedia.org/T374241) (owner: 10Jforrester) [19:16:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) (owner: 10C. 
Scott Ananian) [19:17:53] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1061 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:18:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [19:20:31] jouncebot: next [19:20:31] In 0 hour(s) and 39 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240909T2000) [19:24:10] (03CR) 10Dzahn: [V:04-1] "also have to turn the data type into "String OR Array of Strings" in the profile class. https://puppet-compiler.wmflabs.org/output/1071028" [puppet] - 10https://gerrit.wikimedia.org/r/1071028 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [19:24:31] * James_F paints go-faster stripes on CI. [19:24:34] MatmaRex: ok, I scheduled the backport for next hour [19:24:55] (03Merged) 10jenkins-bot: ZObjectStructureValidator::validate: use set_time_limit() to limit in the case of run-away JsonSchema [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071690 (https://phabricator.wikimedia.org/T374241) (owner: 10Jforrester) [19:25:10] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1071690|ZObjectStructureValidator::validate: use set_time_limit() to limit in the case of run-away JsonSchema (T374241)]] [19:25:13] T374241: wikifunctions.org failures in codfw with 414 error - https://phabricator.wikimedia.org/T374241 [19:25:19] cscott: For MOS:? 
[19:25:27] * James_F shudders [19:25:44] cscott: i found the docs at https://wikitech.wikimedia.org/wiki/Adding_Namespaces#Deployment , they look up-to-date [19:26:11] (03CR) 10AOkoth: [C:03+2] vrts: run install script on new server [puppet] - 10https://gerrit.wikimedia.org/r/1063733 (owner: 10AOkoth) [19:26:11] (03PS6) 10Dzahn: phabricator: syntax fixes for firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1071028 (https://phabricator.wikimedia.org/T370677) [19:26:46] cscott: ah, you're skipping enwiki for now? [19:27:11] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1071690|ZObjectStructureValidator::validate: use set_time_limit() to limit in the case of run-away JsonSchema (T374241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:28:57] !log jforrester@deploy1003 jforrester: Continuing with sync [19:28:58] (03CR) 10Dzahn: [V:03+1 C:03+1] "working now: https://puppet-compiler.wmflabs.org/output/1071028/3932/phab1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1071028 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [19:29:17] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:30:11] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 3.042 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:30:34] (03CR) 10Dzahn: [C:03+2] phabricator: syntax fixes for firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1071028 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [19:32:09] (03PS1) 10AOkoth: vrts: fix test command path [puppet] - 10https://gerrit.wikimedia.org/r/1071693 (https://phabricator.wikimedia.org/T373420) [19:33:17] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:33:29] 
!log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1071690|ZObjectStructureValidator::validate: use set_time_limit() to limit in the case of run-away JsonSchema (T374241)]] (duration: 08m 19s) [19:33:32] T374241: wikifunctions.org failures in codfw with 414 error - https://phabricator.wikimedia.org/T374241 [19:33:56] (03CR) 10Dzahn: [C:03+1] "thanks Moritz! @Arnold we can do this together, this time it should work" [puppet] - 10https://gerrit.wikimedia.org/r/1055495 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [19:34:07] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:37:48] (03CR) 10AOkoth: [C:03+2] vrts: fix test command path [puppet] - 10https://gerrit.wikimedia.org/r/1071693 (https://phabricator.wikimedia.org/T373420) (owner: 10AOkoth) [19:37:59] (03CR) 10Dzahn: [C:03+2] "noop in prod confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/1071028 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [19:45:07] MatmaRex: yes, starting slow [19:47:53] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1061 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:49:45] James_F: yes [19:51:00] Ideal is to do enwiki in a couple of days, but check for unexpected problems on the smaller wikis first before tackling the 2000-ish MOS: pages on enwiki [19:51:18] Ack. [19:55:18] 06SRE, 06Infrastructure-Foundations, 10netops: Routed Ganeti: Add support for VM QoS marking - https://phabricator.wikimedia.org/T374392 (10cmooney) 03NEW p:05Triage→03Medium [19:59:17] (03PS6) 10C. 
Scott Ananian: Elevate pseudo-namespace MOS to a real namespace on most wikis which use it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) [19:59:25] (03CR) 10Brouberol: [C:03+1] kafka-main: Replace kafka-main2002 with kafka-main2007 [puppet] - 10https://gerrit.wikimedia.org/r/1071610 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [19:59:45] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.51% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:59:56] (03CR) 10Brouberol: [C:03+1] Reduce airflow-analytics log retention from 90 to 60 days [puppet] - 10https://gerrit.wikimedia.org/r/1071621 (https://phabricator.wikimedia.org/T370437) (owner: 10Btullis) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240909T2000). [20:00:04] physikerwelt, toyofuku, Nemoralis, Jdlrobson, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:27] o/ [20:00:42] i'm here. as I noted above ^ in backlog, my patch (1070975) requires maintenance scripts to run after it is deployed [20:00:48] Hey, let me take a look [20:01:19] MatmaRex I think volunteered to do the maintenance script part maybe? [20:01:19] I'll self-deploy my patch and Jdlrobson 's if it's any easier for you [20:01:44] i'm not a deployer. 
i can guide whoever is deploying if you need help [20:01:45] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.51% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:02:31] o/ [20:03:29] OK, I'll run the deploy. Give me a moment to plan the order [20:06:03] Is physikerwelt here or anyone to speak to "[config] 1071037 (deploy commands) Enable native MathML by default on group0 - task T373703" [20:06:03] T373703: Enable native mathml rendering by default on group0 and test wikis in production - https://phabricator.wikimedia.org/T373703 [20:07:13] In that case I'm going to deploy toyofuku, Nemoralis, and Jdlrobson first [20:07:39] thank you!! [20:07:42] Second I'll deploy cscott's. I'll need your help for the maintenance script part MatmaRex [20:07:52] my patch is connected to 2 other patches (one in core, one in Scribunto). Is it possible to merge them too? [20:09:42] kindrobot: heads up that [1.43.0-wmf.21] 1071202 (deploy commands) Fix typo in browser vendor prefix is impacting users and since it's a backport, might take a little longer to merge. [20:10:11] Nemoralis: most likely not. What are the patches? 
[20:10:19] Jdlrobson: understood [20:11:34] kindrobot: https://phabricator.wikimedia.org/T367009 [20:13:20] * kindrobot looks [20:13:55] (03PS1) 10Andrea Denisse: alert: Make alert2002 the active host for corto [puppet] - 10https://gerrit.wikimedia.org/r/1071700 (https://phabricator.wikimedia.org/T372418) [20:15:18] (03PS1) 10Andrea Denisse: alert: Make alert1002 the active host for corto [puppet] - 10https://gerrit.wikimedia.org/r/1071701 (https://phabricator.wikimedia.org/T372418) [20:15:55] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for EChukwukere-WMF - https://phabricator.wikimedia.org/T374386#10131835 (10Nemoralis) [20:16:34] Nemoralis: so to be clear, all of those have to be deployed together? [20:18:08] I don't think it matters, they will be automatically deployed anyway [20:18:48] (03CR) 10Herron: [C:03+1] alert: Make alert1002 the active host for corto [puppet] - 10https://gerrit.wikimedia.org/r/1071701 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [20:19:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kindrobot@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071281 (https://phabricator.wikimedia.org/T373585) (owner: 10Stoyofuku-wmf) [20:19:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kindrobot@deploy1003 using scap backport" [core] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071202 (https://phabricator.wikimedia.org/T374180) (owner: 10Jforrester) [20:19:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kindrobot@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070354 (https://phabricator.wikimedia.org/T371020) (owner: 10Jdlrobson) [20:19:12] (03CR) 10Herron: [C:03+1] alert: Make alert2002 the active host for corto [puppet] - 10https://gerrit.wikimedia.org/r/1071700 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [20:19:42] I've started toyofuku and Jdlrobson [20:19:56] (03Merged) 10jenkins-bot: Release 
donate link to pilot wikis (French Wikipedia and Wikifunctions) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071281 (https://phabricator.wikimedia.org/T373585) (owner: 10Stoyofuku-wmf) [20:21:13] 🤘 [20:21:52] Jdlrobson: could you fix merge conflicts: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1070354 [20:22:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kindrobot@deploy1003 using scap backport" [core] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071202 (https://phabricator.wikimedia.org/T374180) (owner: 10Jforrester) [20:22:34] Jdlrobson so it wasn't just me who heard the rolling stones in their head after what kindrobot said? [20:23:51] Currently merging: 1071281 and 1071202 [20:24:24] I'll rebase for Jdlrobson [20:24:38] (03PS4) 10Jdlrobson: Enable appearance menu for all logged in users on all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070354 (https://phabricator.wikimedia.org/T371020) [20:25:01] Nemoralis: so merging in just 1070347 will be for today? [20:25:09] toyofuku: ack [20:25:18] Should be all set now! [20:25:34] kindrobot: yes [20:26:50] OK, then I'll deploy 1070347 ( Nemoralis ) and 1070354 ( Jdlrobson ) next. I'll do 1070975 ( cscott ) after if time permits [20:30:06] the maintenance scripts should run very quickly on the affected wikis, but there are 10 of them, so that's 10 separate --wiki=xxxwiki commands to run. 
[20:30:22] "them" = 10 affected wikis [20:31:44] MatmaRex: the needed command is in the commit message, but you have to prefix `mwscript` and add the appropriate `--wiki=...` arguments -- and I'm not 100% convinced by the order of arguments on https://wikitech.wikimedia.org/wiki/Adding_Namespaces#Deployment because I would think that `--wiki` is an argument to `mwscript` not an argument to `namespaceDupes.php` [20:34:00] I will need the exact ten commands to run, perhaps as a comment on the task or the patch [20:35:04] cscott: it is actually path first and --wiki second, whether it makes sense or not. run `mwscript --help` or read here: https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/6d2af1f000440334e4007d6a59405b5495d17a31/multiversion/MWScript.php#27 or see recent examples in https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:42] (and --wiki must be exactly the second parameter too) [20:37:35] cscott: the commands given in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1070975 look correct, just replace `php maintenance/run.php namespaceDupes.php` with `mwscript namespaceDupes.php --wiki=…` (and likewise with cleanupTitles.php) [20:39:23] (03CR) 10DCausse: flink-app/rdf-streaming-updater: add calico network policies (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071648 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [20:43:25] ok, I'll put the full list in a comment on phab so they are editable if I screw up and also so they are easily available when we do the next step patch with enwiki in a few days. [20:43:49] (03CR) 10Muehlenhoff: "Not yet; https://gerrit.wikimedia.org/r/c/operations/puppet/+/1070273 needs to be merged first." [puppet] - 10https://gerrit.wikimedia.org/r/1055495 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:44:31] probably alternate namespaceDupes.php and cleanupTitles.php so we don't leave things broken too long? 
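The argument order discussed above (script path first, `--wiki` exactly second) and the idea of alternating the two scripts per wiki can be sketched as a small command generator. This is only a minimal sketch, not the command list from the commit message or the task: the wiki names are placeholders, the helper `build_commands` is hypothetical, and any extra script arguments are simply passed through unchanged.

```python
import shlex


def build_commands(wikis, dupes_args=(), cleanup_args=()):
    """Return mwscript command lines, two per wiki, alternating the scripts.

    Assumption: the per-script arguments come from the commit message and
    are passed in by the caller; none are invented here.
    """
    commands = []
    for wiki in wikis:
        for script, extra in (
            ("namespaceDupes.php", dupes_args),
            ("cleanupTitles.php", cleanup_args),
        ):
            # Script path comes first and --wiki is exactly the second
            # parameter, as stated in the discussion above.
            argv = ["mwscript", script, f"--wiki={wiki}", *extra]
            commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    # Placeholder database names; the real list of ten wikis is not shown here.
    for line in build_commands(["examplewiki", "samplewiki"]):
        print(line)
```

Running each wiki's `namespaceDupes.php` and `cleanupTitles.php` back to back, rather than all of one script and then all of the other, keeps the window in which a single wiki is half-migrated short.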
[20:46:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers mw2396.codfw.wmnet, wikikube-worker2033.codfw.wmnet, parse2017.codfw.wmnet, wikikube-worker2086.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2017.codfw.wmnet, mw2375.codfw.wmnet, wikikube-worker2026.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2338.codfw.wmnet, mw2447.codfw.wmnet, kubernetes2052.codfw [20:46:27] kubernetes2048.codfw.wmnet, wikikube-worker2091.codfw.wmnet, mw2394.codfw.wmnet, wikikube-worker2071.codfw.wmnet, wikikube-worker2010.codfw.wmnet, mw2351.codfw.wmnet, kubernetes2056.codfw.wmnet, parse2020.codfw.wmnet, wikikube-worker2027.codfw.wmnet, kubernetes2006.codfw.wmnet, wikikube-worker2097.codfw.wmnet, wikikube-worker2060.codfw.wmnet, wikikube-worker2023.codfw.wmnet, mw2398.codfw.wmnet, wikikube-worker2041.codfw.wmnet, mw2313.codf [20:46:27] wikikube-worker2090.codfw.wmnet, wikikube-worker2062.codfw.wmnet, mw2449.codfw.wmnet, wikikube-worker2050.codfw.wmnet, mw2440.codfw.wmnet, wikikube-worker2098.codfw.wmnet, mw2451.codfw https://wikitech.wikimedia.org/wiki/PyBal [20:46:35] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers kubernetes2046.codfw.wmnet, wikikube-worker2021.codfw.wmnet, mw2396.codfw.wmnet, wikikube-worker2033.codfw.wmnet, wikikube-worker2086.codfw.wmnet, wikikube-worker2081.codfw.wmnet, mw2375.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2338.codfw.wmnet, mw2370.codfw.wmnet, wikikube-worker2099.codfw.wmnet, kubernetes2014. 
[20:46:35] net, kubernetes2048.codfw.wmnet, parse2018.codfw.wmnet, wikikube-worker2083.codfw.wmnet, mw2315.codfw.wmnet, wikikube-worker2071.codfw.wmnet, parse2004.codfw.wmnet, wikikube-worker2044.codfw.wmnet, wikikube-worker2022.codfw.wmnet, mw2427.codfw.wmnet, mw2419.codfw.wmnet, wikikube-worker2043.codfw.wmnet, kubernetes2006.codfw.wmnet, wikikube-worker2060.codfw.wmnet, wikikube-worker2002.codfw.wmnet, mw2313.codfw.wmnet, wikikube-worker2090.codf [20:46:35] parse2013.codfw.wmnet, wikikube-worker2062.codfw.wmnet, kubernetes2016.codfw.wmnet, mw2449.codfw.wmnet, mw2397.codfw.wmnet, wikikube-worker2050.codfw.wmnet, mw2356.codfw.wmnet, kuberne https://wikitech.wikimedia.org/wiki/PyBal [20:48:15] (03PS1) 10Bking: flink-app/rdf-streaming-updater: remove rdf-specific changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071706 (https://phabricator.wikimedia.org/T373195) [20:48:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:48:31] cscott: it's looking like we won't have time to deploy yours in this window, but maybe that's for the best, to get a chance to hammer down the maintenance scripts [20:48:35] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:48:59] (03Merged) 10jenkins-bot: Fix typo in browser vendor prefix [core] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071202 (https://phabricator.wikimedia.org/T374180) (owner: 10Jforrester) [20:49:10] !log kindrobot@deploy1003 Started scap sync-world: Backport for [[gerrit:1071281|Release donate link to pilot wikis (French Wikipedia and Wikifunctions) (T373585)]], [[gerrit:1071202|Fix typo in browser vendor prefix (T374180)]] [20:49:15] T373585: Deploy new donation entry point - https://phabricator.wikimedia.org/T373585 [20:49:15] T374180: Codex icons not loading on Vector 2022 via Chrome - 
https://phabricator.wikimedia.org/T374180 [20:51:06] !log kindrobot@deploy1003 jforrester, toyofuku, kindrobot: Backport for [[gerrit:1071281|Release donate link to pilot wikis (French Wikipedia and Wikifunctions) (T373585)]], [[gerrit:1071202|Fix typo in browser vendor prefix (T374180)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:51:19] 1071281 and 1071202 are ready for toyofuku and Jdlrobson [20:51:26] Testing mine now! [20:51:34] s/ready for/ready for testing [20:51:42] testing [20:51:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [20:53:15] 1071202 is working kindrobot please sync [20:53:19] I'm getting a few "Sep 9, 2024 @ 20:51:05.408 mediawiki RevisionStore WARNING Could not load user for revision 1" warnings. Related or worrisome? [20:53:46] Looking good for me! [20:53:49] MatmaRex: could you check https://phabricator.wikimedia.org/T363538#10131953 and see if I got it right? [20:54:42] (03CR) 10C. Scott Ananian: "Maintenance script commands to run after deploy are listed in:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) (owner: 10C. Scott Ananian) [20:54:50] toyofuku Jdlrobson is this worrisome: mediawiki RevisionStore WARNING Could not load user for revision 1 [20:55:19] I don't thiiiiink that has anything to do with my change [20:55:27] Do you have a link to the error? 
[20:55:38] (03PS2) 10Bking: flink-app/rdf-streaming-updater: remove rdf-specific changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071706 (https://phabricator.wikimedia.org/T373195) [20:56:09] "Could not load user for revision 1" is not a new error: https://logstash.wikimedia.org/goto/407c71eb9b3b6ec006f112189580a9ff [20:56:28] Unfortunately, my hotel wifi is broken, but I have now figured joining with mobile data. Would someone be willing to deploy https://gerrit.wikimedia.org/r/c/1071037/ or should we postpone it? [20:56:29] or a warning, rather [20:56:53] (03PS12) 10Dzahn: phabricator: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055493 (https://phabricator.wikimedia.org/T370677) [20:56:53] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/output/1055493/3934/phab1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1055493 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:57:11] jouncebot: nowandnext [20:57:11] For the next 0 hour(s) and 2 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240909T2000) [20:57:12] In 0 hour(s) and 2 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240909T2100) [20:57:18] OK. Merging [20:57:21] !log kindrobot@deploy1003 jforrester, toyofuku, kindrobot: Continuing with sync [20:57:29] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/output/1055493/3934/phab1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1055493 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:57:44] Unfortunately that'll bring us to the end of the window. [20:59:20] 06SRE, 10Ganeti, 13Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904#10131982 (10BCornwall) [21:00:04] Reedy, sbassett, Maryum, and manfredi: #bothumor My software never has bugs. It just develops random features. 
Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240909T2100). [21:00:27] cscott: re https://phabricator.wikimedia.org/T363538#10131953, remove the "maintenance/run.php" from the commands, otherwise looks good. [21:01:35] cscott: so namespaceDupes.php's --source-pseudo-namespace parameter is apparently case-sensitive, and that's why you're following up with cleanupTitles.php to take care of titles with other letter case? [21:01:44] physikerwelt: we won't be able to deploy it in this window, please reschedule it for an upcoming window. Sorry [21:01:49] !log kindrobot@deploy1003 Finished scap sync-world: Backport for [[gerrit:1071281|Release donate link to pilot wikis (French Wikipedia and Wikifunctions) (T373585)]], [[gerrit:1071202|Fix typo in browser vendor prefix (T374180)]] (duration: 12m 39s) [21:01:53] kindrobot: looks like a Wikibase error (Could not load user for revision 1) [21:01:55] T373585: Deploy new donation entry point - https://phabricator.wikimedia.org/T373585 [21:01:55] T374180: Codex icons not loading on Vector 2022 via Chrome - https://phabricator.wikimedia.org/T374180 [21:02:18] !log finished UTC late backport window. BACKPORTED: 1071281, 1071202, NOT backported (ran out of time): 1071037, 1070347, 1070354, 1070975 [21:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:21] cscott: FYI, cleanupTitles.php scans the whole `page` table and so it will take a couple of minutes on enwiki. no big deal, we ran it on all wikis recently, just making a note so that you don't worry when it's not instant. 
[21:02:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers mw2424.codfw.wmnet, mw2396.codfw.wmnet, wikikube-worker2033.codfw.wmnet, kubernetes2056.codfw.wmnet, wikikube-worker2063.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2026.codfw.wmnet, kubernetes2024.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2338.codfw.wmnet, parse20 [21:02:27] .wmnet, wikikube-worker2099.codfw.wmnet, kubernetes2014.codfw.wmnet, wikikube-worker2091.codfw.wmnet, wikikube-worker2076.codfw.wmnet, kubernetes2059.codfw.wmnet, wikikube-worker2040.codfw.wmnet, parse2018.codfw.wmnet, wikikube-worker2083.codfw.wmnet, mw2315.codfw.wmnet, wikikube-worker2071.codfw.wmnet, wikikube-worker2044.codfw.wmnet, wikikube-worker2022.codfw.wmnet, kubernetes2052.codfw.wmnet, kubernetes2022.codfw.wmnet, wikikube-worker [21:02:27] fw.wmnet, wikikube-worker2027.codfw.wmnet, kubernetes2038.codfw.wmnet, wikikube-worker2097.codfw.wmnet, mw2359.codfw.wmnet, mw2313.codfw.wmnet, kubernetes2013.codfw.wmnet, wikikube-work https://wikitech.wikimedia.org/wiki/PyBal [21:02:35] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers wikikube-worker2021.codfw.wmnet, mw2396.codfw.wmnet, wikikube-worker2050.codfw.wmnet, wikikube-worker2033.codfw.wmnet, wikikube-worker2086.codfw.wmnet, wikikube-worker2063.codfw.wmnet, parse2006.codfw.wmnet, wikikube-worker2081.codfw.wmnet, mw2375.codfw.wmnet, wikikube-worker2026.codfw.wmnet, mw2338.codfw.wmnet, kuberne [21:02:35] codfw.wmnet, parse2003.codfw.wmnet, kubernetes2059.codfw.wmnet, mw2315.codfw.wmnet, wikikube-worker2071.codfw.wmnet, wikikube-worker2010.codfw.wmnet, mw2351.codfw.wmnet, mw2427.codfw.wmnet, mw2440.codfw.wmnet, wikikube-worker2030.codfw.wmnet, wikikube-worker2043.codfw.wmnet, mw2313.codfw.wmnet, wikikube-worker2090.codfw.wmnet, kubernetes2006.codfw.wmnet, 
wikikube-worker2055.codfw.wmnet, wikikube-worker2089.codfw.wmnet, kubernetes2039.codf [21:02:35] wikikube-worker2062.codfw.wmnet, mw2449.codfw.wmnet, mw2397.codfw.wmnet, mw2413.codfw.wmnet, mw2356.codfw.wmnet, kubernetes2022.codfw.wmnet, kubernetes2042.codfw.wmnet, mw2451.codfw.wm https://wikitech.wikimedia.org/wiki/PyBal [21:02:37] Thank you everyone, sorry we didn't get to every patch [21:03:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:03:35] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:03:39] kindrobot: thank you [21:04:05] is there a button to reschedule? [21:05:09] physikerwelt: no special button, just add it again to another window, most likely using https://schedule-deployment.toolforge.org/ [21:05:36] so my patch is not deployed [21:05:48] (03PS18) 10Cwhite: ci: define statsd prometheus exporter mappings for zuul [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T233089) [21:05:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 10 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071037 (https://phabricator.wikimedia.org/T373703) (owner: 10Physikerwelt) [21:08:07] kindrobot: thank you. I think I made it. One can not see that this is attempt two [21:08:35] (03CR) 10Ryan Kemper: [C:03+1] flink-app/rdf-streaming-updater: remove rdf-specific changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071706 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [21:09:20] physikerwelt: great! 
Yeah, I don't believe there's any tooling for detecting/reporting that this is a second attempt :( [21:10:38] (03PS1) 10Jdlrobson: Support new heading layout [extensions/QuickSurveys] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071708 (https://phabricator.wikimedia.org/T374377) [21:10:44] (03PS2) 10Jdlrobson: Support new heading layout [extensions/QuickSurveys] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071708 (https://phabricator.wikimedia.org/T373039) [21:10:54] (03PS3) 10Jdlrobson: Support new heading layout [extensions/QuickSurveys] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071708 (https://phabricator.wikimedia.org/T373039) [21:10:58] MatmaRex: on the gerrit patch there was a comment that, "Note enwiki also has a number of pages with title Talk:MOS:xxx. They can not be fixed by namespaceDupes.php, but can be fixed by cleanupTitles.php, which also requires a talk namespace being set up." [21:11:26] MatmaRex: I think they actually *are* fixed by namespaceDupes.php, but it seemed like running cleanupTitles.php as well wouldn't hurt? [21:11:57] it won't hurt [21:12:15] (03CR) 10Herron: [C:03+1] ci: define statsd prometheus exporter mappings for zuul [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T233089) (owner: 10Cwhite) [21:13:59] MatmaRex: ok, recorded all that in the phab task and fixed the commands, so hopefully that will all be ready for attempt #2, whenever that is. [21:14:36] cscott: in general i think you don't need to worry as much about this. 
mediawiki copes with having funny titles in the database just fine, we've had hundreds of them for years until like a month ago, when pppery decided to fix them [21:14:56] https://phabricator.wikimedia.org/T196088 [21:15:44] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [21:15:48] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [21:17:57] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [21:18:00] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [21:19:27] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [21:19:28] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [21:19:34] kindrobot: one can see it in the ticket if the deployment bot was used, and in the previous deployment window there are green check marks, so I think one can figure it out when one is interested [21:19:40] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [21:19:43] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [21:20:03] MatmaRex: more worried about incurring the wrath of enwiki by breaking their MOS:xxxx links. [21:21:16] i don't fear mariadb or the SREs but the wrath of enwiki copyeditors whose voluminous citations to the manual of style are broken. :) [21:21:22] PROBLEM - Host cr2-magru is DOWN: PING CRITICAL - Packet loss = 100% [21:21:50] RECOVERY - Host cr2-magru is UP: PING OK - Packet loss = 0%, RTA = 115.63 ms [21:22:03] huh [21:22:17] acked the alert. [21:22:22] Here to help [21:22:30] interesting ... here as well [21:22:34] it recovered but what happened?!
[21:22:54] very odd yeah [21:23:40] nothing really in the log suggesting an issue [21:23:51] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [21:24:17] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [21:24:32] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [21:25:38] not near a laptop sadly [21:25:52] topranks: cr1-magru looks fine? [21:26:08] denisse: we got a recovery here but not on victorops [21:26:28] librenms "recent events" says `2024-09-09 21:20:19 Device status changed to Down from snmp check.` [21:26:46] and has no subsequent recovery event [21:26:58] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [21:27:06] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [21:29:05] (03CR) 10Dzahn: [C:03+1] "ah, yes, of course. I just saw that +1 there, we will be patient :)" [puppet] - 10https://gerrit.wikimedia.org/r/1055495 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [21:29:20] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [21:29:24] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [21:31:12] cr1-magru's et-0/0/0 toward cr2-magru certainly looks fine? 
e.g., traffic flowing between them as if cr2 is in fact alive [21:31:37] it's 100% alive [21:31:42] and 100% had no outage [21:31:56] bgp sessions are stable for weeks on it etc [21:32:07] so it's a question of why the alert fired [21:32:40] ah, and indeed now `2024-09-09 21:25:13 Device status changed to Up from snmp check.` [21:34:57] in any case, thank you very much, topranks [21:35:01] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [21:35:02] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [21:35:05] topranks: ok that's good at least :] [21:35:12] indeed! [21:35:18] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [21:35:20] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [21:35:35] I can file a task later so that we can look into it [21:35:45] especially given cr* and paging alert [21:36:04] swfrench-wmf: of course feel free to file one since you are here [21:36:31] sukhe: yes, can do [21:36:57] the other ques is [21:37:06] why hasn't victorops recovered [21:37:09] host clearly is up [21:37:12] (03PS1) 10Bartosz Dziewoński: Remove unused $wmgPoweredByMediaWikiIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071711 [21:37:13] (03PS1) 10Bartosz Dziewoński: Improve $wgFooterIcons override, remove $wmgWikimediaIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071712 [21:37:37] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [21:37:54] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [21:40:19] (03CR) 10C. Scott Ananian: "Ran out of time in the backport window. Will try again tomorrow(ish)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) (owner: 10C. 
Scott Ananian) [21:42:07] I'm gonna log off, definitely not seeing any sign of an issue that caused that. So kind of stumped [21:42:20] as for victorops - may be some librenms quirk unsure [21:43:27] (03CR) 10Bking: [C:03+2] flink-app/rdf-streaming-updater: remove rdf-specific changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071706 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [21:43:36] (03PS1) 10Lucas Werkmeister: errorpage: Remove redundant 'unknown' $reqId fallback [puppet] - 10https://gerrit.wikimedia.org/r/1071714 [21:43:36] (03PS1) 10Lucas Werkmeister: errorpage: Include request ID early in HTML source [puppet] - 10https://gerrit.wikimedia.org/r/1071715 [21:44:09] topranks: <3 [21:45:10] (03CR) 10Lucas Werkmeister: "Optional suggestion :) follows up T291192 / Idb96796483 I suppose." [puppet] - 10https://gerrit.wikimedia.org/r/1071715 (owner: 10Lucas Werkmeister) [21:46:48] sukhe: https://phabricator.wikimedia.org/T374401 [21:47:08] feel free to add any additional follow-up thoughts you might have when you get a chance [21:50:29] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers mw2424.codfw.wmnet, parse2001.codfw.wmnet, wikikube-worker2033.codfw.wmnet, parse2017.codfw.wmnet, wikikube-worker2086.codfw.wmnet, wikikube-worker2063.codfw.wmnet, parse2006.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2026.codfw.wmnet, mw2338.codfw.wmnet, parse2009.codfw.wmnet, mw2370.codfw.wmnet, mw23 [21:50:29] .wmnet, kubernetes2014.codfw.wmnet, mw2443.codfw.wmnet, kubernetes2048.codfw.wmnet, wikikube-worker2091.codfw.wmnet, wikikube-worker2077.codfw.wmnet, wikikube-worker2040.codfw.wmnet, wikikube-worker2083.codfw.wmnet, mw2394.codfw.wmnet, parse2004.codfw.wmnet, kubernetes2050.codfw.wmnet, wikikube-worker2010.codfw.wmnet, wikikube-worker2031.codfw.wmnet, kubernetes2056.codfw.wmnet, parse2020.codfw.wmnet, wikikube-worker2027.codfw.wmnet, wikik [21:50:29] 
er2030.codfw.wmnet, wikikube-worker2043.codfw.wmnet, wikikube-worker2097.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikikube-worker2060.codfw.wmnet, mw2398.codfw.wmnet, wikikube-work https://wikitech.wikimedia.org/wiki/PyBal [21:50:37] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers mw2424.codfw.wmnet, kubernetes2046.codfw.wmnet, wikikube-worker2079.codfw.wmnet, mw2396.codfw.wmnet, parse2001.codfw.wmnet, parse2017.codfw.wmnet, kubernetes2056.codfw.wmnet, wikikube-worker2063.codfw.wmnet, parse2006.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2017.codf [21:50:37] wikikube-worker2026.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2338.codfw.wmnet, mw2447.codfw.wmnet, wikikube-worker2099.codfw.wmnet, kubernetes2014.codfw.wmnet, mw2443.codfw.wmnet, kubernetes2048.codfw.wmnet, wikikube-worker2091.codfw.wmnet, wikikube-worker2077.codfw.wmnet, wikikube-worker2076.codfw.wmnet, wikikube-worker2040.codfw.wmnet, wikikube-worker2083.codfw.wmnet, mw2394.codfw.wmnet, wikikube-worker2071.codfw.wmnet, parse200 [21:50:37] wmnet, kubernetes2050.codfw.wmnet, mw2427.codfw.wmnet, parse2020.codfw.wmnet, wikikube-worker2030.codfw.wmnet, kubernetes2006.codfw.wmnet, wikikube-worker2097.codfw.wmnet, wikikube-work https://wikitech.wikimedia.org/wiki/PyBal [21:51:27] PROBLEM - Host an-worker1168 is DOWN: PING CRITICAL - Packet loss = 100% [21:51:31] hey I just wanted to check what happened with https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1070354 during the deploy window - it says its +2ed but didn't merge? 
[21:53:29] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:53:37] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:56:57] RECOVERY - Host an-worker1168 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [21:59:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [22:04:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [22:04:29] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers kubernetes2046.codfw.wmnet, kubernetes2056.codfw.wmnet, wikikube-worker2063.codfw.wmnet, parse2006.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2338.codfw.wmnet, mw2370.codfw.wmnet, wikikube-worker2084.codfw.wmnet, kubernetes2014.codfw.wmnet, wikikube-worker2071.codfw.wmnet, parse2004 [22:04:29] mnet, mw2427.codfw.wmnet, wikikube-worker2027.codfw.wmnet, wikikube-worker2097.codfw.wmnet, wikikube-worker2060.codfw.wmnet, mw2398.codfw.wmnet, mw2359.codfw.wmnet, wikikube-worker2090.codfw.wmnet, wikikube-worker2055.codfw.wmnet, parse2013.codfw.wmnet, wikikube-worker2062.codfw.wmnet, mw2449.codfw.wmnet, mw2397.codfw.wmnet, mw2314.codfw.wmnet, mw2440.codfw.wmnet, wikikube-worker2098.codfw.wmnet, wikikube-worker2105.codfw.wmnet, mw2399.co [22:04:29] t, kubernetes2036.codfw.wmnet, 
mw2444.codfw.wmnet, wikikube-worker2013.codfw.wmnet, kubernetes2044.codfw.wmnet, kubernetes2051.codfw.wmnet, wikikube-worker2106.codfw.wmnet, mw2336.codfw https://wikitech.wikimedia.org/wiki/PyBal [22:04:37] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers wikikube-worker2079.codfw.wmnet, parse2017.codfw.wmnet, parse2006.codfw.wmnet, wikikube-worker2017.codfw.wmnet, kubernetes2024.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2338.codfw.wmnet, mw2447.codfw.wmnet, mw2370.codfw.wmnet, wikikube-worker2084.codfw.wmnet, mw2443.codfw.wmnet, wikikube-worker2091.codfw.wmnet, wi [22:04:37] orker2076.codfw.wmnet, wikikube-worker2040.codfw.wmnet, parse2018.codfw.wmnet, wikikube-worker2071.codfw.wmnet, kubernetes2050.codfw.wmnet, wikikube-worker2010.codfw.wmnet, wikikube-worker2022.codfw.wmnet, wikikube-worker2027.codfw.wmnet, wikikube-worker2030.codfw.wmnet, wikikube-worker2096.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikikube-worker2023.codfw.wmnet, mw2359.codfw.wmnet, wikikube-worker2002.codfw.wmnet, wikikube-worker209 [22:04:37] wmnet, mw2302.codfw.wmnet, kubernetes2006.codfw.wmnet, wikikube-worker2089.codfw.wmnet, kubernetes2039.codfw.wmnet, wikikube-worker2052.codfw.wmnet, mw2394.codfw.wmnet, mw2314.codfw.wmn https://wikitech.wikimedia.org/wiki/PyBal [22:05:31] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:05:37] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:07:45] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 118 probes of 706 (alerts on 90) - https://atlas.ripe.net/measurements/59935539/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:12:43] RECOVERY - IPv6 ping to esams on ripe-atlas-esams 
IPv6 is OK: OK - failed 48 probes of 706 (alerts on 90) - https://atlas.ripe.net/measurements/59935539/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:18:00] (03CR) 10Stoyofuku-wmf: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070354 (https://phabricator.wikimedia.org/T371020) (owner: 10Jdlrobson) [22:20:57] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [22:43:32] thcipriani: would it be alright if I deployed https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1070354 right now outside of a deploy window? [22:44:02] Reason being it was supposed to be deployed during UTC late window but there was an issue with gerrit infra that prevented it from being merged in time and then the window closed [22:44:44] We could also wait until tomorrow, but I'm a little concerned with it being in this weird `Ready to submit` state [22:44:45] toyofuku: yep, fine by me, as long as you're around for a bit [22:44:56] I'm here for the next hour and 15ish [22:44:59] Thank you! [22:45:36] Alright gang, we're gonna do an out of band deploy [22:45:57] * thcipriani holds on to butts [22:45:58] here! 
[22:47:03] (03CR) 10TrainBranchBot: "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070354 (https://phabricator.wikimedia.org/T371020) (owner: 10Jdlrobson) [22:47:10] looking good so far [22:47:26] that got gate and submit unstuck, phew [22:47:50] (03Merged) 10jenkins-bot: Enable appearance menu for all logged in users on all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070354 (https://phabricator.wikimedia.org/T371020) (owner: 10Jdlrobson) [22:48:02] !log toyofuku@deploy1003 Started scap sync-world: Backport for [[gerrit:1070354|Enable appearance menu for all logged in users on all projects (T371020)]] [22:48:06] T371020: Roll out appearance menu and font size change to sister projects - https://phabricator.wikimedia.org/T371020 [22:50:01] !log toyofuku@deploy1003 toyofuku, jdlrobson: Backport for [[gerrit:1070354|Enable appearance menu for all logged in users on all projects (T371020)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:50:17] PROBLEM - Hadoop NodeManager on an-worker1104 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:50:24] Jdlrobson: lmk when you're done testing! [22:54:36] toyofuku: yep on it [22:54:45] ty ty [22:55:49] PROBLEM - Hadoop NodeManager on an-worker1149 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:55:59] toyofuku: lgtm! 
please sync [22:56:03] !log toyofuku@deploy1003 toyofuku, jdlrobson: Continuing with sync [23:00:35] PROBLEM - Hadoop NodeManager on an-worker1147 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:00:43] !log toyofuku@deploy1003 Finished scap sync-world: Backport for [[gerrit:1070354|Enable appearance menu for all logged in users on all projects (T371020)]] (duration: 12m 40s) [23:00:46] T371020: Roll out appearance menu and font size change to sister projects - https://phabricator.wikimedia.org/T371020 [23:01:01] All done! [23:01:08] Thank you so much everyone [23:03:49] RECOVERY - Hadoop NodeManager on an-worker1149 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:03:59] PROBLEM - Hadoop NodeManager on an-worker1128 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:04:15] Thank you! 
[23:09:01] PROBLEM - Hadoop NodeManager on an-worker1169 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:09:15] PROBLEM - Hadoop NodeManager on an-worker1161 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:09:17] RECOVERY - Hadoop NodeManager on an-worker1104 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:10:07] PROBLEM - Hadoop NodeManager on analytics1076 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:11:01] PROBLEM - Hadoop NodeManager on an-worker1141 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:11:17] PROBLEM - Hadoop NodeManager on an-worker1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:11:59] RECOVERY - Hadoop NodeManager on an-worker1128 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:12:01] RECOVERY - Hadoop 
NodeManager on an-worker1169 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:22:35] RECOVERY - Hadoop NodeManager on an-worker1147 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:31:15] RECOVERY - Hadoop NodeManager on an-worker1161 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:31:17] RECOVERY - Hadoop NodeManager on an-worker1108 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:34:07] RECOVERY - Hadoop NodeManager on analytics1076 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:38:27] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071722 [23:38:28] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071722 (owner: 10TrainBranchBot) [23:39:03] RECOVERY - Hadoop NodeManager on an-worker1141 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:50:35] PROBLEM - Hadoop NodeManager on an-worker1156 is 
CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:56:17] PROBLEM - Hadoop NodeManager on an-worker1136 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process