[00:10:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:12:10] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:17:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:18:40] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:20:04] (CR) Tim Starling: [C: +1] Explicitly disable all local imagescaling on k8s (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/987432 (https://phabricator.wikimedia.org/T352515) (owner: Giuseppe Lavagetto)
[00:31:14] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:55] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/989180
[00:39:01] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/989180 (owner: TrainBranchBot)
[00:48:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:50:51] (PS1) Stoyofuku-wmf: [WIP] Full screen for index namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/989262
[00:53:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:00:47] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/989180 (owner: TrainBranchBot)
[01:07:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:12:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:15:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:56:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:01:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:22:49] (PS1) DDesouza: research-landing-page: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/989264 (https://phabricator.wikimedia.org/T352583)
[02:30:17] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:11] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:09:11] (JobUnavailable)
firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:10:50] RECOVERY - PHP opcache health on mw2353 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:34:06] PROBLEM - PHP opcache health on mw2282 is CRITICAL: CRITICAL: opcache full on php 7.4. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:38:41] (CR) Gergő Tisza: [C: +1] beta: Enable conditional defaults for 4 Echo properties [mediawiki-config] - https://gerrit.wikimedia.org/r/987964 (https://phabricator.wikimedia.org/T353225) (owner: Urbanecm)
[03:50:19] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[04:23:19] (PS1) Tim Starling: Disable SameSite legacy cookies [mediawiki-config] - https://gerrit.wikimedia.org/r/989265 (https://phabricator.wikimedia.org/T344791)
[04:41:50] SRE, WMF-General-or-Unknown: "Our servers are currently under maintenance" page shown on HTTP 429 - https://phabricator.wikimedia.org/T354718 (Tgr)
[04:46:56] RECOVERY - PHP opcache health on mw2444 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[04:57:46] PROBLEM - cassandra-a CQL 10.192.16.82:9042 on restbase2013 is CRITICAL: connect to address 10.192.16.82 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[04:57:48] PROBLEM - cassandra-a SSL 10.192.16.82:7000 on restbase2013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[06:41:40] (PS1) Marostegui: es20[35-40]: Add new hosts [puppet] - https://gerrit.wikimedia.org/r/989295 (https://phabricator.wikimedia.org/T354674)
[06:43:05] (CR) Marostegui: [C: +2] es20[35-40]: Add new hosts [puppet] - https://gerrit.wikimedia.org/r/989295 (https://phabricator.wikimedia.org/T354674) (owner: Marostegui)
[06:48:28] (PS1) Marostegui: es20[35-40]: Insetup [puppet] - https://gerrit.wikimedia.org/r/989382 (https://phabricator.wikimedia.org/T354674)
[06:49:43] (PS2) Marostegui: es20[35-40]: Insetup [puppet] - https://gerrit.wikimedia.org/r/989382 (https://phabricator.wikimedia.org/T354674)
[06:50:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2143.codfw.wmnet with reason: host reimage
[06:53:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2143.codfw.wmnet with reason: host reimage
[06:54:02] (CR) Marostegui: [C: +2] es20[35-40]: Insetup [puppet] - https://gerrit.wikimedia.org/r/989382 (https://phabricator.wikimedia.org/T354674) (owner: Marostegui)
[06:57:12] (PS1) Marostegui: installserver: Add db2[096-120] to partman [puppet] - https://gerrit.wikimedia.org/r/989383 (https://phabricator.wikimedia.org/T354210)
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240110T0700)
[07:00:54] (CR) Marostegui: [C: +2] installserver: Add db2[096-120] to partman [puppet] - https://gerrit.wikimedia.org/r/989383 (https://phabricator.wikimedia.org/T354210) (owner: Marostegui)
[07:05:07] (CR) Kosta Harlan: "What is the timeline for deploying this?"
[puppet] - https://gerrit.wikimedia.org/r/988453 (https://phabricator.wikimedia.org/T325147) (owner: Kamila Součková)
[07:06:43] SRE, ops-codfw, DBA: db2143 not rebooting - https://phabricator.wikimedia.org/T354593 (Marostegui) Open→Resolved Thank you for the fast response! The host is back and the reimage went fine.
[07:09:12] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:15:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2143.codfw.wmnet with OS bookworm
[07:23:33] PROBLEM - PHP opcache health on mw2281 is CRITICAL: CRITICAL: opcache full on php 7.4. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:50:19] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[07:51:08] (CR) Muehlenhoff: [C: +2] Remove obsolete Hiera files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/989086 (https://phabricator.wikimedia.org/T296533) (owner: Muehlenhoff)
[07:53:11] !log installing openjdk-8 security updates
[07:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:22] (PS2) KartikMistry: testwiki: Enable Section translation on WPs with Content Translation available as default [mediawiki-config] - https://gerrit.wikimedia.org/r/988984 (https://phabricator.wikimedia.org/T351882)
[07:56:23] Amir1: https://deploy-commands.toolforge.org/ seems down?
[07:58:43] RECOVERY - PHP opcache health on mw2278 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:59:35] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/989212 (https://phabricator.wikimedia.org/T345602) (owner: Ayounsi)
[08:00:05] Amir1 and Urbanecm: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240110T0800)
[08:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:54] Here I'm; Will do self deploy..
[08:01:05] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/988984 (https://phabricator.wikimedia.org/T351882) (owner: KartikMistry)
[08:01:50] (Merged) jenkins-bot: testwiki: Enable Section translation on WPs with Content Translation available as default [mediawiki-config] - https://gerrit.wikimedia.org/r/988984 (https://phabricator.wikimedia.org/T351882) (owner: KartikMistry)
[08:03:06] !log kartik@deploy2002 Started scap: Backport for [[gerrit:988984|testwiki: Enable Section translation on WPs with Content Translation available as default (T351882)]]
[08:03:10] T351882: Enable Section translation on Wikipedias with Content Translation available as default - https://phabricator.wikimedia.org/T351882
[08:03:37] (PS1) Peter Fischer: enable page_rerender for 4th batch of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/989442 (https://phabricator.wikimedia.org/T351503)
[08:04:51] !log kartik@deploy2002 kartik: Backport for [[gerrit:988984|testwiki: Enable Section translation on WPs with Content Translation available as default (T351882)]] synced to the testservers
(https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:06:11] !log kartik@deploy2002 kartik: Continuing with sync
[08:06:41] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:12:16] !log kartik@deploy2002 Finished scap: Backport for [[gerrit:988984|testwiki: Enable Section translation on WPs with Content Translation available as default (T351882)]] (duration: 09m 10s)
[08:12:20] T351882: Enable Section translation on Wikipedias with Content Translation available as default - https://phabricator.wikimedia.org/T351882
[08:12:40] Done with my config deployment..
[08:12:52] (PS1) Peter Fischer: Search update pipeline: 4th batch page_rerender [deployment-charts] - https://gerrit.wikimedia.org/r/989443 (https://phabricator.wikimedia.org/T351503)
[08:15:07] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:22:39] PROBLEM - Juniper alarms on cr2-codfw is CRITICAL: JNX_ALARMS CRITICAL - 2 red alarms, 1 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[08:24:07] PROBLEM - Check systemd state on ncredir2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_lvs_realserver_mss.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:24:20] (CR) Ayounsi: [C: +2] Remove mentions of ganeti-test1001/2 and 2004 [puppet] - https://gerrit.wikimedia.org/r/989212 (https://phabricator.wikimedia.org/T345602) (owner: Ayounsi)
[08:25:31] (CR) DCausse: [C: +1] enable page_rerender for 4th batch of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/989442 (https://phabricator.wikimedia.org/T351503) (owner: Peter Fischer)
[08:25:39]
RECOVERY - Check systemd state on ncredir2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:27:32] (CR) Muehlenhoff: [C: +2] rsync::quickdatacopy: Remove use_generic_firewall and auto_ferm_ipv6 flags [puppet] - https://gerrit.wikimedia.org/r/989103 (owner: Muehlenhoff)
[08:28:53] (CR) Hashar: contint: use php7.4 on bullseye just like on buster (3 comments) [puppet] - https://gerrit.wikimedia.org/r/987458 (https://phabricator.wikimedia.org/T334517) (owner: Dzahn)
[08:29:09] jouncebot: nowandnext
[08:29:09] For the next 0 hour(s) and 30 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240110T0800)
[08:29:09] In 2 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240110T1100)
[08:30:54] going to deploy pfischer's config patch
[08:34:43] (CR) TrainBranchBot: [C: +2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/989442 (https://phabricator.wikimedia.org/T351503) (owner: Peter Fischer)
[08:35:28] (Merged) jenkins-bot: enable page_rerender for 4th batch of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/989442 (https://phabricator.wikimedia.org/T351503) (owner: Peter Fischer)
[08:35:50] !log dcausse@deploy2002 Started scap: Backport for [[gerrit:989442|enable page_rerender for 4th batch of wikis (T351503)]]
[08:35:54] T351503: Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503
[08:37:37] !log dcausse@deploy2002 pfischer and dcausse: Backport for [[gerrit:989442|enable page_rerender for 4th batch of wikis (T351503)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:39:49] (PS1) Muehlenhoff: rsync: Remove support for auto_ferm [puppet] -
https://gerrit.wikimedia.org/r/989444
[08:40:19] !log dcausse@deploy2002 pfischer and dcausse: Continuing with sync
[08:40:31] (Abandoned) Muehlenhoff: rsync::server::module: drop auto_ferm_ipv6 parameter [puppet] - https://gerrit.wikimedia.org/r/850173 (owner: Jbond)
[08:41:33] !log installing Exim security updates
[08:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:53] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on mw1349.eqiad.wmnet with reason: Trying to reproduce wdat_wdt watchdog problem
[08:42:03] (Abandoned) Hashar: python-build: set date of source files in the wheels [docker-images/production-images] - https://gerrit.wikimedia.org/r/940157 (https://phabricator.wikimedia.org/T342346) (owner: Hashar)
[08:42:07] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mw1349.eqiad.wmnet with reason: Trying to reproduce wdat_wdt watchdog problem
[08:46:37] RECOVERY - PHP opcache health on mw2279 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[08:47:11] saw failed to sync mw1349.eqiad.wmnet port 22: Connection timed out but seems expected since Alex depooled it
[08:47:41] !log dcausse@deploy2002 Finished scap: Backport for [[gerrit:989442|enable page_rerender for 4th batch of wikis (T351503)]] (duration: 11m 50s)
[08:47:45] T351503: Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503
[08:49:08] pfischer: deploy should be done
[08:52:24] (CR) Hashar: gerrit: make LDAP groups visible to users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/987120 (https://phabricator.wikimedia.org/T354069) (owner: Hashar)
[08:56:41] (CR) Muehlenhoff: [C: +1] gerrit: make LDAP groups visible to users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/987120 (https://phabricator.wikimedia.org/T354069)
(owner: Hashar)
[08:57:40] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 13150
[08:59:58] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 13150
[09:00:24] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 15133
[09:01:51] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 15133
[09:04:05] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:04:57] (PS1) Slyngshede: C:idm::deployment Display UID from LDAP [puppet] - https://gerrit.wikimedia.org/r/989446 (https://phabricator.wikimedia.org/T338825)
[09:07:47] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:16:23] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/989446 (https://phabricator.wikimedia.org/T338825) (owner: Slyngshede)
[09:17:25] (CR) Muehlenhoff: [C: +2] nftables: On Buster install nftables and libnftnl from backports [puppet] - https://gerrit.wikimedia.org/r/987439 (https://phabricator.wikimedia.org/T354279) (owner: Muehlenhoff)
[09:21:20] (CR) Slyngshede: [C: +2] C:idm::deployment Display UID from LDAP [puppet] - https://gerrit.wikimedia.org/r/989446 (https://phabricator.wikimedia.org/T338825) (owner: Slyngshede)
[09:23:35] SRE, Infrastructure-Foundations: Further enhancements for nftables support in profile::firewall - https://phabricator.wikimedia.org/T348498 (MoritzMuehlenhoff)
[09:24:08] SRE, Infrastructure-Foundations, Patch-For-Review: Processing of config file includes broken in Buster / nftables 0.9.0 -
https://phabricator.wikimedia.org/T354279 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff I've updated Puppet to pull nftables from backports when using nftables...
[09:25:36] (CR) Muehlenhoff: contint: use php7.4 on bullseye just like on buster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/987458 (https://phabricator.wikimedia.org/T334517) (owner: Dzahn)
[09:26:54] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (ArthurTaylor) So I'm relatively new at the organisation - I don't really have enough of an overview to be able to say in a fine-grained way what access I will n...
[09:27:03] (CR) DCausse: [C: +1] Search update pipeline: 4th batch page_rerender [deployment-charts] - https://gerrit.wikimedia.org/r/989443 (https://phabricator.wikimedia.org/T351503) (owner: Peter Fischer)
[09:38:26] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on mw1378.eqiad.wmnet with reason: Trying to reproduce wdat_wdt watchdog problem
[09:38:40] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mw1378.eqiad.wmnet with reason: Trying to reproduce wdat_wdt watchdog problem
[09:38:48] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on mw1349.eqiad.wmnet with reason: Trying to reproduce wdat_wdt watchdog problem
[09:38:51] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mw1349.eqiad.wmnet with reason: Trying to reproduce wdat_wdt watchdog problem
[09:42:54] ops-codfw: cr2-codfw:FPC0 failure - https://phabricator.wikimedia.org/T354732 (ayounsi)
[09:43:04] ops-codfw: cr2-codfw:FPC0 failure - https://phabricator.wikimedia.org/T354732 (ayounsi) @Papaul could you take care of it ?
[09:53:07] !log hashar@deploy2002 Started deploy [integration/docroot@355ddbb]: Dummy deploy to test git safe.directory # T335354
[09:53:11] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts testreduce1001.eqiad.wmnet
[09:53:13] !log hashar@deploy2002 Finished deploy [integration/docroot@355ddbb]: Dummy deploy to test git safe.directory # T335354 (duration: 00m 06s)
[09:55:11] !log installing git security updates on deployment hosts
[09:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:51] !log hashar@deploy2002 Started deploy [integration/docroot@355ddbb]: (no justification provided)
[09:55:56] !log hashar@deploy2002 Finished deploy [integration/docroot@355ddbb]: (no justification provided) (duration: 00m 04s)
[09:57:33] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:59:16] (PS6) Clément Goubert: service.yaml: add iPoid to the service catalogue [puppet] - https://gerrit.wikimedia.org/r/928487 (https://phabricator.wikimedia.org/T325147) (owner: Effie Mouzeli)
[09:59:24] (CR) Clément Goubert: service.yaml: add iPoid to the service catalogue (1 comment) [puppet] - https://gerrit.wikimedia.org/r/928487 (https://phabricator.wikimedia.org/T325147) (owner: Effie Mouzeli)
[10:00:18] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: sync
[10:00:33] (PS3) WMDE-Fisch: [beta] Allow Cite events for reference previews baseline stats [mediawiki-config] - https://gerrit.wikimedia.org/r/989204 (https://phabricator.wikimedia.org/T353798)
[10:00:37] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: sync
[10:00:59] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: sync
[10:01:15] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: sync
[10:02:13] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox:
testreduce1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[10:03:00] (CR) WMDE-Fisch: [beta] Allow Cite events for reference previews baseline stats (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/989204 (https://phabricator.wikimedia.org/T353798) (owner: WMDE-Fisch)
[10:03:04] (PS1) Muehlenhoff: buster updates [puppet] - https://gerrit.wikimedia.org/r/989451
[10:03:27] (PS2) Muehlenhoff: buster updates [puppet] - https://gerrit.wikimedia.org/r/989451
[10:04:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testreduce1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[10:04:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:04:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testreduce1001.eqiad.wmnet
[10:06:59] (PS1) Muehlenhoff: Remove puppet references to testreduce1001 [puppet] - https://gerrit.wikimedia.org/r/989452 (https://phabricator.wikimedia.org/T345220)
[10:07:46] (CR) Muehlenhoff: [C: +2] buster updates [puppet] - https://gerrit.wikimedia.org/r/989451 (owner: Muehlenhoff)
[10:14:13] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1063/console" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/989240 (https://phabricator.wikimedia.org/T354221) (owner: Dzahn)
[10:14:41] (CR) JMeybohm: Helm chart for k8s-controller-sidecars (8 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/988847 (https://phabricator.wikimedia.org/T348284) (owner: RLazarus)
[10:15:22] (CR) MVernon: [C: +1] "This seems plausible to me (though this is not an expert review!)."
[puppet] - https://gerrit.wikimedia.org/r/984516 (owner: Muehlenhoff)
[10:15:26] (CR) Muehlenhoff: [C: +2] Remove puppet references to testreduce1001 [puppet] - https://gerrit.wikimedia.org/r/989452 (https://phabricator.wikimedia.org/T345220) (owner: Muehlenhoff)
[10:15:31] (PS2) Muehlenhoff: Remove puppet references to testreduce1001 [puppet] - https://gerrit.wikimedia.org/r/989452 (https://phabricator.wikimedia.org/T345220)
[10:16:02] (CR) Jelto: [V: +1 C: +1] "lgtm, one nit inline" [puppet] - https://gerrit.wikimedia.org/r/989240 (https://phabricator.wikimedia.org/T354221) (owner: Dzahn)
[10:17:11] (CR) JMeybohm: admin_ng: Install k8s-controller-sidecars (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/988848 (https://phabricator.wikimedia.org/T348284) (owner: RLazarus)
[10:22:06] (CR) Jelto: [C: +1] "lgtm" [deployment-charts] - https://gerrit.wikimedia.org/r/989264 (https://phabricator.wikimedia.org/T352583) (owner: DDesouza)
[10:22:38] RECOVERY - PHP opcache health on mw2426 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:23:05] (CR) Jelto: [C: +2] research-landing-page: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/989264 (https://phabricator.wikimedia.org/T352583) (owner: DDesouza)
[10:24:19] (Merged) jenkins-bot: research-landing-page: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/989264 (https://phabricator.wikimedia.org/T352583) (owner: DDesouza)
[10:26:36] (PS1) Muehlenhoff: Move Puppet 7 config towards the testreduce role [puppet] - https://gerrit.wikimedia.org/r/989454 (https://phabricator.wikimedia.org/T345220)
[10:27:49] ops-codfw: cr2-codfw:FPC0 failure - https://phabricator.wikimedia.org/T354732 (ayounsi)
[10:29:47] (PS1) Alexandros Kosiaris: kubernetes::node: Blacklist wdat_wdt kernel module [puppet] -
https://gerrit.wikimedia.org/r/989455 (https://phabricator.wikimedia.org/T354413)
[10:29:58] (CR) Muehlenhoff: [C: +2] Move Puppet 7 config towards the testreduce role [puppet] - https://gerrit.wikimedia.org/r/989454 (https://phabricator.wikimedia.org/T345220) (owner: Muehlenhoff)
[10:32:34] (CR) Alexandros Kosiaris: "Moritz, I 've shied away from overall blacklisting the module via base. We have identified a batch of hosts that do use it but don't exhib" [puppet] - https://gerrit.wikimedia.org/r/989455 (https://phabricator.wikimedia.org/T354413) (owner: Alexandros Kosiaris)
[10:33:17] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[10:34:12] (PS3) Btullis: Bring an-master1003 into service as a hadoop::master [puppet] - https://gerrit.wikimedia.org/r/989213 (https://phabricator.wikimedia.org/T332573)
[10:34:14] (PS3) Btullis: Bring an-master1004 into service as a hadoop::standby [puppet] - https://gerrit.wikimedia.org/r/989214 (https://phabricator.wikimedia.org/T332573)
[10:39:04] (CR) Muehlenhoff: [C: +1] "Looks good to me. If we ever see this causing issues outside of the k8s workers, we can also move it to the base::kernel class later."
[puppet] - https://gerrit.wikimedia.org/r/989455 (https://phabricator.wikimedia.org/T354413) (owner: Alexandros Kosiaris)
[10:39:28] (CR) Hashar: [C: -1] contint: use php7.4 on bullseye just like on buster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/987458 (https://phabricator.wikimedia.org/T334517) (owner: Dzahn)
[10:39:45] (CR) Btullis: [V: +1] "PCC SUCCESS (NOOP 26 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - https://gerrit.wikimedia.org/r/989213 (https://phabricator.wikimedia.org/T332573) (owner: Btullis)
[10:40:29] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (ABran-WMF) >>! In T352974#9446142, @MoritzMuehlenhoff wrote: > One other option is that the TLS toolchain as used by Orchestrator be not handle...
[10:40:45] (CR) Jelto: [C: +2] "@Antoine let me know when this should be merged" [puppet] - https://gerrit.wikimedia.org/r/987120 (https://phabricator.wikimedia.org/T354069) (owner: Hashar)
[10:42:45] (CR) Hashar: "That can be merged at any time and that will be taken in account by restarting Gerrit which I can do any time (or at worse tomorrow when I" [puppet] - https://gerrit.wikimedia.org/r/987120 (https://phabricator.wikimedia.org/T354069) (owner: Hashar)
[10:46:06] !log installing curl security updates
[10:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:11] (PS1) Ayounsi: Add SameSite=strict attribute to NetworkProbeLimit cookie [puppet] - https://gerrit.wikimedia.org/r/989457 (https://phabricator.wikimedia.org/T342624)
[10:54:28] (PS2) Ayounsi: Add SameSite=Strict attribute to NetworkProbeLimit cookie [puppet] - https://gerrit.wikimedia.org/r/989457 (https://phabricator.wikimedia.org/T342624)
[10:57:34] (CR) Ayounsi: [C: +1] Add BGP session
between mr1-codfw and lsw1-a2-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/989224 (https://phabricator.wikimedia.org/T348164) (owner: 10Cathal Mooney) [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240110T1100) [11:01:43] (03PS1) 10Vgutierrez: lvs::realserver::ipip: Report errors on MSS monitoring [puppet] - 10https://gerrit.wikimedia.org/r/989459 (https://phabricator.wikimedia.org/T354721) [11:02:25] (03PS1) 10Klausman: profile::thanos: Add two more latency buckets to recording rule [puppet] - 10https://gerrit.wikimedia.org/r/989458 [11:03:05] !log installing PHP 7.3 security updates [11:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:57] (03CR) 10Brouberol: [C: 03+1] "PCC output looks good: NOOP for everything except the actual node." [puppet] - 10https://gerrit.wikimedia.org/r/989213 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis) [11:08:52] RECOVERY - PHP opcache health on mw2281 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:09:21] (03PS2) 10Alexandros Kosiaris: kubernetes::node: Blacklist wdat_wdt kernel module [puppet] - 10https://gerrit.wikimedia.org/r/989455 (https://phabricator.wikimedia.org/T354413) [11:09:23] (03PS1) 10Alexandros Kosiaris: kmod::blacklist: Allow also rmmoding modules [puppet] - 10https://gerrit.wikimedia.org/r/989460 (https://phabricator.wikimedia.org/T354413) [11:09:50] RECOVERY - PHP opcache health on mw2282 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:10:17] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:12:47] 
(03CR) 10CI reject: [V: 04-1] kubernetes::node: Blacklist wdat_wdt kernel module [puppet] - 10https://gerrit.wikimedia.org/r/989455 (https://phabricator.wikimedia.org/T354413) (owner: 10Alexandros Kosiaris) [11:13:10] (03PS1) 10Stevemunene: Remove puppet references for druid1004_6 [puppet] - 10https://gerrit.wikimedia.org/r/989461 (https://phabricator.wikimedia.org/T336043) [11:15:16] (03PS1) 10Hashar: stewards: remove umask parameter from git::clone [puppet] - 10https://gerrit.wikimedia.org/r/989463 (https://phabricator.wikimedia.org/T338277) [11:15:36] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989463 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [11:18:04] (03CR) 10Stevemunene: [C: 03+1] Bring an-master1003 into service as a hadoop::master [puppet] - 10https://gerrit.wikimedia.org/r/989213 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis) [11:23:44] (03CR) 10Hashar: "The umask was added to git::clone at a time when git did not have support for sharing a checkout between different users (which is `core.s" [puppet] - 10https://gerrit.wikimedia.org/r/989463 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [11:26:22] (03CR) 10Muehlenhoff: kmod::blacklist: Allow also rmmoding modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/989460 (https://phabricator.wikimedia.org/T354413) (owner: 10Alexandros Kosiaris) [11:26:42] (03CR) 10FNegri: [C: 03+1] mariadb: remove grants and firewall rules for dbproxy1018/9 [puppet] - 10https://gerrit.wikimedia.org/r/989088 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah) [11:26:47] (03CR) 10FNegri: [C: 03+1] Move dbproxy1018/9 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/988681 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah) [11:27:50] (03CR) 10JMeybohm: mediawiki: Support one-off jobs (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/988849 (https://phabricator.wikimedia.org/T341553) 
(owner: 10RLazarus) [11:28:26] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/982799 (owner: 10Slyngshede) [11:28:36] (03CR) 10JMeybohm: Add helmfile for running MediaWiki one-off jobs. (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/988850 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [11:28:40] (03CR) 10JMeybohm: deployment_server: Add mwscript_k8s (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/988851 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [11:30:45] (03PS2) 10Alexandros Kosiaris: kmod::blacklist: Allow also rmmoding modules [puppet] - 10https://gerrit.wikimedia.org/r/989460 (https://phabricator.wikimedia.org/T354413) [11:30:47] (03PS3) 10Alexandros Kosiaris: kubernetes::node: Blacklist wdat_wdt kernel module [puppet] - 10https://gerrit.wikimedia.org/r/989455 (https://phabricator.wikimedia.org/T354413) [11:31:32] (03CR) 10Btullis: [V: 03+1 C: 03+2] Bring an-master1003 into service as a hadoop::master [puppet] - 10https://gerrit.wikimedia.org/r/989213 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis) [11:32:06] (03CR) 10JMeybohm: deployment_server: Add mwscript_k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/988851 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [11:33:59] (03CR) 10Alexandros Kosiaris: kmod::blacklist: Allow also rmmoding modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/989460 (https://phabricator.wikimedia.org/T354413) (owner: 10Alexandros Kosiaris) [11:35:44] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/989460 (https://phabricator.wikimedia.org/T354413) (owner: 10Alexandros Kosiaris) [11:36:08] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [11:36:23] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [11:37:11] !log dani@deploy2002 
helmfile [staging] START helmfile.d/services/miscweb: apply [11:37:41] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [11:39:10] !log stevemunene@cumin1002 START - Cookbook sre.hosts.decommission for hosts druid1004.eqiad.wmnet [11:41:52] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [11:41:55] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [11:41:56] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [11:41:56] (03PS3) 10Alexandros Kosiaris: kmod::blacklist: Allow also rmmoding modules [puppet] - 10https://gerrit.wikimedia.org/r/989460 (https://phabricator.wikimedia.org/T354413) [11:41:58] (03PS4) 10Alexandros Kosiaris: kubernetes::node: Blacklist wdat_wdt kernel module [puppet] - 10https://gerrit.wikimedia.org/r/989455 (https://phabricator.wikimedia.org/T354413) [11:42:19] (03CR) 10Kamila Součková: [C: 03+1] kubernetes::node: Blacklist wdat_wdt kernel module [puppet] - 10https://gerrit.wikimedia.org/r/989455 (https://phabricator.wikimedia.org/T354413) (owner: 10Alexandros Kosiaris) [11:42:25] (03CR) 10Alexandros Kosiaris: kmod::blacklist: Allow also rmmoding modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/989460 (https://phabricator.wikimedia.org/T354413) (owner: 10Alexandros Kosiaris) [11:42:52] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/989460 (https://phabricator.wikimedia.org/T354413) (owner: 10Alexandros Kosiaris) [11:43:10] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [11:43:11] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [11:43:17] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [11:46:07] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [11:46:10] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [11:46:23] !log 
dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [11:46:26] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [11:46:35] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [11:47:04] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [11:50:19] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [11:51:07] (03CR) 10Kamila Součková: kmod::blacklist: Allow also rmmoding modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/989460 (https://phabricator.wikimedia.org/T354413) (owner: 10Alexandros Kosiaris) [11:51:14] !log stevemunene@cumin1002 START - Cookbook sre.dns.netbox [11:54:33] !log stevemunene@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: druid1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1002" [11:54:58] (03CR) 10Awight: [C: 03+1] "Solid. We might end up reducing the prefix match to "ext.cite." 
eventually, so we can include additional events, but this is great for no" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989204 (https://phabricator.wikimedia.org/T353798) (owner: 10WMDE-Fisch) [11:56:02] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: druid1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1002" [11:56:02] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:56:03] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts druid1004.eqiad.wmnet [12:05:48] !log stevemunene@cumin1002 START - Cookbook sre.hosts.decommission for hosts druid1005.eqiad.wmnet [12:10:31] (03PS1) 10Hnowlan: mw-jobrunner: increase replicas for parsoidCachePrewarm [deployment-charts] - 10https://gerrit.wikimedia.org/r/989488 (https://phabricator.wikimedia.org/T349796) [12:11:27] (03PS1) 10Hnowlan: changeprop-jobqueue: migrate parsoidCachePrewarm to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/989489 (https://phabricator.wikimedia.org/T349796) [12:12:32] (03PS1) 10Btullis: Allow deep merging of hadoop config overrides [puppet] - 10https://gerrit.wikimedia.org/r/989490 (https://phabricator.wikimedia.org/T332573) [12:18:11] !log stevemunene@cumin1002 START - Cookbook sre.dns.netbox [12:18:49] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 27 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/989490 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis) [12:20:19] !log stevemunene@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: druid1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1002" [12:20:21] (03CR) 
10Clément Goubert: [C: 03+1] mw-jobrunner: increase replicas for parsoidCachePrewarm [deployment-charts] - 10https://gerrit.wikimedia.org/r/989488 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [12:21:02] (03CR) 10Clément Goubert: [C: 03+1] changeprop-jobqueue: migrate parsoidCachePrewarm to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/989489 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [12:21:28] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: druid1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1002" [12:21:28] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:21:28] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts druid1005.eqiad.wmnet [12:22:26] !log stevemunene@cumin1002 START - Cookbook sre.hosts.decommission for hosts druid1006.eqiad.wmnet [12:27:35] (03CR) 10Btullis: [V: 03+1 C: 03+2] Allow deep merging of hadoop config overrides [puppet] - 10https://gerrit.wikimedia.org/r/989490 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis) [12:34:22] (03CR) 10Hnowlan: [C: 03+2] mw-jobrunner: increase replicas for parsoidCachePrewarm [deployment-charts] - 10https://gerrit.wikimedia.org/r/989488 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [12:35:15] (03Merged) 10jenkins-bot: mw-jobrunner: increase replicas for parsoidCachePrewarm [deployment-charts] - 10https://gerrit.wikimedia.org/r/989488 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [12:35:51] !log stevemunene@cumin1002 START - Cookbook sre.dns.netbox [12:37:00] !log hnowlan@deploy2002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [12:37:07] !log hnowlan@deploy2002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [12:37:34] !log 
hnowlan@deploy2002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [12:37:39] !log hnowlan@deploy2002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [12:37:54] !log stevemunene@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: druid1006.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1002" [12:38:10] (03PS2) 10Stevemunene: Remove puppet references for druid1004_6 [puppet] - 10https://gerrit.wikimedia.org/r/989461 (https://phabricator.wikimedia.org/T336043) [12:38:59] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: druid1006.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1002" [12:39:00] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:39:00] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts druid1006.eqiad.wmnet [12:41:49] (03CR) 10Peter Fischer: [C: 03+2] Search update pipeline: 4th batch page_rerender [deployment-charts] - 10https://gerrit.wikimedia.org/r/989443 (https://phabricator.wikimedia.org/T351503) (owner: 10Peter Fischer) [12:42:37] (03Merged) 10jenkins-bot: Search update pipeline: 4th batch page_rerender [deployment-charts] - 10https://gerrit.wikimedia.org/r/989443 (https://phabricator.wikimedia.org/T351503) (owner: 10Peter Fischer) [12:47:27] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade. [12:48:24] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.roll-restart-workers (exit_code=99) restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade. 
[12:57:56] (03CR) 10Muehlenhoff: [C: 03+2] swift: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984516 (owner: 10Muehlenhoff) [13:01:02] (03PS1) 10Clément Goubert: docker-report: Exclude more images [puppet] - 10https://gerrit.wikimedia.org/r/989493 [13:05:16] PROBLEM - PHP opcache health on mw2446 is CRITICAL: CRITICAL: opcache full on php 7.4. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:09:03] (03PS2) 10Muehlenhoff: rsync: Remove support for auto_ferm and rename auto_nft [puppet] - 10https://gerrit.wikimedia.org/r/989444 [13:13:37] (03PS3) 10Muehlenhoff: rsync: Remove support for auto_ferm and rename auto_nft [puppet] - 10https://gerrit.wikimedia.org/r/989444 [13:15:35] !log test prometheus 2.48.1 on prometheus1005 - T354399 [13:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:39] T354399: Prometheus @ k8s OOM loop - https://phabricator.wikimedia.org/T354399 [13:23:14] (03PS1) 10Andrew Bogott: designate alert: fix datasource [alerts] - 10https://gerrit.wikimedia.org/r/989494 (https://phabricator.wikimedia.org/T354365) [13:26:27] (03CR) 10Andrew Bogott: [C: 03+2] designate alert: fix datasource [alerts] - 10https://gerrit.wikimedia.org/r/989494 (https://phabricator.wikimedia.org/T354365) (owner: 10Andrew Bogott) [13:28:43] (03PS2) 10Klausman: profile::thanos: Add two more latency buckets to recording rule [puppet] - 10https://gerrit.wikimedia.org/r/989458 [13:29:31] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989444 (owner: 10Muehlenhoff) [13:30:34] (03PS3) 10Klausman: profile::thanos: Remove latency histo bucket filter [puppet] - 10https://gerrit.wikimedia.org/r/989458 [13:31:33] (03PS4) 10Klausman: profile::thanos: Remove latency histo bucket filter for istio RR [puppet] - 10https://gerrit.wikimedia.org/r/989458 [13:34:00] (03PS1) 10Majavah: P:toolforge::prometheus: add pint alert linter 
[puppet] - 10https://gerrit.wikimedia.org/r/989496 (https://phabricator.wikimedia.org/T354760) [13:34:11] (ProbeDown) firing: (4) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:35:42] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1066/co" [puppet] - 10https://gerrit.wikimedia.org/r/989496 (https://phabricator.wikimedia.org/T354760) (owner: 10Majavah) [13:36:09] (03CR) 10David Caro: [C: 03+1] "LGTM \o/" [puppet] - 10https://gerrit.wikimedia.org/r/989496 (https://phabricator.wikimedia.org/T354760) (owner: 10Majavah) [13:37:10] PROBLEM - Disk space on vrts1002 is CRITICAL: DISK CRITICAL - /var/spool/exim4/db is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=vrts1002&var-datasource=eqiad+prometheus/ops [13:38:35] (03PS1) 10Andrew Bogott: novafullstack: change datasource to 'ops' [alerts] - 10https://gerrit.wikimedia.org/r/989498 (https://phabricator.wikimedia.org/T351698) [13:41:38] (03PS2) 10Majavah: P:toolforge::prometheus: add pint alert linter [puppet] - 10https://gerrit.wikimedia.org/r/989496 (https://phabricator.wikimedia.org/T354760) [13:42:09] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: migrate parsoidCachePrewarm to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/989489 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [13:42:34] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1067/co" [puppet] - 10https://gerrit.wikimedia.org/r/989496 
(https://phabricator.wikimedia.org/T354760) (owner: 10Majavah) [13:43:01] (03Merged) 10jenkins-bot: changeprop-jobqueue: migrate parsoidCachePrewarm to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/989489 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [13:45:28] PROBLEM - Disk space on vrts1001 is CRITICAL: DISK CRITICAL - /var/spool/exim4/db is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=vrts1001&var-datasource=eqiad+prometheus/ops [13:48:35] (03PS1) 10JMeybohm: prometheus::k8s: Fix labeldrop actions [puppet] - 10https://gerrit.wikimedia.org/r/989500 (https://phabricator.wikimedia.org/T354604) [13:49:04] (03PS2) 10JMeybohm: prometheus::k8s: Fix labeldrop actions [puppet] - 10https://gerrit.wikimedia.org/r/989500 (https://phabricator.wikimedia.org/T354604) [13:49:10] (03CR) 10Volans: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/977223 (https://phabricator.wikimedia.org/T345152) (owner: 10Hashar) [13:49:25] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:toolforge::prometheus: add pint alert linter [puppet] - 10https://gerrit.wikimedia.org/r/989496 (https://phabricator.wikimedia.org/T354760) (owner: 10Majavah) [13:51:59] (03CR) 10Ayounsi: k8s topology labels: add row to rack transition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi) [13:53:29] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1068/co" [puppet] - 10https://gerrit.wikimedia.org/r/989500 (https://phabricator.wikimedia.org/T354604) (owner: 10JMeybohm) [13:53:42] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::k8s: Fix labeldrop actions [puppet] - 10https://gerrit.wikimedia.org/r/989500 (https://phabricator.wikimedia.org/T354604) (owner: 10JMeybohm) [13:54:15] !log 
hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [13:54:39] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [13:55:30] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus::k8s: Fix labeldrop actions [puppet] - 10https://gerrit.wikimedia.org/r/989500 (https://phabricator.wikimedia.org/T354604) (owner: 10JMeybohm) [13:55:36] (03PS1) 10Andrew Bogott: designate: fix alert wording [alerts] - 10https://gerrit.wikimedia.org/r/989501 (https://phabricator.wikimedia.org/T354365) [13:56:04] (03CR) 10Volans: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/977223 (https://phabricator.wikimedia.org/T345152) (owner: 10Hashar) [13:56:26] (03CR) 10Andrew Bogott: [C: 03+2] novafullstack: change datasource to 'ops' [alerts] - 10https://gerrit.wikimedia.org/r/989498 (https://phabricator.wikimedia.org/T351698) (owner: 10Andrew Bogott) [13:56:35] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [13:56:59] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [13:57:44] RECOVERY - Disk space on vrts1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=vrts1002&var-datasource=eqiad+prometheus/ops [13:58:17] (03Merged) 10jenkins-bot: novafullstack: change datasource to 'ops' [alerts] - 10https://gerrit.wikimedia.org/r/989498 (https://phabricator.wikimedia.org/T351698) (owner: 10Andrew Bogott) [13:58:20] (03CR) 10Andrew Bogott: [C: 03+2] designate: fix alert wording [alerts] - 10https://gerrit.wikimedia.org/r/989501 (https://phabricator.wikimedia.org/T354365) (owner: 10Andrew Bogott) [13:58:41] 10ops-eqiad: InterfaceSpeedError - https://phabricator.wikimedia.org/T354765 (10phaultfinder) [13:58:53] (03CR) 10Cathal Mooney: [C: 03+2] Add BGP session between mr1-codfw and lsw1-a2-codfw [homer/public] - 
10https://gerrit.wikimedia.org/r/989224 (https://phabricator.wikimedia.org/T348164) (owner: 10Cathal Mooney) [13:59:08] (03CR) 10Hnowlan: [C: 03+1] docker-report: Exclude more images [puppet] - 10https://gerrit.wikimedia.org/r/989493 (owner: 10Clément Goubert) [13:59:34] (03Merged) 10jenkins-bot: designate: fix alert wording [alerts] - 10https://gerrit.wikimedia.org/r/989501 (https://phabricator.wikimedia.org/T354365) (owner: 10Andrew Bogott) [14:00:00] (03Merged) 10jenkins-bot: Add BGP session between mr1-codfw and lsw1-a2-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/989224 (https://phabricator.wikimedia.org/T348164) (owner: 10Cathal Mooney) [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240110T1400). [14:00:04] No Gerrit patches in the queue for this window AFAICS. [14:00:46] yup, nothing to do it seems [14:03:08] (03CR) 10Dzahn: [V: 03+1 C: 03+1] phabricator: avoid duplicate list of servers in Hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/989240 (https://phabricator.wikimedia.org/T354221) (owner: 10Dzahn) [14:03:32] !log Switching operations-puppet-tests-buster-docker Jenkins job from tox v3 to tox v4 | T345152 [14:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:35] T345152: [ci,operations-puppet] upgrade to tox 4 in order to detect changed requirement files - https://phabricator.wikimedia.org/T345152 [14:04:02] !log installing openblas bugfix updates [14:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:48] (03CR) 10Dzahn: contint: use php7.4 on bullseye just like on buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/987458 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [14:06:00] RECOVERY - Disk space on vrts1001 is OK: DISK OK 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=vrts1001&var-datasource=eqiad+prometheus/ops [14:08:18] (03CR) 10Dzahn: contint: use php7.4 on bullseye just like on buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/987458 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [14:12:09] 10SRE, 10Infrastructure-Foundations, 10Traffic: Serve an HTTP response for measurement domains directly from Varnish - https://phabricator.wikimedia.org/T332028 (10JameelKaisar) 05Open→03Resolved [14:12:13] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10JameelKaisar) [14:13:32] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10JameelKaisar) [14:14:08] (03PS1) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/989503 [14:15:12] 10SRE-swift-storage: Create swift account for netbox-next - https://phabricator.wikimedia.org/T354766 (10ayounsi) [14:15:28] (03PS2) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/989503 [14:15:50] 10SRE-swift-storage: Create swift account for netbox-next - https://phabricator.wikimedia.org/T354766 (10ayounsi) [14:16:34] 10SRE, 10Infrastructure-Foundations: compare Probenet data w/ NEL data - https://phabricator.wikimedia.org/T337317 (10JameelKaisar) 05Open→03Resolved [14:16:41] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10JameelKaisar) [14:18:37] 10SRE, 10Traffic: "Our servers are currently under maintenance" page shown on HTTP 429 - https://phabricator.wikimedia.org/T354718 (10MatthewVernon) [14:19:04] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - 
https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [14:19:52] (03PS3) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/989503 (https://phabricator.wikimedia.org/T350502) [14:20:02] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/989503 (https://phabricator.wikimedia.org/T350502) (owner: 10Kosta Harlan) [14:20:43] 10SRE, 10Traffic: "Our servers are currently under maintenance" page shown on HTTP 429 - https://phabricator.wikimedia.org/T354718 (10Vgutierrez) p:05Triage→03Medium [14:20:53] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/989503 (https://phabricator.wikimedia.org/T350502) (owner: 10Kosta Harlan) [14:21:36] !log kharlan@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [14:21:40] !log installing lapack bugfix updates [14:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:46] 10SRE, 10ops-codfw: cr2-codfw:FPC0 failure - https://phabricator.wikimedia.org/T354732 (10Papaul) @ayounsi will do [14:22:05] (03CR) 10Clément Goubert: [C: 03+2] docker-report: Exclude more images [puppet] - 10https://gerrit.wikimedia.org/r/989493 (owner: 10Clément Goubert) [14:22:09] (03PS4) 10Kamila Součková: kmod::blacklist: Allow also rmmoding modules [puppet] - 10https://gerrit.wikimedia.org/r/989460 (https://phabricator.wikimedia.org/T354413) (owner: 10Alexandros Kosiaris) [14:22:10] !log kharlan@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [14:22:16] (03CR) 10Dzahn: [C: 03+2] "ack, the class sets umask 002 as default when shared is true" [puppet] - 10https://gerrit.wikimedia.org/r/989463 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [14:23:03] 10SRE-swift-storage: Create swift account for netbox-next - https://phabricator.wikimedia.org/T354766 (10MatthewVernon) Hi! I can certainly create you another swift account. 
Naming things is hard, but are you //sure// you want netbox-next rather than, say, netbox-dev? To me, netbox-next sounds like an account y... [14:24:57] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [14:25:00] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [14:25:46] (03Abandoned) 10Dzahn: puppet: add quota module to vendor_modules [puppet] - 10https://gerrit.wikimedia.org/r/987491 (https://phabricator.wikimedia.org/T343364) (owner: 10Dzahn) [14:26:00] (03CR) 10Kamila Součková: [C: 03+2] kmod::blacklist: Allow also rmmoding modules [puppet] - 10https://gerrit.wikimedia.org/r/989460 (https://phabricator.wikimedia.org/T354413) (owner: 10Alexandros Kosiaris) [14:26:16] 10SRE-swift-storage: Create swift account for netbox-next - https://phabricator.wikimedia.org/T354766 (10ayounsi) Indeed naming is hard. netbox-next runs on netbox-dev2002.codfw.wmnet. Which was initially made to test future upgrades (kind of what I'm doing here) but ended up being a "test all" server. That said... 
[14:26:26] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:26:33] (CR) Kamila Součková: [C: +2] kubernetes::node: Blacklist wdat_wdt kernel module [puppet] - https://gerrit.wikimedia.org/r/989455 (https://phabricator.wikimedia.org/T354413) (owner: Alexandros Kosiaris)
[14:26:46] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on mw1378.eqiad.wmnet with reason: Trying to reproduce wdat_wdt watchdog problem
[14:26:47] !log kharlan@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply
[14:26:56] (PS5) Kamila Součková: kubernetes::node: Blacklist wdat_wdt kernel module [puppet] - https://gerrit.wikimedia.org/r/989455 (https://phabricator.wikimedia.org/T354413) (owner: Alexandros Kosiaris)
[14:27:01] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mw1378.eqiad.wmnet with reason: Trying to reproduce wdat_wdt watchdog problem
[14:27:01] !log kharlan@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[14:27:08] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: lvs::balancer
[14:30:29] (PS1) Muehlenhoff: Switch LVS to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/989528 (https://phabricator.wikimedia.org/T349619)
[14:32:12] (CR) Muehlenhoff: [C: +2] Switch LVS to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/989528 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[14:32:49] kamila_: I'll puppet-merge your kmod patch along
[14:33:01] sure, thanks
[14:33:11] there's a second one incoming too, I was just waiting for CI
[14:34:42] (CR) Ssingh: "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/989528 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[14:35:41] (CR) Hnowlan: [C: +1] service.yaml: add iPoid to the service catalogue [puppet] - https://gerrit.wikimedia.org/r/928487 (https://phabricator.wikimedia.org/T325147) (owner: Effie Mouzeli)
[14:35:43] (CR) Kamila Součková: [C: +1] service.yaml: add iPoid to the service catalogue [puppet] - https://gerrit.wikimedia.org/r/928487 (https://phabricator.wikimedia.org/T325147) (owner: Effie Mouzeli)
[14:36:24] (PS1) MVernon: hiera: add new netboxdev:attachments user [puppet] - https://gerrit.wikimedia.org/r/989529 (https://phabricator.wikimedia.org/T354766)
[14:38:20] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1380.eqiad.wmnet with OS bullseye
[14:39:11] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:13] (PS1) MVernon: hiera: add fake swift passwords for netbox_dev user [labs/private] - https://gerrit.wikimedia.org/r/989531 (https://phabricator.wikimedia.org/T354766)
[14:39:38] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on mw1349.eqiad.wmnet with reason: Trying to reproduce wdat_wdt watchdog problem
[14:39:52] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mw1349.eqiad.wmnet with reason: Trying to reproduce wdat_wdt watchdog problem
[14:43:32] (CR) MVernon: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/989248 (https://phabricator.wikimedia.org/T352468) (owner: Eevans)
[14:44:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: lvs::balancer
[14:44:47] (PS5) Dzahn: phabricator: avoid duplicate lists of servers in Hiera [puppet] - https://gerrit.wikimedia.org/r/989240 (https://phabricator.wikimedia.org/T354221)
[14:45:33] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[14:49:05] (CR) Majavah: [C: +2] Move dbproxy1018/9 to insetup [puppet] - https://gerrit.wikimedia.org/r/988681 (https://phabricator.wikimedia.org/T346947) (owner: Majavah)
[14:51:48] (PS5) Dzahn: phabricator: avoid duplicate list of servers for dumps [puppet] - https://gerrit.wikimedia.org/r/988112 (https://phabricator.wikimedia.org/T354221)
[14:51:52] SRE-swift-storage, Patch-For-Review: Create swift account for netbox-next - https://phabricator.wikimedia.org/T354766 (MatthewVernon) >>! In T354766#9450359, @ayounsi wrote: > Usage will indeed be light, most likely a few cats pictures. I've put in CRs to create the account; once they have +1 I'll also...
[14:52:52] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1380.eqiad.wmnet with reason: host reimage
[14:52:54] !log taavi@cumin1002 START - Cookbook sre.hosts.decommission for hosts dbproxy[1018-1019].eqiad.wmnet
[14:52:57] (CR) Dzahn: phabricator: avoid duplicate lists of servers in Hiera (1 comment) [puppet] - https://gerrit.wikimedia.org/r/989240 (https://phabricator.wikimedia.org/T354221) (owner: Dzahn)
[14:54:29] !log adding vlans to ssw1-a8-codfw to trunk to lvs2014 T352758
[14:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:36] T352758: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758
[14:55:05] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: logging::opensearch::collector
[14:57:53] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1380.eqiad.wmnet with reason: host reimage
[14:59:12] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240110T1500)
[15:00:16] (CR) Dzahn: "@Jelto well, full disclosure is I had also kept it that way because then it would match more "releases_server" and "releases_servers_failo" [puppet] - https://gerrit.wikimedia.org/r/989240 (https://phabricator.wikimedia.org/T354221) (owner: Dzahn)
[15:00:35] PROBLEM - Check systemd state on ml-staging2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_amd_rocm_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:00:44] !log taavi@cumin1002 START - Cookbook sre.dns.netbox
[15:00:51] (PS1) Muehlenhoff: Switch logging::opensearch::collector to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/989533 (https://phabricator.wikimedia.org/T349619)
[15:01:35] RECOVERY - Check systemd state on ml-staging2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:01:52] !log disable puppet and stop pybal on lvs2014: T352758
[15:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:58] T352758: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758
[15:01:58] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-staging2001.codfw.wmnet
[15:03:07] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1071/console" [puppet] - https://gerrit.wikimedia.org/r/989240 (https://phabricator.wikimedia.org/T354221) (owner: Dzahn)
[15:03:29] SRE, ops-codfw, ops-eqiad, Infrastructure-Foundations: Repurpose three decom servers as temporary ganeti-test1001/1002 and ganeti-test2004 - https://phabricator.wikimedia.org/T345602 (Jhancock.wm)
[15:03:42] SRE, ops-codfw, decommission-hardware: decommission ganeti-test2004 - https://phabricator.wikimedia.org/T354681 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[15:03:59] SRE, Traffic, Patch-For-Review: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (CodeReviewBot) fabfur opened https://gitlab.wikimedia.org/repos/sre/purged/-/merge_requests/7 add first code draft to manage eventual kafka errors
[15:03:59] !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy[1018-1019].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - taavi@cumin1002"
[15:04:07] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on lvs2014.codfw.wmnet with reason: T352758
[15:04:15] (PS6) Dzahn: phabricator: avoid duplicate lists of servers in Hiera [puppet] - https://gerrit.wikimedia.org/r/989240 (https://phabricator.wikimedia.org/T354221)
[15:04:23] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on lvs2014.codfw.wmnet with reason: T352758
[15:04:34] (CR) Dzahn: "PS6 was only changes to the commit message, because I saw you were already compiling :) thanks!" [puppet] - https://gerrit.wikimedia.org/r/989240 (https://phabricator.wikimedia.org/T354221) (owner: Dzahn)
[15:05:07] (PS1) Ssingh: depool codfw: do not merge! emergency depool patch [dns] - https://gerrit.wikimedia.org/r/989534 (https://phabricator.wikimedia.org/T352758)
[15:06:03] (CR) Dzahn: [V: +1] "https://puppet-compiler.wmflabs.org/output/989240/1072/phab1004.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/989240 (https://phabricator.wikimedia.org/T354221) (owner: Dzahn)
[15:06:07] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:06:32] !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy[1018-1019].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - taavi@cumin1002"
[15:06:32] !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:06:33] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dbproxy[1018-1019].eqiad.wmnet
[15:08:12] (PS1) Dzahn: switch phabricator server to codfw [dns] - https://gerrit.wikimedia.org/r/989535
[15:08:14] (KubernetesCalicoDown) firing: (3) ml-staging-ctrl2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[15:08:14] (CalicoKubeControllersDown) firing: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[15:09:18] (CR) Jelto: [C: +1] "lgtm, thanks for fixing the naming :)" [puppet] - https://gerrit.wikimedia.org/r/989240 (https://phabricator.wikimedia.org/T354221) (owner: Dzahn)
[15:11:38] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[15:11:50] (CR) Muehlenhoff: [C: +2] Switch logging::opensearch::collector to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/989533 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[15:12:57] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on an-master[1003-1004].eqiad.wmnet with reason: Bringing new nameservers into service
[15:13:05] !log klausman@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ml-staging2001.codfw.wmnet
[15:13:14] (KubernetesCalicoDown) resolved: (3) ml-staging-ctrl2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[15:13:14] (CalicoKubeControllersDown) resolved: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[15:13:24] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-master[1003-1004].eqiad.wmnet with reason: Bringing new nameservers into service
[15:13:36] sre-alert-triage, SRE Observability (FY2023/2024-Q3): Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T354255 (lmata)
[15:14:27] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-staging2001.codfw.wmnet
[15:16:17] (PS1) Dzahn: mariadb: add mysql grants for phab2002 [puppet] - https://gerrit.wikimedia.org/r/989536
[15:17:32] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1380.eqiad.wmnet with OS bullseye
[15:19:51] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1381.eqiad.wmnet with OS bullseye
[15:20:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: logging::opensearch::collector
[15:20:35] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1382.eqiad.wmnet with OS bullseye
[15:21:07] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1383.eqiad.wmnet with OS bullseye
[15:21:39] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:21:49] ^ this is expected
[15:21:52] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase2013.codfw.wmnet with reason: Decommissioning — T352469
[15:21:55] T352469: Decommission restbase20[13-20]) - https://phabricator.wikimedia.org/T352469
[15:21:58] thanks sukhe
[15:22:06] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase2013.codfw.wmnet with reason: Decommissioning — T352469
[15:22:59] (PS3) Effie Mouzeli: modules/app: update to job 1.1.0 (vanila) [deployment-charts] - https://gerrit.wikimedia.org/r/980847
[15:23:12] (CR) Eevans: [C: +2] restbase: configure new hosts for partition reuse [puppet] - https://gerrit.wikimedia.org/r/989248 (https://phabricator.wikimedia.org/T352468) (owner: Eevans)
[15:23:22] (PS4) Effie Mouzeli: modules/app: update to job 1.1.0 [deployment-charts] - https://gerrit.wikimedia.org/r/980852
[15:24:25] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2001.codfw.wmnet
[15:24:53] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: logging::opensearch::data
[15:25:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[15:26:08] (PS1) Dzahn: phabricator: use same db server regardless of DC of phab server [puppet] - https://gerrit.wikimedia.org/r/989537
[15:26:39] (PS4) Btullis: Bring an-master1004 into service as a hadoop::standby [puppet] - https://gerrit.wikimedia.org/r/989214 (https://phabricator.wikimedia.org/T332573)
[15:27:29] (PS1) Muehlenhoff: Switch logging::opensearch::data to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/989538 (https://phabricator.wikimedia.org/T349619)
[15:27:33] (CR) Dzahn: "at this point it's for discussion, no merge yet" [puppet] - https://gerrit.wikimedia.org/r/989537 (owner: Dzahn)
[15:29:13] (PS2) Vgutierrez: lvs::realserver::ipip: Report errors on MSS monitoring [puppet] - https://gerrit.wikimedia.org/r/989459 (https://phabricator.wikimedia.org/T354721)
[15:29:24] (CR) Dzahn: [C: -1] "planned for Jan 20th" [dns] - https://gerrit.wikimedia.org/r/989535 (owner: Dzahn)
[15:29:28] (CR) Muehlenhoff: [C: +2] Switch logging::opensearch::data to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/989538 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[15:30:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[15:31:20] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[15:34:33] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1381.eqiad.wmnet with reason: host reimage
[15:35:09] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1382.eqiad.wmnet with reason: host reimage
[15:35:33] (PS1) Dzahn: phabricator: avoid duplicate lists of servers in migration class [puppet] - https://gerrit.wikimedia.org/r/989540 (https://phabricator.wikimedia.org/T354221)
[15:35:59] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1383.eqiad.wmnet with reason: host reimage
[15:36:08] (PS3) Clément Goubert: prometheus-php-fpm-exporter [docker-images/production-images] - https://gerrit.wikimedia.org/r/987440
[15:37:13] (PS2) Majavah: mariadb: remove grants and firewall rules for dbproxy1018/9 [puppet] - https://gerrit.wikimedia.org/r/989088 (https://phabricator.wikimedia.org/T346947)
[15:37:15] (PS1) Majavah: site: remove dbproxy1018/9 [puppet] - https://gerrit.wikimedia.org/r/989541 (https://phabricator.wikimedia.org/T346947)
[15:37:17] (PS1) Majavah: P:wmcs::db: querysampler: cleanup [puppet] - https://gerrit.wikimedia.org/r/989542
[15:37:31] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1382.eqiad.wmnet with reason: host reimage
[15:37:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1381.eqiad.wmnet with reason: host reimage
[15:38:35] (CR) Giuseppe Lavagetto: [C: -1] "Overall looks good, there's a couple things I'd improve before going live" [puppet] - https://gerrit.wikimedia.org/r/988851 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)
[15:38:56] (CR) Majavah: [C: +2] site: remove dbproxy1018/9 [puppet] - https://gerrit.wikimedia.org/r/989541 (https://phabricator.wikimedia.org/T346947) (owner: Majavah)
[15:39:17] (CR) CI reject: [V: -1] P:wmcs::db: querysampler: cleanup [puppet] - https://gerrit.wikimedia.org/r/989542 (owner: Majavah)
[15:40:08] (CR) David Caro: [C: +1] "LGTM, you can cherry-pick it in the tools puppetmaster if you want before merging" [puppet] - https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) (owner: FNegri)
[15:40:10] (PS2) Majavah: P:wmcs::db: querysampler: cleanup [puppet] - https://gerrit.wikimedia.org/r/989542
[15:40:38] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1383.eqiad.wmnet with reason: host reimage
[15:40:49] (PS13) Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817)
[15:40:51] (CR) Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (6 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: Ottomata)
[15:41:09] (CR) Majavah: [C: +2] mariadb: remove grants and firewall rules for dbproxy1018/9 [puppet] - https://gerrit.wikimedia.org/r/989088 (https://phabricator.wikimedia.org/T346947) (owner: Majavah)
[15:41:11] (CR) CI reject: [V: -1] Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: Ottomata)
[15:41:19] (PS3) Clément Goubert: prometheus-apache-exporter: Update to bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/987443 (https://phabricator.wikimedia.org/T283861)
[15:41:30] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1379.eqiad.wmnet with OS bullseye
[15:41:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: logging::opensearch::data
[15:41:51] (PS4) Clément Goubert: prometheus-php-fpm-exporter [docker-images/production-images] - https://gerrit.wikimedia.org/r/987440
[15:42:09] (PS4) Clément Goubert: prometheus-apache-exporter: Update to bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/987443 (https://phabricator.wikimedia.org/T283861)
[15:42:24] (CR) Dzahn: [V: +1 C: +2] phabricator: avoid duplicate lists of servers in Hiera [puppet] - https://gerrit.wikimedia.org/r/989240 (https://phabricator.wikimedia.org/T354221) (owner: Dzahn)
[15:43:05] PROBLEM - Check size of conntrack table on mw1382 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.224: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[15:44:18] ^ expected I guess as per SAL
[15:44:28] just checked and it's mw1382
[15:44:42] reimage is 1379 but seems close enough to be likely?
[15:44:54] mutante: this one I meant https://sal.toolforge.org/log/Js8F9IwBxE1_1c7shRgP
[15:45:12] sukhe: I see, but it ended with FAIL.. hmmm
[15:45:18] guess that's why
[15:45:33] kamila_: you aware that reimage failed on that one?
[15:46:19] PROBLEM - Check size of conntrack table on mw1382 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.224: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[15:46:23] PROBLEM - Check systemd state on mw1382 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.224. Check system logs on 10.64.48.224 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:46:23] I checked it's pooled = inactive
[15:46:25] * kamila_ looking
[15:46:34] ack, thanks
[15:46:37] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1382 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.224. Check system logs on 10.64.48.224 https://wikitech.wikimedia.org/wiki/NTP
[15:47:15] RECOVERY - Check size of conntrack table on mw1382 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[15:47:27] RECOVERY - Check systemd state on mw1382 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:47:54] (CR) Majavah: [C: +1] cloud-vps puppet encapi: use project_id instead of project_name for keystone [puppet] - https://gerrit.wikimedia.org/r/988051 (https://phabricator.wikimedia.org/T343158) (owner: Andrew Bogott)
[15:48:30] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[15:48:38] maybe the failure was to set the downtime because otherwise I would expect those to be in downtime while reimage runs
[15:48:41] I think it should be okay (it says failed but it's fine)
[15:48:52] yes, it was just the downtime, I believe the rest went okay
[15:48:55] like the cookbook would normally do that
[15:48:58] yes
[15:49:00] ack, thanks kamila_
[15:49:04] the hosts are in a funky state
[15:49:10] but I'm trying to get them out of the funky state now :D
[15:49:36] alright:)
[15:49:42] (CR) Giuseppe Lavagetto: "I would suggest to remove the mwscript.enabled condition from monitoring containers, and instead rely on values.monitoring.enabled set to " [deployment-charts] - https://gerrit.wikimedia.org/r/988849 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)
[15:50:13] RECOVERY - Juniper alarms on cr2-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[15:50:16] not sure why the ones that finished reimaging are alerting though, I'll keep an eye out
[15:51:52] (PS1) Brouberol: spark-history: add spark logging/env configmaps [deployment-charts] - https://gerrit.wikimedia.org/r/989545
[15:52:21] kamila_: once it was a bug that sometimes setting the downtimes failed (https://phabricator.wikimedia.org/T239897) could be that or similar
[15:52:25] PROBLEM - Host mw1382 is DOWN: PING CRITICAL - Packet loss = 100%
[15:53:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[15:53:26] mutante: yeah, might be something like that, but I'm really not sure why
[15:53:49] aaaaah
[15:54:03] RECOVERY - Host mw1382 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[15:54:11] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[15:54:22] kamila_: maybe take a look if it has the "spicerack.remote.RemoteExecutionError: Cumin execution failed" in the cookbook output
[15:55:04] SRE, Infrastructure-Foundations, Prod-Kubernetes, netops, and 2 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (Clement_Goubert) Summary of the discussion on the linked CR: - LLDP based logic runs the risk o...
[15:55:07] yes, it does
[15:55:10] (CR) Clément Goubert: k8s topology labels: add row to rack transition (1 comment) [puppet] - https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: Ayounsi)
[15:55:11] but I don't know why it's failing
[15:55:29] (PS5) Effie Mouzeli: modules/app: update to job 1.1.0 [deployment-charts] - https://gerrit.wikimedia.org/r/980852
[15:56:39] (CR) CI reject: [V: -1] modules/app: update to job 1.1.0 [deployment-charts] - https://gerrit.wikimedia.org/r/980852 (owner: Effie Mouzeli)
[15:57:27] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1382 is OK: OK: synced at Wed 2024-01-10 15:57:25 UTC. https://wikitech.wikimedia.org/wiki/NTP
[15:57:32] kamila_: you can check more verbose cookbook logs under /var/log/spicerack on the cumin host
[15:57:36] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1379.eqiad.wmnet with reason: host reimage
[15:57:38] SRE, Infrastructure-Foundations, Prod-Kubernetes, netops, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (JMeybohm)
[15:57:45] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1382.eqiad.wmnet with OS bullseye
[15:58:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[15:59:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1381.eqiad.wmnet with OS bullseye
[16:00:14] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1383.eqiad.wmnet with OS bullseye
[16:00:18] SRE, Infrastructure-Foundations, Prod-Kubernetes, netops, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (akosiaris) >>! In T352893#9450792, @Clement_Goubert wrote: > Summary of the discussion on the l...
[16:01:37] claime: thanks, good to know
[16:01:54] I'm not sure it's worth it for these hosts though, given that they were in a weird state already
[16:02:13] I'll look into it (and or point vo.lans at it) if it keeps happening for hosts that aren't known weird
[16:02:42] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1379.eqiad.wmnet with reason: host reimage
[16:02:43] (they did actually reimage fine, just the downtime didn't work for some reason)
[16:06:55] SRE, CirrusSearch, Discovery-Search, serviceops, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (pfischer) For reference, here's a screenshot of more kafka metrics around enabling compaction: {F41...
[16:07:20] (CR) Dzahn: [V: +1 C: +2] "noop confirmed on phab2002 -> aphlict1002 -> phab1004. thanks for the review" [puppet] - https://gerrit.wikimedia.org/r/989240 (https://phabricator.wikimedia.org/T354221) (owner: Dzahn)
[16:08:30] (PS2) Brouberol: spark-history: add spark logging/env configmaps [deployment-charts] - https://gerrit.wikimedia.org/r/989545
[16:09:25] SRE, Infrastructure-Foundations, Prod-Kubernetes, netops, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (cmooney) >>! In T352893#9450792, @Clement_Goubert wrote: > I am left wondering if the fear of L...
[16:10:57] (PS1) Muehlenhoff: Remove obsolete Hiera entries [puppet] - https://gerrit.wikimedia.org/r/989549 (https://phabricator.wikimedia.org/T349619)
[16:13:29] (CR) Btullis: [C: +1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/984132 (https://phabricator.wikimedia.org/T352849) (owner: Brouberol)
[16:14:14] (CR) Ssingh: [C: +1] "Thanks for the patch, LGTM!" [puppet] - https://gerrit.wikimedia.org/r/989549 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[16:14:49] (CR) Btullis: [C: +1] "Looks good to me." [deployment-charts] - https://gerrit.wikimedia.org/r/989545 (owner: Brouberol)
[16:15:26] (PS3) Brouberol: spark-history: add spark logging/env configmaps [deployment-charts] - https://gerrit.wikimedia.org/r/989545 (https://phabricator.wikimedia.org/T354777)
[16:16:53] SRE, ops-codfw, Infrastructure-Foundations, Traffic, and 2 others: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758 (Papaul) @cmooney link moved to ssw1-a8
[16:18:23] SRE, ops-eqiad, decommission-hardware: decommission wdqs10[09-10].eqiad.wmnet - https://phabricator.wikimedia.org/T353482 (VRiley-WMF) a:Jclark-ctr→VRiley-WMF
[16:18:25] (CR) Brouberol: [C: +2] spark-history: add spark logging/env configmaps [deployment-charts] - https://gerrit.wikimedia.org/r/989545 (https://phabricator.wikimedia.org/T354777) (owner: Brouberol)
[16:19:00] SRE, ops-eqiad, decommission-hardware: decommission wdqs10[09-10].eqiad.wmnet - https://phabricator.wikimedia.org/T353482 (VRiley-WMF) These have been removed and scripts ran to decomm them.
[16:19:12] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:19:16] SRE, ops-eqiad, decommission-hardware: decommission wdqs10[09-10].eqiad.wmnet - https://phabricator.wikimedia.org/T353482 (VRiley-WMF) Open→Resolved
[16:20:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply
[16:20:55] SRE, ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T353913 (VRiley-WMF) Rebalanced power. Completing ticket.
[16:21:07] SRE, ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T353913 (VRiley-WMF) Open→Resolved
[16:22:20] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1379.eqiad.wmnet with OS bullseye
[16:22:58] (PS2) Dzahn: phabricator: avoid duplicate lists of servers in migration class [puppet] - https://gerrit.wikimedia.org/r/989540 (https://phabricator.wikimedia.org/T354221)
[16:25:29] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on mw[1379-1383].eqiad.wmnet with reason: testing reboot
[16:25:46] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mw[1379-1383].eqiad.wmnet with reason: testing reboot
[16:26:25] (PS1) Ladsgroup: mariadb: Change nagios user identfication to be unix_socket [puppet] - https://gerrit.wikimedia.org/r/989556
[16:28:47] SRE, Infrastructure-Foundations, Prod-Kubernetes, netops, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (Volans) I might be missing context, but why we can't get that info from netbox? Extracting it d...
[16:29:11] (JobUnavailable) firing: (4) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:29:39] (PS3) Dzahn: phabricator: avoid duplicate lists of servers in migration class [puppet] - https://gerrit.wikimedia.org/r/989540 (https://phabricator.wikimedia.org/T354221)
[16:29:44] SRE, ops-eqiad, decommission-hardware: decommission ganeti-test1001, ganeti-test1002 - https://phabricator.wikimedia.org/T354680 (VRiley-WMF) a:VRiley-WMF
[16:29:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply
[16:30:20] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply
[16:31:03] (PS4) Dzahn: phabricator: avoid duplicate lists of servers in migration class [puppet] - https://gerrit.wikimedia.org/r/989540 (https://phabricator.wikimedia.org/T354221)
[16:31:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply
[16:31:31] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply
[16:32:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply
[16:32:37] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply
[16:33:39] (CR) Dzahn: [C: +2] "this is only used when a phab server is replaced with new hardware, so noop now, but cleaning up the lists of host names" [puppet] - https://gerrit.wikimedia.org/r/989540 (https://phabricator.wikimedia.org/T354221) (owner: Dzahn)
[16:33:51] (PS5) Btullis: Bring an-master1004 into service as a hadoop::standby [puppet] - https://gerrit.wikimedia.org/r/989214 (https://phabricator.wikimedia.org/T332573)
[16:34:11] (JobUnavailable) firing: (4) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:36:26] jouncebot: next [16:36:26] In 1 hour(s) and 23 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240110T1800) [16:36:44] SRE, ops-codfw, ops-eqiad, Infrastructure-Foundations: Repurpose three decom servers as temporary ganeti-test1001/1002 and ganeti-test2004 - https://phabricator.wikimedia.org/T345602 (VRiley-WMF) [16:36:57] !log upgrade prometheus on prometheus2006 - T354399 [16:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:01] SRE, ops-eqiad, decommission-hardware: decommission ganeti-test1001, ganeti-test1002 - https://phabricator.wikimedia.org/T354680 (VRiley-WMF) Open→Resolved [16:37:02] T354399: Prometheus @ k8s OOM loop - https://phabricator.wikimedia.org/T354399 [16:37:17] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1378.eqiad.wmnet with OS bullseye [16:39:38] SRE, ops-eqiad, decommission-hardware: decommission ganeti-test1001, ganeti-test1002 - https://phabricator.wikimedia.org/T354680 (VRiley-WMF) Scripts have been run and these have been decommissioned [16:42:39] SRE, Traffic: Show a better error page when returning an HTTP 429, not the "Our servers are currently under maintenance" one for 5xxs - https://phabricator.wikimedia.org/T354718 (Jdforrester-WMF) [16:43:27] (PS1) Cathal Mooney: Remove irb.2201 gateway from codfw spines and move to lsw1-a2 [homer/public] - https://gerrit.wikimedia.org/r/989559 (https://phabricator.wikimedia.org/T348159) [16:44:11] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@eqiad -
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:44:25] PROBLEM - Memcached on mw1349 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Memcached [16:44:25] PROBLEM - SSH on mw1349 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:44:48] (PS6) Dzahn: phabricator: avoid duplicate list of servers for dumps [puppet] - https://gerrit.wikimedia.org/r/988112 (https://phabricator.wikimedia.org/T354221) [16:44:56] (CR) Cathal Mooney: [C: +2] Remove irb.2201 gateway from codfw spines and move to lsw1-a2 [homer/public] - https://gerrit.wikimedia.org/r/989559 (https://phabricator.wikimedia.org/T348159) (owner: Cathal Mooney) [16:45:05] PROBLEM - Check systemd state on mw1349 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-phpfpm-statustext-textfile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:47] (Merged) jenkins-bot: Remove irb.2201 gateway from codfw spines and move to lsw1-a2 [homer/public] - https://gerrit.wikimedia.org/r/989559 (https://phabricator.wikimedia.org/T348159) (owner: Cathal Mooney) [16:45:49] RECOVERY - Memcached on mw1349 is OK: TCP OK - 0.000 second response time on 10.64.48.191 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [16:45:49] RECOVERY - SSH on mw1349 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:46:39] RECOVERY - Check systemd state on mw1349 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:46:53] PROBLEM - Host ripe-atlas-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [16:46:53] PROBLEM - Host ripe-atlas-codfw is DOWN: PING CRITICAL - Packet loss =
100% [16:47:07] RECOVERY - PHP opcache health on mw2446 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:49:12] (CR) Dzahn: [C: +2] phabricator: avoid duplicate list of servers for dumps [puppet] - https://gerrit.wikimedia.org/r/988112 (https://phabricator.wikimedia.org/T354221) (owner: Dzahn) [16:49:46] (PS1) Brouberol: spark-history: pin image tags to explicit values instead of latest [deployment-charts] - https://gerrit.wikimedia.org/r/989561 (https://phabricator.wikimedia.org/T354785) [16:50:29] (Abandoned) Kamila Součková: TEMPORARY for debugging T354413: add role hiera [puppet] - https://gerrit.wikimedia.org/r/989192 (https://phabricator.wikimedia.org/T354413) (owner: Kamila Součková) [16:51:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [16:52:17] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1378.eqiad.wmnet with reason: host reimage [16:52:54] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Disk (sdh) failed in ms-be2068 - https://phabricator.wikimedia.org/T354180 (Jhancock.wm) Open→Resolved a:Jhancock.wm replacement disk arrived. faulty disk was sent back to dell. new one has been put into stock. [16:54:11] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:54:47] (CR) Btullis: "I notice that some deployments also override the docker-registry address to be docker-registry.discovery.wmcloud.org."
[deployment-charts] - https://gerrit.wikimedia.org/r/989561 (https://phabricator.wikimedia.org/T354785) (owner: Brouberol) [16:55:12] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1378.eqiad.wmnet with reason: host reimage [16:58:13] RECOVERY - Host ripe-atlas-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.40 ms [16:58:13] RECOVERY - Host ripe-atlas-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 30.55 ms [17:00:01] (PS1) Kamila Součková: Clean up the temporary changes for debugging T354413 [puppet] - https://gerrit.wikimedia.org/r/989562 [17:00:43] (CR) Dzahn: [C: +1] "per "compiler shows no change" I like this. nitpick might be now we call it "active_server" and "passive_server" for Phabricator but prima" [puppet] - https://gerrit.wikimedia.org/r/988464 (owner: EoghanGaffney) [17:03:32] (PS3) Dzahn: phabricator: use quickdatacopy for automatic home dir sync [puppet] - https://gerrit.wikimedia.org/r/988111 (https://phabricator.wikimedia.org/T354221) [17:03:42] (PS4) Dzahn: phabricator: use quickdatacopy for automatic home dir sync [puppet] - https://gerrit.wikimedia.org/r/988111 (https://phabricator.wikimedia.org/T354221) [17:04:49] (CR) CI reject: [V: -1] phabricator: use quickdatacopy for automatic home dir sync [puppet] - https://gerrit.wikimedia.org/r/988111 (https://phabricator.wikimedia.org/T354221) (owner: Dzahn) [17:06:14] SRE, ops-codfw, Infrastructure-Foundations, netops, Patch-For-Review: Migrate atlas-codfw from asw-a1-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T348159 (cmooney) [17:07:29] (CR) Btullis: [C: +2] Bring an-master1004 into service as a hadoop::standby [puppet] - https://gerrit.wikimedia.org/r/989214 (https://phabricator.wikimedia.org/T332573) (owner: Btullis) [17:07:47] (PS2) Brouberol: spark-history: pin image tags to explicit values instead of latest [deployment-charts] -
https://gerrit.wikimedia.org/r/989561 (https://phabricator.wikimedia.org/T354785) [17:07:51] (CR) Brouberol: spark-history: pin image tags to explicit values instead of latest (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/989561 (https://phabricator.wikimedia.org/T354785) (owner: Brouberol) [17:08:36] (PS5) Dzahn: phabricator: use quickdatacopy for automatic home dir sync [puppet] - https://gerrit.wikimedia.org/r/988111 (https://phabricator.wikimedia.org/T354221) [17:09:14] SRE, ops-codfw, Infrastructure-Foundations, netops: Codfw row A-B migration - non-standard device moves - https://phabricator.wikimedia.org/T348128 (cmooney) [17:09:24] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [17:09:43] (CR) CI reject: [V: -1] phabricator: use quickdatacopy for automatic home dir sync [puppet] - https://gerrit.wikimedia.org/r/988111 (https://phabricator.wikimedia.org/T354221) (owner: Dzahn) [17:09:50] SRE, ops-codfw, Infrastructure-Foundations, netops, Patch-For-Review: Migrate atlas-codfw from asw-a1-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T348159 (cmooney) Open→Resolved Work completed. Cable moved and irb.2201 added to lsw1-a2-codfw. As no other devices are o... [17:10:28] (CR) Btullis: spark-history: pin image tags to explicit values instead of latest (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/989561 (https://phabricator.wikimedia.org/T354785) (owner: Brouberol) [17:10:33] SRE, ops-codfw: cr2-codfw:FPC0 failure - https://phabricator.wikimedia.org/T354732 (Papaul) ` Case Number 2024-0110-046148 Case Type Tech Priority P2 - High Platform MX480 Status Dispatch [17:12:42] (PS3) Brouberol: spark-history: pin image tags to explicit values instead of latest [deployment-charts] - https://gerrit.wikimedia.org/r/989561 (https://phabricator.wikimedia.org/T354785) [17:12:49] (CR) Brouberol: "Good call!"
[deployment-charts] - https://gerrit.wikimedia.org/r/989561 (https://phabricator.wikimedia.org/T354785) (owner: Brouberol) [17:13:40] SRE, Infrastructure-Foundations, Prod-Kubernetes, netops, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (cmooney) >>! In T352893#9450929, @Volans wrote: > I might be missing context, but why we can't... [17:14:03] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update reverse dns for sandbox1-a-codfw irb.2201 gw - cmooney@cumin1002" [17:14:57] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update reverse dns for sandbox1-a-codfw irb.2201 gw - cmooney@cumin1002" [17:14:57] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:15:14] (CR) Kamila Součková: [C: +2] Clean up the temporary changes for debugging T354413 [puppet] - https://gerrit.wikimedia.org/r/989562 (owner: Kamila Součková) [17:15:22] (PS1) Peter Fischer: Search update pipeline: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/989563 (https://phabricator.wikimedia.org/T354517) [17:15:35] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1378.eqiad.wmnet with OS bullseye [17:16:19] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1377.eqiad.wmnet with OS bullseye [17:16:58] (CR) Peter Fischer: [C: +2] Search update pipeline: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/989563 (https://phabricator.wikimedia.org/T354517) (owner: Peter Fischer) [17:17:44] (Merged) jenkins-bot: Search update pipeline: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/989563 (https://phabricator.wikimedia.org/T354517) (owner:
Peter Fischer) [17:18:14] (PS6) Dzahn: phabricator: use quickdatacopy for automatic home dir sync [puppet] - https://gerrit.wikimedia.org/r/988111 (https://phabricator.wikimedia.org/T354221) [17:19:23] (CR) CI reject: [V: -1] phabricator: use quickdatacopy for automatic home dir sync [puppet] - https://gerrit.wikimedia.org/r/988111 (https://phabricator.wikimedia.org/T354221) (owner: Dzahn) [17:21:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [17:21:56] (CR) Brouberol: [C: +2] spark-history: pin image tags to explicit values instead of latest [deployment-charts] - https://gerrit.wikimedia.org/r/989561 (https://phabricator.wikimedia.org/T354785) (owner: Brouberol) [17:21:58] (CR) Ssingh: [C: +1] Add new codfw per-rack vlans to lvs2014 and move row B vlans [puppet] - https://gerrit.wikimedia.org/r/980409 (https://phabricator.wikimedia.org/T352758) (owner: Cathal Mooney) [17:22:26] (CR) Cathal Mooney: [C: +2] Add new codfw per-rack vlans to lvs2014 and move row B vlans [puppet] - https://gerrit.wikimedia.org/r/980409 (https://phabricator.wikimedia.org/T352758) (owner: Cathal Mooney) [17:22:47] (Merged) jenkins-bot: spark-history: pin image tags to explicit values instead of latest [deployment-charts] - https://gerrit.wikimedia.org/r/989561 (https://phabricator.wikimedia.org/T354785) (owner: Brouberol) [17:26:05] PROBLEM - Check systemd state on mw1379 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:26:27] PROBLEM - Check whether ferm is active by checking the default input chain on mw1379 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:26:29] (PS6) Dr0ptp4kt: webrequest varnishkafka - Add to X-Analytics prefetch indicators [puppet] - https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: Ottomata) [17:27:22] !log enable puppet on lvs2014: T352758 [17:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:26] T352758: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758 [17:27:35] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 215, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:28:33] SRE, ops-codfw, Infrastructure-Foundations, netops: Migrate mr1-codfw from asw-a1-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T348164 (cmooney) [17:28:51] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 297, down: 4, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:28:52] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs2014.codfw.wmnet [17:29:27] SRE, ops-codfw, Infrastructure-Foundations, netops: Migrate mr1-codfw from asw-a1-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T348164 (cmooney) Link is now up and BGP has established. ` cmooney@lsw1-a2-codfw> show route receive-protocol bgp 10.192.254.9 table PRODUCTION.inet.0 ters...
[17:30:29] (PS2) Stoyofuku-wmf: Disable max width for index namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/989262 (https://phabricator.wikimedia.org/T352162) [17:31:16] (PS7) Dzahn: phabricator: use quickdatacopy for automatic home dir sync [puppet] - https://gerrit.wikimedia.org/r/988111 (https://phabricator.wikimedia.org/T354221) [17:31:24] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1377.eqiad.wmnet with reason: host reimage [17:32:05] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:32:20] ^ expected [17:34:25] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 297, down: 4, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:34:39] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1377.eqiad.wmnet with reason: host reimage [17:35:18] (ProbeDown) firing: (4) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:37:36] (PS15) FNegri: dologmsg: standardize logging format [puppet] - https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) [17:37:44] (CR) FNegri: dologmsg: standardize logging format (1 comment) [puppet] - https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) (owner: FNegri) [17:39:11] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:40:21] !log sukhe@cumin2002 END
(FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host lvs2014.codfw.wmnet [17:40:59] PROBLEM - Check systemd state on lvs2014 is CRITICAL: CRITICAL - degraded: The following units failed: ipip-multiqueue-optimizer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:41:41] ^ looking into it [17:44:02] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [17:44:29] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:46:19] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [17:47:04] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:49:58] (PS1) Cathal Mooney: Add new codfw row a/b per-rack vlans to hieradata for lvs [puppet] - https://gerrit.wikimedia.org/r/989566 (https://phabricator.wikimedia.org/T352758) [17:50:42] (CR) Ssingh: [C: +1] Add new codfw row a/b per-rack vlans to hieradata for lvs [puppet] - https://gerrit.wikimedia.org/r/989566 (https://phabricator.wikimedia.org/T352758) (owner: Cathal Mooney) [17:51:52] (CR) Cathal Mooney: [C: +2] Add new codfw row a/b per-rack vlans to hieradata for lvs [puppet] - https://gerrit.wikimedia.org/r/989566 (https://phabricator.wikimedia.org/T352758) (owner: Cathal Mooney) [17:52:11] (CR) Dzahn: "timing!
Rsync::Quickdatacopy[phabricator-home-dirs]: has no parameter named 'use_generic_firewall' :p" [puppet] - https://gerrit.wikimedia.org/r/988111 (https://phabricator.wikimedia.org/T354221) (owner: Dzahn) [17:52:34] (PS8) Dzahn: phabricator: use quickdatacopy for automatic home dir sync [puppet] - https://gerrit.wikimedia.org/r/988111 (https://phabricator.wikimedia.org/T354221) [17:52:36] (CR) Filippo Giunchedi: [C: +1] lvs::realserver::ipip: Report errors on MSS monitoring [puppet] - https://gerrit.wikimedia.org/r/989459 (https://phabricator.wikimedia.org/T354721) (owner: Vgutierrez) [17:53:40] (CR) CI reject: [V: -1] phabricator: use quickdatacopy for automatic home dir sync [puppet] - https://gerrit.wikimedia.org/r/988111 (https://phabricator.wikimedia.org/T354221) (owner: Dzahn) [17:54:18] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1377.eqiad.wmnet with OS bullseye [17:54:31] RECOVERY - Check systemd state on lvs2014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:55:23] SRE, MW-on-K8s, serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (hnowlan) [17:57:06] jouncebot: next [17:57:06] In 0 hour(s) and 2 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240110T1800) [17:58:58] !log filippo@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on prometheus2005.codfw.wmnet with reason: memory upgrade [17:59:24] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on prometheus2005.codfw.wmnet with reason: memory upgrade [17:59:31] SRE, ops-codfw, Observability-Metrics: RAM upgrade for prometheus200[56] - https://phabricator.wikimedia.org/T354685 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=051e0f09-36d2-4483-9f30-3af19d6d5fa5) set by
filippo@cumin1002 for 4:00:00 on 1 host(s) and their services with reason... [17:59:33] (CR) Hashar: [C: -1] contint: use php7.4 on bullseye just like on buster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/987458 (https://phabricator.wikimedia.org/T334517) (owner: Dzahn) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240110T1800) [18:03:19] (CR) Dzahn: contint: use php7.4 on bullseye just like on buster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/987458 (https://phabricator.wikimedia.org/T334517) (owner: Dzahn) [18:04:35] (PS5) Dzahn: contint: use the same PHP packages on contint before and after distro upgrade [puppet] - https://gerrit.wikimedia.org/r/987458 (https://phabricator.wikimedia.org/T334517) [18:07:01] (CR) Dzahn: "I really thought you would like that a contint distro upgrade does NOT change anything about the PHP packages. Not sure what the disadvan" [puppet] - https://gerrit.wikimedia.org/r/987458 (https://phabricator.wikimedia.org/T334517) (owner: Dzahn) [18:07:30] (PS7) Dr0ptp4kt: webrequest varnishkafka - Add to X-Analytics prefetch indicators [puppet] - https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: Ottomata) [18:08:17] (CR) Krinkle: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: Ottomata) [18:08:45] (CR) Jdlrobson: [C: +1] Disable max width for index namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/989262 (https://phabricator.wikimedia.org/T352162) (owner: Stoyofuku-wmf) [18:09:11] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:10:10] (CR) Dzahn: "again, the intention here is to NOT change anything and if we DONT do this, then there will be a change. if that's what you prefer, that's" [puppet] - https://gerrit.wikimedia.org/r/987458 (https://phabricator.wikimedia.org/T334517) (owner: Dzahn) [18:14:45] (CR) Krinkle: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: Ottomata) [18:17:35] (CR) Dzahn: contint: use the same PHP packages on contint before and after distro upgrade (1 comment) [puppet] - https://gerrit.wikimedia.org/r/987458 (https://phabricator.wikimedia.org/T334517) (owner: Dzahn) [18:18:40] someone working on prometheus2005? [18:18:46] oh nvm, it's above, thank [18:18:46] s [18:24:11] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:24:15] !log stop pybal on lvs2013: T352758 [18:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:20] T352758: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758 [18:25:01] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:25:15] ^ expected, lvs2013, can't downtime [18:26:05] (PS8) Dr0ptp4kt: webrequest varnishkafka - Add to X-Analytics prefetch indicators [puppet] - https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: Ottomata) [18:26:24] That might've caused a
transient failure for me... maybe? [18:27:43] (PS9) Dzahn: phabricator: use quickdatacopy for automatic home dir sync [puppet] - https://gerrit.wikimedia.org/r/988111 (https://phabricator.wikimedia.org/T354221) [18:27:54] Reedy: what kind of failure was it? [18:28:09] (PS1) Tchanders: Add comment to clarify which rate limits apply to temporary users [mediawiki-config] - https://gerrit.wikimedia.org/r/989569 (https://phabricator.wikimedia.org/T331576) [18:28:13] esams varnish complaining about upstream [18:28:19] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [18:28:41] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:28:45] ^ expected [18:28:55] good [18:29:02] Reedy: ok, unlikely to be related but let us know here if you see something more? thanks [18:29:07] PROBLEM - pybal on lvs2013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [18:29:11] (JobUnavailable) firing: (4) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:29:20] please ignore pybal alerts on lvs2013 thanks [18:30:21] grafana says more 500 errors like 40 a second rn [18:30:31] looking [18:30:48] hmm [18:30:54] they've gone down a bit [18:31:42] (CR) Dr0ptp4kt: "Added some more patterns, because here more data is probably better."
[puppet] - https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: Ottomata) [18:32:04] seems back to normal now [18:32:17] yeah, let's see [18:33:01] PROBLEM - PyBal connections to etcd on lvs2013 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=80) https://wikitech.wikimedia.org/wiki/PyBal [18:34:11] (JobUnavailable) firing: (4) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:35:27] !log filippo@cumin1002 START - Cookbook sre.hosts.remove-downtime for prometheus2005.codfw.wmnet [18:35:28] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for prometheus2005.codfw.wmnet [18:36:42] sukhe: I'm about to power down prometheus2006 FYI, 2005 is back up [18:36:47] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [18:37:20] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [18:37:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [18:37:27] godog: np! [18:37:53] !log filippo@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on prometheus2006.codfw.wmnet with reason: memory upgrade [18:38:03] ack thanks [18:38:08] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on prometheus2006.codfw.wmnet with reason: memory upgrade [18:38:14] SRE, ops-codfw, Observability-Metrics: RAM upgrade for prometheus200[56] - https://phabricator.wikimedia.org/T354685 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5cd14798-01b6-43d4-ae86-3ca1bc26f98b) set by filippo@cumin1002 for 1:00:00 on 1 host(s) and their services with reason...
[18:40:49] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [18:41:25] RECOVERY - Check systemd state on mw1379 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:43:16] (PS9) Dr0ptp4kt: webrequest varnishkafka - Add to X-Analytics prefetch indicators [puppet] - https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: Ottomata) [18:46:23] (CR) Dzahn: [C: +1] "ready for review now: https://puppet-compiler.wmflabs.org/output/988111/1075/" [puppet] - https://gerrit.wikimedia.org/r/988111 (https://phabricator.wikimedia.org/T354221) (owner: Dzahn) [18:57:59] RECOVERY - Check whether ferm is active by checking the default input chain on mw1379 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:59:23] (CR) Kosta Harlan: [C: +1] Add comment to clarify which rate limits apply to temporary users [mediawiki-config] - https://gerrit.wikimedia.org/r/989569 (https://phabricator.wikimedia.org/T331576) (owner: Tchanders) [19:00:05] jeena and dduvall: Time to do the Train log triage with CPT deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240110T1900). [19:00:05] jeena and dduvall: Time to do the MediaWiki train - Utc-7 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240110T1900). [19:00:10] SRE, ops-codfw, Observability-Metrics: RAM upgrade for prometheus200[56] - https://phabricator.wikimedia.org/T354685 (fgiunchedi) Open→Resolved This is complete, prometheus codfw has 192GB of ram, thank you @Jhancock.wm and @wiki_willy for your help!
[19:00:15] SRE, ops-codfw, ops-eqiad, Observability-Metrics: Investigate memory increase for Prometheus hosts in codfw/eqiad - https://phabricator.wikimedia.org/T354606 (fgiunchedi) [19:00:32] o/ [19:00:51] !log disabling OSPF connection from mr1-codfw to codfw core routers T348164 [19:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:55] T348164: Migrate mr1-codfw from asw-a1-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T348164 [19:02:57] (PS1) TrainBranchBot: group1 wikis to 1.42.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/989573 (https://phabricator.wikimedia.org/T350089) [19:02:59] (CR) TrainBranchBot: [C: +2] group1 wikis to 1.42.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/989573 (https://phabricator.wikimedia.org/T350089) (owner: TrainBranchBot) [19:03:51] (Merged) jenkins-bot: group1 wikis to 1.42.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/989573 (https://phabricator.wikimedia.org/T350089) (owner: TrainBranchBot) [19:07:09] (PS1) Cathal Mooney: Remove OSPF adjacency between codfw core routers and mr1-codfw [homer/public] - https://gerrit.wikimedia.org/r/989574 (https://phabricator.wikimedia.org/T348164) [19:08:59] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 138, down: 24, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:11:39] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:18:07] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.13 refs T350089
[19:18:22] T350089: 1.42.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T350089 [19:24:46] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [19:25:54] (03PS1) 10Dzahn: peopleweb: (WIP) warn about large user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/989577 [19:26:06] !log jhuneidi@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.13 refs T350089 (duration: 07m 58s) [19:26:10] T350089: 1.42.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T350089 [19:27:07] (03CR) 10CI reject: [V: 04-1] peopleweb: (WIP) warn about large user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/989577 (owner: 10Dzahn) [19:28:07] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove old records for mr1-codfw core links - cmooney@cumin1002" [19:29:23] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove old records for mr1-codfw core links - cmooney@cumin1002" [19:29:23] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:29:24] (03PS1) 10Cathal Mooney: Remove IPv6 reverse include statements for old mr1-codfw CR links [dns] - 10https://gerrit.wikimedia.org/r/989578 (https://phabricator.wikimedia.org/T348164) [19:32:32] (03CR) 10Cathal Mooney: [C: 03+2] Remove OSPF adjacency between codfw core routers and mr1-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/989574 (https://phabricator.wikimedia.org/T348164) (owner: 10Cathal Mooney) [19:36:35] (03CR) 10Ssingh: [C: 03+1] Remove IPv6 reverse include statements for old mr1-codfw CR links [dns] - 10https://gerrit.wikimedia.org/r/989578 (https://phabricator.wikimedia.org/T348164) (owner: 10Cathal Mooney) [19:41:35] (03PS10) 10Dr0ptp4kt: webrequest varnishkafka - Add to X-Analytics prefetch indicators [puppet] - 10https://gerrit.wikimedia.org/r/981352 
(https://phabricator.wikimedia.org/T346463) (owner: 10Ottomata) [19:41:56] (03CR) 10Cathal Mooney: [C: 03+2] Remove IPv6 reverse include statements for old mr1-codfw CR links [dns] - 10https://gerrit.wikimedia.org/r/989578 (https://phabricator.wikimedia.org/T348164) (owner: 10Cathal Mooney) [19:47:38] (03Merged) 10jenkins-bot: Remove OSPF adjacency between codfw core routers and mr1-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/989574 (https://phabricator.wikimedia.org/T348164) (owner: 10Cathal Mooney) [19:54:11] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [19:54:42] 10SRE, 10ops-codfw: cr2-codfw:FPC0 failure - https://phabricator.wikimedia.org/T354732 (10Papaul) @ayounsi see below email from Juniper support ` Hello Papaul I went and checked the logs; I can see the following Jan 10 18:46:51 re0.cr2-codfw chassisd[32915]: CHASSISD_I2CS_READBACK_ERROR: Readback err... 
[19:58:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:03:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:03:33] 10SRE, 10ops-codfw, 10ops-eqiad, 10Observability-Metrics: Investigate memory increase for Prometheus hosts in codfw/eqiad - https://phabricator.wikimedia.org/T354606 (10VRiley-WMF) At eqiad we have 2 32GB 2Rx4 PC4 2666V that are available. 
[20:06:17] (03PS1) 10Cathal Mooney: Remove OSPF from allowed from trust zone on mr1-codfw, add BGP [homer/public] - 10https://gerrit.wikimedia.org/r/989583 (https://phabricator.wikimedia.org/T348164) [20:06:55] (03CR) 10Cathal Mooney: [C: 03+2] Remove OSPF from allowed from trust zone on mr1-codfw, add BGP [homer/public] - 10https://gerrit.wikimedia.org/r/989583 (https://phabricator.wikimedia.org/T348164) (owner: 10Cathal Mooney) [20:08:10] (03Merged) 10jenkins-bot: Remove OSPF from allowed from trust zone on mr1-codfw, add BGP [homer/public] - 10https://gerrit.wikimedia.org/r/989583 (https://phabricator.wikimedia.org/T348164) (owner: 10Cathal Mooney) [20:15:05] (03PS1) 10Ssingh: hiera: temporarily set bgp-med to 101 for lvs2013 [puppet] - 10https://gerrit.wikimedia.org/r/989585 (https://phabricator.wikimedia.org/T352758) [20:15:23] (03CR) 10Marostegui: [C: 03+1] mariadb: Change nagios user identfication to be unix_socket [puppet] - 10https://gerrit.wikimedia.org/r/989556 (owner: 10Ladsgroup) [20:17:01] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1076/co" [puppet] - 10https://gerrit.wikimedia.org/r/989585 (https://phabricator.wikimedia.org/T352758) (owner: 10Ssingh) [20:18:11] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: temporarily set bgp-med to 101 for lvs2013 [puppet] - 10https://gerrit.wikimedia.org/r/989585 (https://phabricator.wikimedia.org/T352758) (owner: 10Ssingh) [20:21:44] (03PS1) 10Sbisson: Enable Wikistories on test and test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989606 (https://phabricator.wikimedia.org/T352454) [20:22:19] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:22:24] (03CR) 10CI reject: [V: 04-1] Enable Wikistories on test and test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989606 
(https://phabricator.wikimedia.org/T352454) (owner: 10Sbisson) [20:22:39] !log enable puppet on lvs2013: T352758 [20:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:43] T352758: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758 [20:22:57] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 215, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:22:59] RECOVERY - pybal on lvs2013 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [20:24:07] RECOVERY - PyBal connections to etcd on lvs2013 is OK: OK: 80 connections established with conf2004.codfw.wmnet:4001 (min=80) https://wikitech.wikimedia.org/wiki/PyBal [20:25:50] (03PS1) 10Cathal Mooney: Remove OSPF stub config for interfaces on mr1-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/989607 (https://phabricator.wikimedia.org/T348164) [20:26:25] (03CR) 10Cathal Mooney: [C: 03+2] Remove OSPF stub config for interfaces on mr1-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/989607 (https://phabricator.wikimedia.org/T348164) (owner: 10Cathal Mooney) [20:26:39] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:27:00] (03Merged) 10jenkins-bot: Remove OSPF stub config for interfaces on mr1-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/989607 (https://phabricator.wikimedia.org/T348164) (owner: 10Cathal Mooney) [20:29:11] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:30:56] (03PS2) 10Sbisson: Enable Wikistories on test and test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989606 (https://phabricator.wikimedia.org/T352454) [20:33:53] (03PS1) 10Ssingh: Revert "hiera: temporarily set bgp-med to 101 for lvs2013" [puppet] - 10https://gerrit.wikimedia.org/r/989587 [20:34:11] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:37:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:40:23] (03CR) 10Ssingh: [C: 03+2] Revert "hiera: temporarily set bgp-med to 101 for lvs2013" [puppet] - 10https://gerrit.wikimedia.org/r/989587 (owner: 10Ssingh) [20:42:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:56:43] jouncebot: nowandnext [20:56:44] For the next 0 hour(s) and 3 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240110T1900) [20:56:44] In 0 hour(s) and 3 minute(s): UTC late backport window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240110T2100) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240110T2100). [21:00:05] toyofuku: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:17] o/ I can deploy [21:02:57] topranks: ping - you here? [21:03:07] oops, sorry. meant to ping toyofuku [21:03:19] Hello yes I am here! [21:03:42] great. do you have WikimediaDebug browser extension installed? [21:03:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989262 (https://phabricator.wikimedia.org/T352162) (owner: 10Stoyofuku-wmf) [21:03:57] I do! [21:04:20] cool, I'll ping you in a few moments when your patch will be available to test [21:04:36] Perfect, thank you so much for doing the deploy ☺️ [21:04:37] (03Merged) 10jenkins-bot: Disable max width for index namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989262 (https://phabricator.wikimedia.org/T352162) (owner: 10Stoyofuku-wmf) [21:05:00] !log taavi@deploy2002 Started scap: Backport for [[gerrit:989262|Disable max width for index namespace (T352162)]] [21:05:14] T352162: Disable fixed width on index namespace in Wikisource - https://phabricator.wikimedia.org/T352162 [21:05:34] taavi: no worries :) [21:08:37] !log taavi@deploy2002 toyofuku and taavi: Backport for [[gerrit:989262|Disable max width for index namespace (T352162)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:09:32] toyofuku: your change can now be tested via the WikimediaDebug extension. 
please test that your change works as expected and report back once you're done [21:09:42] Testing now, thank you! [21:10:42] looking good so far, one more sec pls! [21:10:53] sure, no hurry [21:12:16] Okay, I'm confident [21:12:22] Ready to move on, thank you for your patience! [21:12:45] !log taavi@deploy2002 toyofuku and taavi: Continuing with sync [21:12:48] ok, continuing [21:19:00] Once the other patches in this backport window are done, I'd like to apply https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/989569/. I now have production access and this only modifies a comment, so happy to try this myself. [21:19:20] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:989262|Disable max width for index namespace (T352162)]] (duration: 14m 19s) [21:19:24] T352162: Disable fixed width on index namespace in Wikisource - https://phabricator.wikimedia.org/T352162 [21:19:35] toyofuku: your patch is now live [21:19:50] Dreamy_Jazz: feel free to go ahead, let me know if I can be of any assistance [21:19:54] Thanks! [21:19:58] Thank you!! 
[21:24:29] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989569 (https://phabricator.wikimedia.org/T331576) (owner: 10Tchanders) [21:25:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:25:11] (03Merged) 10jenkins-bot: Add comment to clarify which rate limits apply to temporary users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989569 (https://phabricator.wikimedia.org/T331576) (owner: 10Tchanders) [21:25:35] !log dreamyjazz@deploy2002 Started scap: Backport for [[gerrit:989569|Add comment to clarify which rate limits apply to temporary users (T331576)]] [21:25:38] T331576: Rate limits for Temporary account should match those for anon users - https://phabricator.wikimedia.org/T331576 [21:26:33] Dreamy_Jazz: didn't know you're a deployer now! congrats :) [21:26:56] feel free to ping me at any time for anything deployment-wise, i'd be glad to help [21:27:11] Thanks! 
I got access after https://phabricator.wikimedia.org/T353735 [21:27:15] !log dreamyjazz@deploy2002 dreamyjazz and tchanders: Backport for [[gerrit:989569|Add comment to clarify which rate limits apply to temporary users (T331576)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:27:25] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:27:25] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:27:47] !log dreamyjazz@deploy2002 dreamyjazz and tchanders: Continuing with sync [21:28:22] It has been really helpful to be able to run maintenance scripts and test out SQL on production DBs. [21:28:33] !log herron@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-codfw [21:28:36] i can see that :) [21:28:51] Especially for CheckUser, as tables are private. [21:29:09] yup [21:30:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:30:16] Dreamy_Jazz: anyway, i added you to `wmf-deployment` in Gerrit as well, should you need +2 config / wmf branches manually :) [21:30:23] Thanks! [21:30:38] I had assumed that would come automatically, but thanks for doing so. 
[21:30:40] (03PS2) 10Ladsgroup: mariadb: Change nagios user identfication to be unix_socket [puppet] - 10https://gerrit.wikimedia.org/r/989556 [21:30:44] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Change nagios user identfication to be unix_socket [puppet] - 10https://gerrit.wikimedia.org/r/989556 (owner: 10Ladsgroup) [21:31:18] nope, it's all manual. but now that scap does it for you, it's not always needed [21:31:24] 👍 [21:32:00] Although it hasn't stopped me from doing this, it seems that https://deploy-commands.toolforge.org/ is down. [21:32:07] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:32:29] Dreamy_Jazz: afaik it showed the "old" way of deployment. today it's a single command. [21:32:45] Ah. I had not realised it was out of date. [21:32:45] 10SRE, 10ops-eqiad, 10Observability-Metrics: RAM upgrade for prometheus100[56] - https://phabricator.wikimedia.org/T354684 (10wiki_willy) a:03VRiley-WMF [21:33:09] yeah. deployment used to take like six commands you had to enter in order. now we're down to one :) [21:33:15] Tbh the https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers page also seems to be a bit out of date too. [21:33:19] urbanecm: I thought it had been updated to just have the 1 [21:33:40] !log dreamyjazz@deploy2002 Finished scap: Backport for [[gerrit:989569|Add comment to clarify which rate limits apply to temporary users (T331576)]] (duration: 08m 05s) [21:33:44] T331576: Rate limits for Temporary account should match those for anon users - https://phabricator.wikimedia.org/T331576 [21:34:01] As I didn't have to do any git commands as specified at https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#Fetching_patches [21:34:06] yup yup [21:34:09] I feel like I filed a bug about that page a while ago :-) [21:34:42] All complete with no red text that I could see, so I think I'm done. 
[21:35:06] 10SRE, 10ops-codfw, 10ops-eqiad, 10Observability-Metrics: Investigate memory increase for Prometheus hosts in codfw/eqiad - https://phabricator.wikimedia.org/T354606 (10wiki_willy) Thanks @VRiley-WMF. I have T354684 assigned over to you, so you can work with @fgiunchedi on coordinating downtime for the up... [21:35:18] (ProbeDown) firing: (4) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:36:14] !log UTC late deploys done [21:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:49] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:43:01] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:43:01] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:47:41] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:50:47] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:52:21] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:53:53] RECOVERY - Check systemd state on cloudweb1003 
is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:54:18] !log herron@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-codfw [21:57:47] Dreamy_Jazz: I took a look at the backport documentation and the "scap backport" section, which comes before the "Merging and applying patches" section explains to continue reading for manual deployment instructions. Maybe it could be made more explicit in the documentation somehow that one can either use scap backport, or manually merge and deploy [22:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240110T2200) [22:00:19] We're not using our window, in case someone needs to deploy. [22:05:10] !log herron@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-eqiad [22:10:41] 10SRE, 10Infrastructure-Foundations, 10netops: Automate BGP peering on MR routers towards core - https://phabricator.wikimedia.org/T354809 (10cmooney) p:05Triage→03Low [22:13:58] jeena: Perhaps that would be helpful. 
[22:16:23] (03PS2) 10Ryan Kemper: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [22:17:30] (03CR) 10CI reject: [V: 04-1] wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [22:27:16] (03PS3) 10Ryan Kemper: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [22:28:24] (03CR) 10CI reject: [V: 04-1] wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [22:29:00] !log herron@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-eqiad [22:29:24] (03PS14) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) [22:30:04] (03CR) 10CI reject: [V: 04-1] Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [22:30:34] (03CR) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [22:32:01] (03PS4) 10Ryan Kemper: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [22:32:47] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:33:10] (03CR) 10CI reject: [V: 04-1] 
wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [22:35:53] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:36:02] (03PS15) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) [22:36:13] (03CR) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [22:36:15] (03PS5) 10Ryan Kemper: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [22:36:40] (03PS6) 10Ryan Kemper: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [22:36:42] (03CR) 10CI reject: [V: 04-1] Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [22:40:15] (03CR) 10CI reject: [V: 04-1] wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [22:42:03] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:42:03] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:43:54] (03PS7) 10Ryan Kemper: wdqs-test: Enable 
PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [22:46:43] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:47:32] (03CR) 10CI reject: [V: 04-1] wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [22:49:49] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:50:23] (03PS8) 10Ryan Kemper: wdqs-test: Enable PKI [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [22:54:27] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:54:27] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:56:30] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989244 (https://phabricator.wikimedia.org/T354555) (owner: 10Bking) [23:34:53] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:34:53] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:37:59] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:37:59] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:42:39] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:45:45] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:50:25] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:50:25] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:55:03] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:55:03] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:55:19] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [23:56:37] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:56:37] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state