[00:49:22] (PS1) Scott French: mediawiki-cache-warmup: prepare for bare-metal turndown [puppet] - https://gerrit.wikimedia.org/r/1053819
[00:51:51] (CR) CI reject: [V:-1] mediawiki-cache-warmup: prepare for bare-metal turndown [puppet] - https://gerrit.wikimedia.org/r/1053819 (owner: Scott French)
[00:51:54] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1053813 (owner: TrainBranchBot)
[00:55:33] (PS2) Scott French: mediawiki-cache-warmup: prepare for bare-metal turndown [puppet] - https://gerrit.wikimedia.org/r/1053819
[01:02:52] (PS3) Scott French: mediawiki-cache-warmup: prepare for bare-metal turndown [puppet] - https://gerrit.wikimedia.org/r/1053819 (https://phabricator.wikimedia.org/T367949)
[01:13:22] (CR) Scott French: "Reuven, here is the change we discussed out of band. Thanks in advance for review!" [puppet] - https://gerrit.wikimedia.org/r/1053819 (https://phabricator.wikimedia.org/T367949) (owner: Scott French)
[01:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:15:25] FIRING: SystemdUnitFailed: community_civicrm-cv-job-run.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:20:25] RESOLVED: SystemdUnitFailed: community_civicrm-cv-job-run.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:52:03] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 36 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:52:23] (PS1) Scott French: WIP: switchdc: prepare mediawiki cache warmup for bare-metal turndown [cookbooks] - https://gerrit.wikimedia.org/r/1053823
[01:54:11] (CR) Scott French: "Something like this is kind of what I had in mind: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1053823" [puppet] - https://gerrit.wikimedia.org/r/1053819 (https://phabricator.wikimedia.org/T367949) (owner: Scott French)
[01:57:03] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 24 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:39:18] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:35] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:08:11] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 354.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:15:11] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 26.87 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:37:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[03:52:22] !log ayounsi@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host netboxdb2003.codfw.wmnet with OS bookworm
[03:52:23] !log ayounsi@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host netboxdb2003.codfw.wmnet
[04:14:11] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 464.98 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:30:27] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 4 (netbox1003, ...), Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:47:43] (PS1) Marostegui: Revert "db2127: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1053826
[04:48:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P66365 and previous config saved to /var/cache/conftool/dbconfig/20240712-044802-root.json
[04:48:04] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#9975782 (Papaul) @wiki_willy I did more tests on this pxe boot issue we are having with the 10G Dell NIC card by taking one of the decommissioned se...
[04:49:17] (CR) Marostegui: [C:+2] Revert "db2127: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1053826 (owner: Marostegui)
[05:03:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P66366 and previous config saved to /var/cache/conftool/dbconfig/20240712-050307-root.json
[05:05:29] ACKNOWLEDGEMENT - SSH on db1179 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Marostegui T369855 https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:05:29] ACKNOWLEDGEMENT - Host db1179 #page is DOWN: PING CRITICAL - Packet loss = 100% Marostegui T369855
[05:08:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2136', diff saved to https://phabricator.wikimedia.org/P66367 and previous config saved to /var/cache/conftool/dbconfig/20240712-050800-root.json
[05:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:15:13] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:16:03] (PS1) Gerrit maintenance bot: mariadb: Promote db2218 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1053827 (https://phabricator.wikimedia.org/T369882)
[05:18:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P66368 and previous config saved to /var/cache/conftool/dbconfig/20240712-051813-root.json
[05:30:27] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 146 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:33:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P66369 and previous config saved to /var/cache/conftool/dbconfig/20240712-053318-root.json
[05:48:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P66370 and previous config saved to /var/cache/conftool/dbconfig/20240712-054824-root.json
[05:52:34] ops-eqiad, SRE, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T368766#9975815 (phaultfinder)
[05:58:24] ops-codfw, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9975816 (Marostegui) @Papaul I cannot access the host via ssh remotely, but the host is up and has network. I've connected via supermicro idrac and I think it is related to...
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240712T0600)
[06:03:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P66371 and previous config saved to /var/cache/conftool/dbconfig/20240712-060329-root.json
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:18:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P66372 and previous config saved to /var/cache/conftool/dbconfig/20240712-061835-root.json
[06:37:48] !log Starting MediaModeration scan on commons after it crashed last night due to database issues - https://wikitech.wikimedia.org/wiki/MediaModeration
[06:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:09] (CR) Ilias Sarantopoulos: [C:+2] ml-services: article_descriptions from src dir [deployment-charts] - https://gerrit.wikimedia.org/r/1053536 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[06:43:02] (Merged) jenkins-bot: ml-services: article_descriptions from src dir [deployment-charts] - https://gerrit.wikimedia.org/r/1053536 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[06:45:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T367856)', diff saved to https://phabricator.wikimedia.org/P66373 and previous config saved to /var/cache/conftool/dbconfig/20240712-064518-marostegui.json
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240712T0700)
[07:00:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P66374 and previous config saved to /var/cache/conftool/dbconfig/20240712-070026-marostegui.json
[07:04:18] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:12:22] (PS1) Slyngshede: mediawiki: fail gracefully on missing LDAP field. [software/bitu] - https://gerrit.wikimedia.org/r/1053837
[07:15:00] (CR) Slyngshede: [C:+2] mediawiki: fail gracefully on missing LDAP field. [software/bitu] - https://gerrit.wikimedia.org/r/1053837 (owner: Slyngshede)
[07:15:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P66375 and previous config saved to /var/cache/conftool/dbconfig/20240712-071533-marostegui.json
[07:22:01] (PS1) DCausse: Re-add CirrusSearch prefix to statsd metrics [extensions/CirrusSearch] (wmf/1.43.0-wmf.13) - https://gerrit.wikimedia.org/r/1053838 (https://phabricator.wikimedia.org/T359033)
[07:24:09] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
[07:30:40] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
[07:30:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T367856)', diff saved to https://phabricator.wikimedia.org/P66376 and previous config saved to /var/cache/conftool/dbconfig/20240712-073040-marostegui.json
[07:30:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2218.codfw.wmnet with reason: Maintenance
[07:30:45] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[07:30:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2218.codfw.wmnet with reason: Maintenance
[07:31:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2218 (T367856)', diff saved to https://phabricator.wikimedia.org/P66377 and previous config saved to /var/cache/conftool/dbconfig/20240712-073102-marostegui.json
[07:37:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[07:40:01] (CR) Elukey: [C:+1] "Totally ok with those!" [deployment-charts] - https://gerrit.wikimedia.org/r/1053809 (https://phabricator.wikimedia.org/T367949) (owner: Scott French)
[07:43:29] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 238800416 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[07:44:31] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 86152 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[07:45:28] (CR) Kevin Bazira: [C:+1] ml-services: enable multiprocessing for arwiki-damaging [deployment-charts] - https://gerrit.wikimedia.org/r/1053835 (https://phabricator.wikimedia.org/T349274) (owner: Ilias Sarantopoulos)
[07:51:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T367856)', diff saved to https://phabricator.wikimedia.org/P66379 and previous config saved to /var/cache/conftool/dbconfig/20240712-075100-marostegui.json
[07:51:04] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[07:58:38] (CR) Jelto: [C:+2] Bump all buildkit image tags to wmf-v0.15.0-1 [puppet] - https://gerrit.wikimedia.org/r/1053774 (https://phabricator.wikimedia.org/T369862) (owner: Ahmon Dancy)
[08:00:15] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:02:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:06:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P66380 and previous config saved to /var/cache/conftool/dbconfig/20240712-080607-marostegui.json
[08:09:29] (CR) DCausse: [C:+2] "backporting" [extensions/CirrusSearch] (wmf/1.43.0-wmf.13) - https://gerrit.wikimedia.org/r/1053838 (https://phabricator.wikimedia.org/T359033) (owner: DCausse)
[08:12:02] (CR) Jelto: [C:+2] gitlab: switch gitlab-replica-b from iptables to nftables [puppet] - https://gerrit.wikimedia.org/r/1053306 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[08:13:19] (PS2) Elukey: profile::puppetserver::gitprivate: fix post-commit hook [puppet] - https://gerrit.wikimedia.org/r/1053623 (https://phabricator.wikimedia.org/T368023)
[08:13:19] (PS3) Elukey: profile::tcpircbot: allow inbound conn from puppetserver nodes [puppet] - https://gerrit.wikimedia.org/r/1053616 (https://phabricator.wikimedia.org/T368023)
[08:13:19] (PS4) Elukey: profile::kerberos::kadminserver: allow more nodes in rsync [puppet] - https://gerrit.wikimedia.org/r/1053619 (https://phabricator.wikimedia.org/T368023)
[08:14:07] (CR) Elukey: "Ok I think now the code should be correct, lemme know if it makes sense!" [puppet] - https://gerrit.wikimedia.org/r/1053623 (https://phabricator.wikimedia.org/T368023) (owner: Elukey)
[08:15:32] (CR) JMeybohm: "> But it might be worth checking with the repo contributors whether they actually use or want to use the feature, because it doesn't seem " [deployment-charts] - https://gerrit.wikimedia.org/r/1052964 (https://phabricator.wikimedia.org/T368251) (owner: JMeybohm)
[08:18:35] (PS1) Jelto: Revert "gitlab: switch gitlab-replica-b from iptables to nftables" [puppet] - https://gerrit.wikimedia.org/r/1053873
[08:19:29] (CR) DCausse: [C:-2] Re-add CirrusSearch prefix to statsd metrics [extensions/CirrusSearch] (wmf/1.43.0-wmf.13) - https://gerrit.wikimedia.org/r/1053838 (https://phabricator.wikimedia.org/T359033) (owner: DCausse)
[08:19:35] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3207/co" [puppet] - https://gerrit.wikimedia.org/r/1053623 (https://phabricator.wikimedia.org/T368023) (owner: Elukey)
[08:20:49] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1053874
[08:20:49] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1053874 (owner: TrainBranchBot)
[08:21:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P66381 and previous config saved to /var/cache/conftool/dbconfig/20240712-082114-marostegui.json
[08:22:57] (CR) Jelto: [C:+2] Revert "gitlab: switch gitlab-replica-b from iptables to nftables" [puppet] - https://gerrit.wikimedia.org/r/1053873 (owner: Jelto)
[08:23:43] (CR) Jelto: [C:+2] "puppet fails with:" [puppet] - https://gerrit.wikimedia.org/r/1053306 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[08:23:47] (CR) Volans: [C:+1] "LGTM, thx" [software/spicerack] - https://gerrit.wikimedia.org/r/1053801 (https://phabricator.wikimedia.org/T367949) (owner: Scott French)
[08:25:20] (CR) Volans: [C:-1] vrts: fix proxy for download (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1053761 (https://phabricator.wikimedia.org/T366078) (owner: AOkoth)
[08:30:15] FIRING: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:32:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 23.23% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:35:15] RESOLVED: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:36:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T367856)', diff saved to https://phabricator.wikimedia.org/P66382 and previous config saved to /var/cache/conftool/dbconfig/20240712-083621-marostegui.json
[08:36:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[08:36:27] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[08:36:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[08:36:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1189 (T367856)', diff saved to https://phabricator.wikimedia.org/P66383 and previous config saved to /var/cache/conftool/dbconfig/20240712-083644-marostegui.json
[08:37:13] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 319.38 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:37:14] (CR) DCausse: [C:+2] "backporting" [extensions/CirrusSearch] (wmf/1.43.0-wmf.13) - https://gerrit.wikimedia.org/r/1053838 (https://phabricator.wikimedia.org/T359033) (owner: DCausse)
[08:37:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 23.23% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:38:16] (CR) AOkoth: vrts: fix proxy for download (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1053761 (https://phabricator.wikimedia.org/T366078) (owner: AOkoth)
[08:42:13] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 41.38 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:42:19] !log tweak benthos@webrequest_live output batching on centrallog2001 - T369737
[08:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:22] T369737: Site Issue: Delayed data in the `webrequest_sampled_live` kafka topic - https://phabricator.wikimedia.org/T369737
[08:46:35] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1053874 (owner: TrainBranchBot)
[08:57:02] (PS1) Jelto: gitlab: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/1053877 (https://phabricator.wikimedia.org/T366882)
[08:59:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on db1179.eqiad.wmnet with reason: T369855
[08:59:25] T369855: db1179 stopped answering ping, depooled - https://phabricator.wikimedia.org/T369855
[08:59:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on db1179.eqiad.wmnet with reason: T369855
[08:59:58] (CR) Klausman: [C:+1] kserve-inference: update references to deprecated services in fixtures [deployment-charts] - https://gerrit.wikimedia.org/r/1053809 (https://phabricator.wikimedia.org/T367949) (owner: Scott French)
[09:00:15] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:01:20] ops-eqiad, DBA, DC-Ops: db1179 stopped answering ping, depooled - https://phabricator.wikimedia.org/T369855#9975979 (ABran-WMF) I am unable to reach it via management interface either, it might need a bit of hands on
[09:02:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:03:42] sre-alert-triage, SRE Observability (FY2024/2025-Q1): Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T354255#9975993 (LSobanski) There are now five other similar alerts that are over a month old: Linting problems found for DatahubNextServiceUnavai...
[09:03:55] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3208/co" [puppet] - https://gerrit.wikimedia.org/r/1053877 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[09:04:08] (Merged) jenkins-bot: Re-add CirrusSearch prefix to statsd metrics [extensions/CirrusSearch] (wmf/1.43.0-wmf.13) - https://gerrit.wikimedia.org/r/1053838 (https://phabricator.wikimedia.org/T359033) (owner: DCausse)
[09:04:49] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[09:05:02] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[09:05:03] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[09:05:20] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[09:05:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T367781)', diff saved to https://phabricator.wikimedia.org/P66384 and previous config saved to /var/cache/conftool/dbconfig/20240712-090527-arnaudb.json
[09:05:31] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[09:08:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 5%: stopping T367781', diff saved to https://phabricator.wikimedia.org/P66385 and previous config saved to /var/cache/conftool/dbconfig/20240712-090849-arnaudb.json
[09:10:04] !log upgrade httpd version in production (bullseye/bookworm) for T369885
[09:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:56] !log dcausse@deploy1002 Started scap sync-world: Backport for [[gerrit:1053838|Re-add CirrusSearch prefix to statsd metrics (T359033)]]
[09:10:59] T359033: EPIC: Convert CirrusSearch metrics to statslib - https://phabricator.wikimedia.org/T359033
[09:11:03] (CR) Jelto: [V:+1] "I think this is needed first before we can migrate to nftables. In PCC the resource names change (underscores) but Ferm::Service is still " [puppet] - https://gerrit.wikimedia.org/r/1053877 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[09:12:46] (PS1) Jelto: gitlab: switch gitlab-replica-b from iptables to nftables [puppet] - https://gerrit.wikimedia.org/r/1053879 (https://phabricator.wikimedia.org/T366882)
[09:13:27] !log dcausse@deploy1002 dcausse: Backport for [[gerrit:1053838|Re-add CirrusSearch prefix to statsd metrics (T359033)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:15:09] (PS1) Filippo Giunchedi: benthos: adjust webrequest_live batching [puppet] - https://gerrit.wikimedia.org/r/1053881 (https://phabricator.wikimedia.org/T369737)
[09:15:43] !log dcausse@deploy1002 dcausse: Continuing with sync
[09:16:31] (CR) Elukey: [C:+1] benthos: adjust webrequest_live batching [puppet] - https://gerrit.wikimedia.org/r/1053881 (https://phabricator.wikimedia.org/T369737) (owner: Filippo Giunchedi)
[09:17:09] (CR) Filippo Giunchedi: "Unfortunately to test this I had to test-in-production on centrallog2002, where this change is live now and puppet disabled. I was able to" [puppet] - https://gerrit.wikimedia.org/r/1053881 (https://phabricator.wikimedia.org/T369737) (owner: Filippo Giunchedi)
[09:20:40] !log dcausse@deploy1002 Finished scap: Backport for [[gerrit:1053838|Re-add CirrusSearch prefix to statsd metrics (T359033)]] (duration: 09m 44s)
[09:20:44] T359033: EPIC: Convert CirrusSearch metrics to statslib - https://phabricator.wikimedia.org/T359033
[09:23:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 10%: stopping T367781', diff saved to https://phabricator.wikimedia.org/P66386 and previous config saved to /var/cache/conftool/dbconfig/20240712-092354-arnaudb.json
[09:23:59] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[09:24:01] (CR) Jelto: [V:-1] "pcc fails with "Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function " [puppet] - https://gerrit.wikimedia.org/r/1053879 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[09:30:09] (PS1) Urbanecm: CommunityConfiguration: Release to all Growth wikis, except frwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1053884 (https://phabricator.wikimedia.org/T366458)
[09:30:42] (CR) Urbanecm: [C:-2] "not yet" [mediawiki-config] - https://gerrit.wikimedia.org/r/1053884 (https://phabricator.wikimedia.org/T366458) (owner: Urbanecm)
[09:37:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T367856)', diff saved to https://phabricator.wikimedia.org/P66387 and previous config saved to /var/cache/conftool/dbconfig/20240712-093700-marostegui.json
[09:37:05] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[09:39:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 25%: stopping T367781', diff saved to https://phabricator.wikimedia.org/P66388 and previous config saved to /var/cache/conftool/dbconfig/20240712-093900-arnaudb.json
[09:39:04] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[09:39:42] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1053885
[09:39:43] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1053885 (owner: TrainBranchBot)
[09:44:27] (CR) Filippo Giunchedi: [C:+2] benthos: adjust webrequest_live batching [puppet] - https://gerrit.wikimedia.org/r/1053881 (https://phabricator.wikimedia.org/T369737) (owner: Filippo Giunchedi)
[09:52:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P66389 and previous config saved to /var/cache/conftool/dbconfig/20240712-095207-marostegui.json
[09:52:16] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[09:52:47] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[09:52:48] ops-eqiad, SRE, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T368766#9976086 (phaultfinder)
[09:53:39] !log temp stop benthos@webrequest_live on centrallog1002 - T369737
[09:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:42] T369737: Site Issue: Delayed data in the `webrequest_sampled_live` kafka topic - https://phabricator.wikimedia.org/T369737
[09:54:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 50%: stopping T367781', diff saved to https://phabricator.wikimedia.org/P66391 and previous config saved to /var/cache/conftool/dbconfig/20240712-095405-arnaudb.json
[09:54:10] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[09:57:30] (CR) Stevemunene: [C:+1] wdqs graph split: route / to miscweb microsite [puppet] - https://gerrit.wikimedia.org/r/1053756 (https://phabricator.wikimedia.org/T364367) (owner: Ryan Kemper)
[09:59:47] (PS2) Jelto: gitlab: switch gitlab from iptables to nftables [puppet] - https://gerrit.wikimedia.org/r/1053879 (https://phabricator.wikimedia.org/T366882)
[10:05:25] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3210/co" [puppet] - https://gerrit.wikimedia.org/r/1053879 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[10:05:32] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1053885 (owner: TrainBranchBot)
[10:07:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P66392 and previous config saved to /var/cache/conftool/dbconfig/20240712-100714-marostegui.json
[10:09:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 75%: stopping T367781', diff saved to https://phabricator.wikimedia.org/P66393 and previous config saved to /var/cache/conftool/dbconfig/20240712-100910-arnaudb.json
[10:09:14] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[10:14:22] (CR) Clément Goubert: [C:+1] cxserver: update outdated comments on chart values [deployment-charts] - https://gerrit.wikimedia.org/r/1053805 (https://phabricator.wikimedia.org/T367949) (owner: Scott French)
[10:15:06] (CR) Clément Goubert: [C:+1] mobileapps: update references to deprecated services [deployment-charts] - https://gerrit.wikimedia.org/r/1053806 (https://phabricator.wikimedia.org/T367949) (owner: Scott French)
[10:15:18] (CR) Clément Goubert: [C:+1] push-notifications: update references to deprecated services [deployment-charts] - https://gerrit.wikimedia.org/r/1053807 (https://phabricator.wikimedia.org/T367949) (owner: Scott French)
[10:15:48] (CR) Clément Goubert: [C:+1] wikifeeds: update references to deprecated services [deployment-charts] - https://gerrit.wikimedia.org/r/1053808 (https://phabricator.wikimedia.org/T367949) (owner: Scott French)
[10:18:08] SRE, serviceops: VRT wiki fails to create account - https://phabricator.wikimedia.org/T359901#9976121 (Krd) The problem occurs just now, created one account, cannot create another one.
[10:18:44] !log stop benthos@webrequest_live on centrallog2002 and start it on centrallog1002 - T369737 [10:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:48] T369737: Site Issue: Delayed data in the `webrequest_sampled_live` kafka topic - https://phabricator.wikimedia.org/T369737 [10:22:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T367856)', diff saved to https://phabricator.wikimedia.org/P66394 and previous config saved to /var/cache/conftool/dbconfig/20240712-102221-marostegui.json [10:22:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [10:22:26] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [10:22:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [10:22:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T367856)', diff saved to https://phabricator.wikimedia.org/P66395 and previous config saved to /var/cache/conftool/dbconfig/20240712-102243-marostegui.json [10:24:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 100%: stopping T367781', diff saved to https://phabricator.wikimedia.org/P66396 and previous config saved to /var/cache/conftool/dbconfig/20240712-102416-arnaudb.json [10:24:20] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [10:26:27] (03PS1) 10Filippo Giunchedi: benthos: further tweak webrequest_live out batching and threads [puppet] - 10https://gerrit.wikimedia.org/r/1053890 (https://phabricator.wikimedia.org/T369737) [10:50:04] (03CR) 10Elukey: [C:03+1] benthos: further tweak webrequest_live out batching and threads [puppet] - 10https://gerrit.wikimedia.org/r/1053890 (https://phabricator.wikimedia.org/T369737) (owner: 10Filippo Giunchedi) [11:00:04] 
Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240712T0700) [11:00:04] eoghan, jelto, arnoldokoth, and mutante: That opportune time for a GitLab version upgrades deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240712T1100). [11:04:18] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:07:52] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Request for Kerb credentials for Ariel Glenn - https://phabricator.wikimedia.org/T368911#9976186 (10ArielGlenn) 05In progress→03Resolved a:05ArielGlenn→03Dzahn Hey Daniel, I'd just assumed that getting added to the analytics-privatedata-users grou... [11:37:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:38:03] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9976274 (10cmooney) 05Open→03Resolved [11:51:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [11:54:45] (03CR) 10Filippo Giunchedi: [C:03+2] benthos: further tweak webrequest_live out batching and threads [puppet] - 10https://gerrit.wikimedia.org/r/1053890 
(https://phabricator.wikimedia.org/T369737) (owner: 10Filippo Giunchedi) [11:56:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [12:30:02] (03CR) 10Marostegui: mysqld-exporter: hotfix config for es1 to es5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) (owner: 10Arnaudb) [12:43:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [12:51:28] (03PS1) 10JMeybohm: Initial commit validating-admission-policies chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053911 (https://phabricator.wikimedia.org/T368251) [12:52:18] (03CR) 10CI reject: [V:04-1] Initial commit validating-admission-policies chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053911 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [12:54:10] hnowlan: I think the jobrunners are suffering with the increased concurrency [12:54:17] s/jobrunners/videoscalers/ [12:59:45] (03CR) 10Ssingh: [C:03+1] "Looks good! We should merge this on Monday now, even though it is a safe change." 
[puppet] - 10https://gerrit.wikimedia.org/r/1050417 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [13:00:36] (03CR) 10CDanis: [C:03+1] benthos: further tweak webrequest_live out batching and threads [puppet] - 10https://gerrit.wikimedia.org/r/1053890 (https://phabricator.wikimedia.org/T369737) (owner: 10Filippo Giunchedi) [13:03:04] (03PS1) 10Elukey: cfssl: add a condition to cfssl_ocsprefresh.py [puppet] - 10https://gerrit.wikimedia.org/r/1053913 (https://phabricator.wikimedia.org/T363829) [13:03:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:03:29] ugh it's a bunch of normal transcodes enqueued all at once this morning [13:03:54] (03PS1) 10Hashar: cache::text: remove git.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1006979 (https://phabricator.wikimedia.org/T323073) (owner: 10Dzahn) [13:04:01] (03PS2) 10Dzahn: cache::text: remove git.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1006979 (https://phabricator.wikimedia.org/T323073) [13:04:38] (03CR) 10Hashar: [C:03+1] "Please remove it :)" [puppet] - 10https://gerrit.wikimedia.org/r/1006979 (https://phabricator.wikimedia.org/T323073) (owner: 10Dzahn) [13:07:44] (03PS3) 10Hashar: cache::text: remove git.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1006979 (https://phabricator.wikimedia.org/T323073) (owner: 10Dzahn) [13:08:24] (03CR) 10Cathal Mooney: [C:03+2] Announce Anycast ranges from Network POPs [homer/public] - 10https://gerrit.wikimedia.org/r/1052086 (https://phabricator.wikimedia.org/T367439) (owner: 10Cathal Mooney) [13:08:53] (03Merged) 10jenkins-bot: Announce Anycast ranges from Network POPs [homer/public] - 10https://gerrit.wikimedia.org/r/1052086 
(https://phabricator.wikimedia.org/T367439) (owner: 10Cathal Mooney) [13:09:06] (03CR) 10Hashar: cache::text: remove git.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1006979 (https://phabricator.wikimedia.org/T323073) (owner: 10Dzahn) [13:09:56] (03PS4) 10Hashar: phabricator: remove git.wikimedia.org vhost, rewrites and tests [puppet] - 10https://gerrit.wikimedia.org/r/1006982 (https://phabricator.wikimedia.org/T323073) (owner: 10Dzahn) [13:10:20] !log pushing updated BGP policy to cr2-eqord and cr2-eqdfw to announce Anycast ranges from network pops (T367439) [13:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:24] T367439: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439 [13:11:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:12:27] !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [13:12:34] (03PS4) 10JMeybohm: Add kyverno_policy_parser [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052964 (https://phabricator.wikimedia.org/T368251) [13:12:34] (03PS2) 10JMeybohm: Initial commit validating-admission-policies chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053911 (https://phabricator.wikimedia.org/T368251) [13:13:13] (03CR) 10CI reject: [V:04-1] Add kyverno_policy_parser [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052964 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [13:13:35] (03CR) 10CI reject: [V:04-1] Initial commit validating-admission-policies chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053911 
(https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [13:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:14:03] (03PS4) 10Arnaudb: mysqld-exporter: hotfix config for es1 to es5 [puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) [13:18:13] !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [13:18:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:18:21] !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [13:19:48] !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [13:21:30] !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [13:21:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:21:38] !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [13:21:51] !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [13:22:44] !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] DONE 
helmfile.d/aus-k8s-eqiad-services/jaeger: apply [13:22:49] (03PS1) 10Jelto: gitlab: introduce log rotation settings [puppet] - 10https://gerrit.wikimedia.org/r/1053919 (https://phabricator.wikimedia.org/T369837) [13:24:59] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3211/co" [puppet] - 10https://gerrit.wikimedia.org/r/1053919 (https://phabricator.wikimedia.org/T369837) (owner: 10Jelto) [13:30:11] (03PS2) 10Jelto: gitlab: introduce log rotation settings [puppet] - 10https://gerrit.wikimedia.org/r/1053919 (https://phabricator.wikimedia.org/T369837) [13:31:05] (03CR) 10Hashar: git: remove umask from git::clone (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/927986 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [13:31:42] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3212/co" [puppet] - 10https://gerrit.wikimedia.org/r/1053919 (https://phabricator.wikimedia.org/T369837) (owner: 10Jelto) [13:33:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:33:23] (03PS14) 10Hashar: git: remove umask from git::clone [puppet] - 10https://gerrit.wikimedia.org/r/927986 (https://phabricator.wikimedia.org/T338277) [13:35:39] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9976576 (10Jhancock.wm) dbproxy2006 temp 1G -> B7 lsw port 47 dbproxy2007 temp 1G -> C7 asw port 43 dbproxy2008 temp 1G -> D4 asw port 43 [13:36:20] 10ops-eqiad, 06SRE, 06DC-Ops: 10gbit nic option for 
centrallog1002 - https://phabricator.wikimedia.org/T369825#9976580 (10fgiunchedi) >>! In T369825#9974444, @VRiley-WMF wrote: > @wiki_willy Yes, I was able to locate one. @fgiunchedi is there an estimated time and date for us to bring the server down and ins... [13:36:24] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9976578 (10Jhancock.wm) a:05Jhancock.wm→03Papaul [13:37:39] (03PS3) 10JMeybohm: Initial commit validating-admission-policies chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053911 (https://phabricator.wikimedia.org/T368251) [13:37:42] 10ops-codfw, 06SRE, 06DC-Ops: 10gbit nic option for centrallog2002 - https://phabricator.wikimedia.org/T369826#9976583 (10fgiunchedi) Thank you @Jhancock.wm that's great! Please LMK a day and time of next week that would work for you [13:39:01] (03CR) 10CI reject: [V:04-1] Initial commit validating-admission-policies chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053911 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [13:39:15] claime: I can scale it back :/ ideally letting it catch up would free us up but it's just noise for now [13:40:09] (03PS1) 10Ssingh: P:dns::auth::update: maintain admin_state via confd [puppet] - 10https://gerrit.wikimedia.org/r/1053929 (https://phabricator.wikimedia.org/T369366) [13:41:13] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1053929 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [13:43:01] (03PS1) 10Hnowlan: Revert "changeprop-jobqueue: increase prioritised video concurrency" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053932 [13:43:29] (03CR) 10Ssingh: [V:03+1] "Schema for this is still being discussed in T369366. But we can update this CR if that changes." 
[puppet] - 10https://gerrit.wikimedia.org/r/1053929 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [13:43:36] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [13:45:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:47:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:54:28] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9976652 (10Papaul) @Marostegui thank you for checking. You are right looks like the host still has it's IPV6 or we remove it after the re-image in netbox. ` 2: ens1f0np0: (03PS1) 10JMeybohm: New upstream version 3.11.3 [debs/helm3] - 10https://gerrit.wikimedia.org/r/1053934 (https://phabricator.wikimedia.org/T368251) [13:58:17] (03PS1) 10Cathal Mooney: Adjust route generation for Anycast ranges at eqord [homer/public] - 10https://gerrit.wikimedia.org/r/1053935 (https://phabricator.wikimedia.org/T367439) [13:59:12] (03PS1) 10Elukey: pki: add the Traffic's project Puppet CA to client_auth_CA.pem in cloud [puppet] - 10https://gerrit.wikimedia.org/r/1053937 (https://phabricator.wikimedia.org/T355750) [14:00:18] (03CR) 10Elukey: "John did a similar thing for deployment-prep: https://phabricator.wikimedia.org/rOPUP4a6f0f36f396ab924a499d061723ec869eac1ee3" [puppet] - 10https://gerrit.wikimedia.org/r/1053937 (https://phabricator.wikimedia.org/T355750) (owner: 10Elukey) [14:02:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:04:23] 06SRE, 06Traffic, 13Patch-For-Review: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#9976672 (10Joe) A couple of notes: * Overriding `name` to always be the same, and just one object per tag group, makes syncing and querying less efficient an... [14:06:46] (03CR) 10Vgutierrez: [C:04-1] pki: add the Traffic's project Puppet CA to client_auth_CA.pem in cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1053937 (https://phabricator.wikimedia.org/T355750) (owner: 10Elukey) [14:10:23] (03CR) 10JMeybohm: "Package builds fine locally..." [debs/helm3] - 10https://gerrit.wikimedia.org/r/1053934 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [14:11:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:15:25] (03CR) 10Hnowlan: [C:03+2] Revert "changeprop-jobqueue: increase prioritised video concurrency" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053932 (owner: 10Hnowlan) [14:16:19] (03Merged) 10jenkins-bot: Revert "changeprop-jobqueue: increase prioritised video concurrency" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053932 (owner: 10Hnowlan) [14:16:37] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransc2001 - https://phabricator.wikimedia.org/T367816#9976703 (10Jhancock.wm) a:03Jhancock.wm [14:18:29] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [14:19:26] !log hnowlan@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/changeprop-jobqueue: apply [14:19:27] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [14:20:32] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [14:21:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:21:39] (03CR) 10Vgutierrez: [C:04-1] "Adding brett here since he took care of deploying the new puppetserver. Puppet CA on traffic-puppetserver-bookworm:" [puppet] - 10https://gerrit.wikimedia.org/r/1053937 (https://phabricator.wikimedia.org/T355750) (owner: 10Elukey) [14:24:05] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 50 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:26:30] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:34:08] !log elukey@cumin1002 START - Cookbook sre.network.tls for network device lsw1-d3-codfw [14:36:19] !log elukey@cumin1002 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-d3-codfw [14:39:18] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:43:38] 06SRE, 06Traffic, 13Patch-For-Review: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#9976790 (10ssingh) Thanks for the feedback @Joe! >>! In T369366#9976672, @Joe wrote: > A couple of notes: > > * Overriding `name` to always be the same, and... [14:44:47] (03PS4) 10Ssingh: P:conftool: add schema for geodns [puppet] - 10https://gerrit.wikimedia.org/r/1053323 (https://phabricator.wikimedia.org/T369366) [14:45:03] PROBLEM - BGP status on cr1-magru is CRITICAL: BGP CRITICAL - AS12956/IPv4: Connect - Telxius, AS12956/IPv6: Connect - Telxius https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:45:55] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:45:59] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:46:01] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:46:04] topranks: ^ [14:46:24] * topranks looking [14:46:26] also wtf [14:46:26] <3 [14:46:28] yeap [14:46:30] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:46:46] actually just probably the transport down [14:49:08] (03CR) 10Ssingh: "Final form:" [puppet] - 10https://gerrit.wikimedia.org/r/1053323 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [14:49:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - 
https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:50:06] 06SRE, 06Traffic, 13Patch-For-Review: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#9976817 (10ssingh) Final (famous last words) form: ` confctl --object-type geodns select 'geodns=generic-map,name=eqiad' get ` with the key being: ` /conft... [14:53:19] (03PS1) 10Clément Goubert: Reimage 3 kubernetes servers to videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/1053940 [14:54:03] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 26 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:55:19] (03PS3) 10Cwhite: opensearch: tune watermark settings to node disktype [puppet] - 10https://gerrit.wikimedia.org/r/1053686 (https://phabricator.wikimedia.org/T368168) [14:55:40] !log Draining and depooling mw1349, mw1350, mw1351 for reimage as jobrunners [14:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:21] (03PS4) 10Cwhite: opensearch: tune watermark settings to node disktype [puppet] - 10https://gerrit.wikimedia.org/r/1053686 (https://phabricator.wikimedia.org/T368168) [14:56:33] sukhe: it's a planned maintenance, in the calendar [14:57:21] ah ok. sorry for the noise! [14:57:35] (but since we just sent Brazil yesterday, wanted to make sure... 
:) [14:58:02] (03PS2) 10Clément Goubert: Reimage 3 kubernetes servers to videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/1053940 [14:58:18] heh yep [14:58:25] !log cgoubert@cumin1002 conftool action : set/pooled=inactive; selector: name=(mw1349.eqiad.wmnet|mw1350.eqiad.wmnet|mw1351.eqiad.wmnet),cluster=kubernetes,service=kubesvc [14:58:38] traffic to eqiad routing via the transport to dfw (Dallas Fort Worth) [14:58:43] https://www.irccloud.com/pastebin/tUzRyBXL/ [14:58:51] (03CR) 10Hnowlan: [C:03+1] Reimage 3 kubernetes servers to videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/1053940 (owner: 10Clément Goubert) [14:59:18] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:47] that circuit was already the "busy one" of the two [15:02:42] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:03:01] (03CR) 10Clément Goubert: [C:03+2] Reimage 3 kubernetes servers to videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/1053940 (owner: 10Clément Goubert) [15:03:03] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:04:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T367856)', diff saved to https://phabricator.wikimedia.org/P66404 and previous config saved to /var/cache/conftool/dbconfig/20240712-150400-marostegui.json [15:04:05] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [15:04:15] FIRING: VideoscalerPHPBusyWorkers: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DVideoscalerPHPBusyWorkers [15:04:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [15:04:41] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:04:53] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:06:04] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1349.eqiad.wmnet with OS buster [15:06:38] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1350.eqiad.wmnet with OS buster [15:07:01] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1351.eqiad.wmnet with OS buster [15:07:23] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:07:28] (03PS2) 10Ssingh: P:dns::auth::update: maintain admin_state via confd [puppet] - 10https://gerrit.wikimedia.org/r/1053929 (https://phabricator.wikimedia.org/T369366) [15:07:56] (03CR) 10Ssingh: "admin_state.tpl.erb updated for the schema change." 
[puppet] - 10https://gerrit.wikimedia.org/r/1053929 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [15:08:04] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:08:29] PROBLEM - Host mw1349 is DOWN: PING CRITICAL - Packet loss = 100% [15:08:29] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1053929 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [15:09:03] PROBLEM - Host mw1350 is DOWN: PING CRITICAL - Packet loss = 100% [15:09:43] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1437.eqiad.wmnet, mw1407.eqiad.wmnet are marked down but pooled: jobrunner_443: Servers mw1420.eqiad.wmnet, mw1437.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:09:57] PROBLEM - Host mw1351 is DOWN: PING CRITICAL - Packet loss = 100% [15:09:59] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1420.eqiad.wmnet, mw1437.eqiad.wmnet are marked down but pooled: jobrunner_443: Servers mw1420.eqiad.wmnet, mw1437.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:10:24] videoscalers? 
[15:10:35] RECOVERY - Host mw1349 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [15:10:38] hmm [15:10:42] flapping [15:11:33] RECOVERY - Host mw1350 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [15:11:37] RECOVERY - Host mw1351 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [15:11:59] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:12:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [15:12:43] (03CR) 10Cwhite: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1053686 (https://phabricator.wikimedia.org/T368168) (owner: 10Cwhite) [15:13:33] FIRING: [2x] KubernetesCalicoDown: mw1349.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:14:21] (03PS1) 10Scott French: commons-impact-analytics: bump image to v1.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053944 (https://phabricator.wikimedia.org/T369745) [15:14:59] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1437.eqiad.wmnet, mw1438.eqiad.wmnet are marked down but pooled: jobrunner_443: Servers mw1437.eqiad.wmnet, mw1438.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:15:22] !log homer 'cr*eqiad*' commit 'videoscaler reimages mw1349/mw135[01]' [15:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:00] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [15:17:21] !log hnowlan@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/changeprop-jobqueue: apply [15:17:22] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [15:17:25] (03CR) 10Mforns: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053944 (https://phabricator.wikimedia.org/T369745) (owner: 10Scott French) [15:17:47] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [15:18:04] (03CR) 10Scott French: [C:03+2] commons-impact-analytics: bump image to v1.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053944 (https://phabricator.wikimedia.org/T369745) (owner: 10Scott French) [15:18:33] FIRING: [3x] KubernetesCalicoDown: mw1349.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:18:43] PROBLEM - Host mw1349 is DOWN: PING CRITICAL - Packet loss = 100% [15:18:57] (03Merged) 10jenkins-bot: commons-impact-analytics: bump image to v1.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053944 (https://phabricator.wikimedia.org/T369745) (owner: 10Scott French) [15:19:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P66405 and previous config saved to /var/cache/conftool/dbconfig/20240712-151907-marostegui.json [15:19:17] PROBLEM - Host mw1350 is DOWN: PING CRITICAL - Packet loss = 100% [15:19:23] ^ reimages [15:19:37] RECOVERY - Host mw1349 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [15:20:08] weird that they weren't downtimed after their reboots [15:20:09] PROBLEM - Host mw1351 is DOWN: PING CRITICAL - Packet loss = 100% [15:20:11] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:20:11] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:20:11] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:20:15] RECOVERY - BGP status on cr1-magru is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:20:32] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [15:20:35] RECOVERY - Host mw1350 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [15:20:39] RECOVERY - Host mw1351 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [15:20:45] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply [15:21:05] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1349.eqiad.wmnet with reason: host reimage [15:21:35] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1350.eqiad.wmnet with reason: host reimage [15:21:43] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:21:53] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1351.eqiad.wmnet with reason: host reimage [15:21:59] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:23:28] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1349.eqiad.wmnet with reason: host reimage [15:25:58] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/commons-impact-analytics: apply [15:26:05] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1350.eqiad.wmnet with reason: host reimage [15:26:20] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/commons-impact-analytics: apply [15:30:46] PROBLEM - PyBal 
backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1437.eqiad.wmnet, mw1438.eqiad.wmnet are marked down but pooled: jobrunner_443: Servers mw1437.eqiad.wmnet, mw1438.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:31:00] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1437.eqiad.wmnet, mw1438.eqiad.wmnet are marked down but pooled: jobrunner_443: Servers mw1437.eqiad.wmnet, mw1438.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:32:55] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/commons-impact-analytics: apply [15:33:05] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1351.eqiad.wmnet with reason: host reimage [15:33:14] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/commons-impact-analytics: apply [15:34:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P66406 and previous config saved to /var/cache/conftool/dbconfig/20240712-153414-marostegui.json [15:35:35] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:50] (03CR) 10Vgutierrez: [C:04-1] "PCC isn't happy: https://puppet-compiler.wmflabs.org/output/1041705/3215/." 
[puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) (owner: 10CDobbins) [15:37:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:40:31] (03CR) 10Vgutierrez: [C:04-1] varnish: show better error for 429s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) (owner: 10CDobbins) [15:42:08] hnowlan: i bet they alerted because I had to clear them and reimage as new hosts, and there wasn't a puppet run on the alerting nodes in between [15:42:10] (03CR) 10Scott French: "Thanks, Riccardo!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1053801 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [15:42:21] (03CR) 10Scott French: [C:03+2] mediawiki: update siteinfo URL to use mw-api-int [software/spicerack] - 10https://gerrit.wikimedia.org/r/1053801 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [15:43:51] (03CR) 10CDanis: [C:03+1] cfssl: add a condition to cfssl_ocsprefresh.py [puppet] - 10https://gerrit.wikimedia.org/r/1053913 (https://phabricator.wikimedia.org/T363829) (owner: 10Elukey) [15:45:36] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:46:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host sretest2001.codfw.wmnet [15:46:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host sretest2001.codfw.wmnet [15:46:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host sretest2001.codfw.wmnet [15:47:40] !log dcausse@deploy1002 
helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:47:48] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:48:04] (03CR) 10RLazarus: [C:03+1] "Thanks for this!" [puppet] - 10https://gerrit.wikimedia.org/r/1053819 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [15:49:02] (03Merged) 10jenkins-bot: mediawiki: update siteinfo URL to use mw-api-int [software/spicerack] - 10https://gerrit.wikimedia.org/r/1053801 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [15:49:15] RESOLVED: VideoscalerPHPBusyWorkers: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DVideoscalerPHPBusyWorkers [15:49:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T367856)', diff saved to https://phabricator.wikimedia.org/P66407 and previous config saved to /var/cache/conftool/dbconfig/20240712-154921-marostegui.json [15:49:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2220.codfw.wmnet with reason: Maintenance [15:49:25] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [15:49:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2220.codfw.wmnet with reason: Maintenance [15:49:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2220 (T367856)', diff saved to https://phabricator.wikimedia.org/P66408 and previous config saved to /var/cache/conftool/dbconfig/20240712-154954-marostegui.json [15:55:48] (03CR) 10Scott French: [C:03+2] cxserver: update outdated comments on chart values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053805 
(https://phabricator.wikimedia.org/T367949) (owner: Scott French) [15:56:00] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: Create the python-release repository - https://phabricator.wikimedia.org/T367410#9977109 (elukey) [15:56:31] (CR) Scott French: "Thank you both for the review!" [deployment-charts] - https://gerrit.wikimedia.org/r/1053809 (https://phabricator.wikimedia.org/T367949) (owner: Scott French) [15:56:45] (Merged) jenkins-bot: cxserver: update outdated comments on chart values [deployment-charts] - https://gerrit.wikimedia.org/r/1053805 (https://phabricator.wikimedia.org/T367949) (owner: Scott French) [15:56:55] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: Create the python-release repository - https://phabricator.wikimedia.org/T367410#9977114 (elukey) [15:57:31] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: Create the python-release repository - https://phabricator.wikimedia.org/T367410#9977118 (elukey) Next steps: * Wait for https://gitlab.wikimedia.org/repos/sre/python-release/-/merge_requests/1 to be reviewed and merged. * Test p...
[15:57:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host sretest2001.codfw.wmnet [15:57:51] !log cgoubert@cumin1002 conftool action : set/pooled=no:weight=10; selector: name=(mw1349|mw1350|mw1351).eqiad.wmnet,cluster=jobrunner [15:58:16] (03CR) 10JHathaway: [C:03+1] "looks correct" [puppet] - 10https://gerrit.wikimedia.org/r/1053623 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [15:58:51] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1349.eqiad.wmnet with OS buster [15:59:59] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1350.eqiad.wmnet with OS buster [16:00:22] !log cgoubert@cumin1002 START - Cookbook sre.hosts.remove-downtime for mw1349.eqiad.wmnet [16:00:23] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw1349.eqiad.wmnet [16:01:17] !log cgoubert@cumin1002 START - Cookbook sre.hosts.remove-downtime for mw1350.eqiad.wmnet [16:01:17] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw1350.eqiad.wmnet [16:01:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1351.eqiad.wmnet with OS buster [16:03:58] !log cgoubert@cumin1002 conftool action : set/pooled=no:weight=10; selector: name=(mw1349|mw1350|mw1351).eqiad.wmnet,cluster=(jobrunner|videoscaler) [16:04:52] !log pooling mw1349, mw1350, mw1351 as jobrunners [16:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:57] !log cgoubert@cumin1002 conftool action : set/pooled=yes; selector: name=(mw1349|mw1350|mw1351).eqiad.wmnet,cluster=(jobrunner|videoscaler) [16:05:58] !log cgoubert@cumin1002 conftool action : set/pooled=inactive; selector: name=(mw1349|mw1350|mw1351).eqiad.wmnet,cluster=(jobrunner|videoscaler) [16:08:34] ugh. 
bad reimage, reimaging them again [16:08:49] (03PS6) 10Elukey: Allow to only report images of supported Debian versions [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/966200 (https://phabricator.wikimedia.org/T348876) (owner: 10JMeybohm) [16:09:12] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1349.eqiad.wmnet with OS buster [16:09:36] (03CR) 10CI reject: [V:04-1] Allow to only report images of supported Debian versions [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/966200 (https://phabricator.wikimedia.org/T348876) (owner: 10JMeybohm) [16:09:50] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1350.eqiad.wmnet with OS buster [16:09:54] (03CR) 10Elukey: "Added a little bit more verbosity to the log that tells if an image is not supported, so that we'll know straight away from the logs what " [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/966200 (https://phabricator.wikimedia.org/T348876) (owner: 10JMeybohm) [16:10:22] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#9977163 (10Papaul) I checked on sretest2001 it's trying to boot with PXELINUX version 6.03 [16:10:32] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1351.eqiad.wmnet with OS buster [16:16:58] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [16:17:04] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [16:19:18] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:23:32] (03CR) 10CDanis: [C:03+1] profile::puppetserver::gitprivate: fix post-commit hook 
[puppet] - 10https://gerrit.wikimedia.org/r/1053623 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [16:23:55] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1349.eqiad.wmnet with reason: host reimage [16:24:01] (03PS7) 10Elukey: Allow to only report images of supported Debian versions [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/966200 (https://phabricator.wikimedia.org/T348876) (owner: 10JMeybohm) [16:24:08] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1350.eqiad.wmnet with reason: host reimage [16:24:55] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1351.eqiad.wmnet with reason: host reimage [16:27:03] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1349.eqiad.wmnet with reason: host reimage [16:29:12] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1350.eqiad.wmnet with reason: host reimage [16:29:31] (03CR) 10Ebernhardson: [C:03+1] "good for deployment in backport window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052699 (https://phabricator.wikimedia.org/T369495) (owner: 10Lucas Werkmeister (WMDE)) [16:29:35] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frand200[12] - https://phabricator.wikimedia.org/T367804#9977307 (10Jhancock.wm) a:03Jhancock.wm [16:29:56] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransc2001 - https://phabricator.wikimedia.org/T367816#9977311 (10Jhancock.wm) [16:30:50] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install franio200[1-3] - https://phabricator.wikimedia.org/T367819#9977312 (10Jhancock.wm) a:03Jhancock.wm [16:32:18] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1351.eqiad.wmnet with reason: host reimage [16:35:00] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: 
Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920 (10RobH) 03NEW [16:35:28] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#9977352 (10RobH) [16:42:15] FIRING: VideoscalerPHPBusyWorkers: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DVideoscalerPHPBusyWorkers [16:44:18] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:52:15] RESOLVED: VideoscalerPHPBusyWorkers: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DVideoscalerPHPBusyWorkers [16:53:22] (03PS1) 10Andrew Bogott: deployment-prep: replace deploy03 with deploy04 [puppet] - 10https://gerrit.wikimedia.org/r/1053956 (https://phabricator.wikimedia.org/T327742) [16:58:13] (03PS4) 10Scott French: mediawiki-cache-warmup: prepare for bare-metal turndown [puppet] - 10https://gerrit.wikimedia.org/r/1053819 (https://phabricator.wikimedia.org/T367949) [16:59:30] (03PS2) 10Andrew Bogott: deployment-prep: replace deploy03 with deploy04 [puppet] - 10https://gerrit.wikimedia.org/r/1053956 (https://phabricator.wikimedia.org/T327742) [17:00:47] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1349.eqiad.wmnet with OS buster [17:01:07] 10ops-eqiad, 06DC-Ops, 
10fundraising-tech-ops: Q1:rack/setup/install frdb1003 - https://phabricator.wikimedia.org/T369922 (10RobH) 03NEW [17:01:29] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb1003 - https://phabricator.wikimedia.org/T369922#9977461 (10RobH) [17:02:45] (03PS3) 10Andrew Bogott: deployment-prep: replace deploy03 with deploy04 [puppet] - 10https://gerrit.wikimedia.org/r/1053956 (https://phabricator.wikimedia.org/T327742) [17:03:06] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1350.eqiad.wmnet with OS buster [17:06:21] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1351.eqiad.wmnet with OS buster [17:07:09] !log cgoubert@cumin1002 START - Cookbook sre.hosts.remove-downtime for mw[1350-1351].eqiad.wmnet [17:07:10] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw[1350-1351].eqiad.wmnet [17:07:32] !log hnowlan@cumin1002 START - Cookbook sre.hosts.remove-downtime for mw1349.eqiad.wmnet [17:07:33] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw1349.eqiad.wmnet [17:09:22] (03PS5) 10Scott French: mediawiki-cache-warmup: prepare for bare-metal turndown [puppet] - 10https://gerrit.wikimedia.org/r/1053819 (https://phabricator.wikimedia.org/T367949) [17:10:27] !log hnowlan@cumin1002 conftool action : set/pooled=yes:weight=10; selector: name=(mw1349.eqiad.wmnet|mw1350.eqiad.wmnet|mw1351.eqiad.wmnet) [17:11:00] (03CR) 10Dzahn: gitlab: replace ferm::service with firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1053877 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [17:11:25] (03CR) 10Scott French: "Thank you for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1053819 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [17:11:27] (03CR) 10BCornwall: "I didn't deploy the new puppetserver. I'll take care of recert signing." 
[puppet] - 10https://gerrit.wikimedia.org/r/1053937 (https://phabricator.wikimedia.org/T355750) (owner: 10Elukey) [17:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:25:52] (03CR) 10Dzahn: gitlab: introduce log rotation settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1053919 (https://phabricator.wikimedia.org/T369837) (owner: 10Jelto) [17:27:00] (03CR) 10Dzahn: [C:03+1] "left some small nitpicks about spelling but looks good to me regardless" [puppet] - 10https://gerrit.wikimedia.org/r/1053919 (https://phabricator.wikimedia.org/T369837) (owner: 10Jelto) [17:29:52] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransc2001 - https://phabricator.wikimedia.org/T367816#9977527 (10Jhancock.wm) a:05Jhancock.wm→03Papaul @Papaul idrac, bios, and new pwd set. ports are as follows. ETH0 <-> FASW-C8A eth-0/0/15 ETH1 <-> FASW-C8B eth-0/0/15 [17:32:09] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frand200[12] - https://phabricator.wikimedia.org/T367804#9977530 (10Jhancock.wm) a:05Jhancock.wm→03Papaul idrac, bios, pwd are set. ports are as follows. frand2001 ETH0 <-> FASW-C8A eth-0/0/17 ETH1 <-> FASW-C8B eth-0/0/17 fra... 
[17:32:28] (03CR) 10Dzahn: gitlab: switch gitlab from iptables to nftables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1053879 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [17:33:00] (03PS4) 10Scott French: switchdc: prepare mediawiki cache warmup for bare-metal turndown [cookbooks] - 10https://gerrit.wikimedia.org/r/1053823 (https://phabricator.wikimedia.org/T367949) [17:33:47] (03CR) 10Dzahn: "we can still switch it to firewall::service first and then only change the firewall::provider based on ./hosts/ to switch only the replica" [puppet] - 10https://gerrit.wikimedia.org/r/1053879 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [17:35:36] (03CR) 10Scott French: "I suspect this is the best interim solution until we sort out a plan of record in T369921." [cookbooks] - 10https://gerrit.wikimedia.org/r/1053823 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [17:44:52] (03CR) 10Thcipriani: "Thanks for this, we'll need to coordinate this with swapping over the beta periodic update jobs." 
[puppet] - 10https://gerrit.wikimedia.org/r/1053956 (https://phabricator.wikimedia.org/T327742) (owner: 10Andrew Bogott) [17:49:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:52:50] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:01:02] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:13:34] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:13:38] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:14:20] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:14:24] PROBLEM - BGP status on cr1-magru is CRITICAL: BGP CRITICAL - AS12956/IPv4: Connect - Telxius, AS12956/IPv6: Connect - Telxius https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:16:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T367856)', diff saved to https://phabricator.wikimedia.org/P66410 and previous config saved to /var/cache/conftool/dbconfig/20240712-181632-marostegui.json [18:16:37] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [18:20:26] RECOVERY - BGP status on cr1-magru is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:20:26] RECOVERY - OSPF status 
on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:20:36] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:20:38] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:31:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P66411 and previous config saved to /var/cache/conftool/dbconfig/20240712-183140-marostegui.json [18:43:00] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:46:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P66412 and previous config saved to /var/cache/conftool/dbconfig/20240712-184647-marostegui.json [18:48:15] FIRING: VideoscalerPHPBusyWorkers: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DVideoscalerPHPBusyWorkers [18:49:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:51:00] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1350.eqiad.wmnet, mw1351.eqiad.wmnet are marked down but pooled: jobrunner_443: Servers mw1350.eqiad.wmnet, mw1351.eqiad.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [18:51:02] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1350.eqiad.wmnet, mw1351.eqiad.wmnet are marked down but pooled: jobrunner_443: Servers mw1350.eqiad.wmnet, mw1351.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:52:25] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frban2002 - https://phabricator.wikimedia.org/T369931 (10RobH) 03NEW [18:52:50] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:53:14] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frban2002 - https://phabricator.wikimedia.org/T369931#9977789 (10RobH) [18:54:18] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:00:16] (03CR) 10Tchanders: [C:03+1] [CheckUser] Remove wgCheckUserEventTablesMigrationStage config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053297 (https://phabricator.wikimedia.org/T366546) (owner: 10Dreamy Jazz) [19:01:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T367856)', diff saved to https://phabricator.wikimedia.org/P66413 and previous config saved to /var/cache/conftool/dbconfig/20240712-190154-marostegui.json [19:01:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance [19:02:07] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [19:02:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance [19:02:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:02:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:02:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T367856)', diff saved to https://phabricator.wikimedia.org/P66414 and previous config saved to /var/cache/conftool/dbconfig/20240712-190224-marostegui.json [19:06:49] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frlog2002 - https://phabricator.wikimedia.org/T369935 (10RobH) 03NEW [19:07:42] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frlog2002 - https://phabricator.wikimedia.org/T369935#9977867 (10RobH) [19:10:49] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb1007 - https://phabricator.wikimedia.org/T369922#9977873 (10Dwisehaupt) [19:11:27] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb1007 - https://phabricator.wikimedia.org/T369922#9977876 (10Dwisehaupt) Updated the host name in the task since I hadn't bumped it up in the racking details of the parent task. Sorry about that. 
[19:12:30] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [19:17:57] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install civi2002, frpig2002, frpm2002 - https://phabricator.wikimedia.org/T369937 (10RobH) 03NEW [19:18:33] FIRING: [3x] KubernetesCalicoDown: mw1349.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:18:36] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install civi2002, frpig2002, frpm2002 - https://phabricator.wikimedia.org/T369937#9977917 (10RobH) [19:30:25] FIRING: SystemdUnitFailed: php7.4-fpm_check_restart.service on mw1437:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:37:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:39:42] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fran1002 - https://phabricator.wikimedia.org/T369940 (10RobH) 03NEW [19:40:16] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fran1002 - https://phabricator.wikimedia.org/T369940#9977994 (10RobH) [19:45:00] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:45:02] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [19:45:36] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:48:00] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1445.eqiad.wmnet, mw1420.eqiad.wmnet are marked down but pooled: jobrunner_443: Servers mw1445.eqiad.wmnet, mw1420.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:48:02] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1445.eqiad.wmnet, mw1420.eqiad.wmnet are marked down but pooled: jobrunner_443: Servers mw1445.eqiad.wmnet, mw1420.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:49:18] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:06:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053297 (https://phabricator.wikimedia.org/T366546) (owner: 10Dreamy Jazz) [20:09:46] (03CR) 10Andrew Bogott: [C:03+2] deployment-prep: replace deploy03 with deploy04 [puppet] - 10https://gerrit.wikimedia.org/r/1053956 (https://phabricator.wikimedia.org/T327742) (owner: 10Andrew Bogott) [20:23:34] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q#:rack/setup/install payments200[456] - https://phabricator.wikimedia.org/T369942 (10RobH) 03NEW [20:24:03] 10ops-codfw, 06DC-Ops, 
10fundraising-tech-ops: Q#:rack/setup/install payments200[456] - https://phabricator.wikimedia.org/T369942#9978089 (10RobH) [20:42:00] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frban1002 - https://phabricator.wikimedia.org/T369947 (10RobH) 03NEW [20:42:29] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frban1002 - https://phabricator.wikimedia.org/T369947#9978194 (10RobH) [20:58:24] 10ops-drmrs: determine cable ID for CRT-008647 - https://phabricator.wikimedia.org/T369951 (10RobH) 03NEW p:05Triage→03Medium [21:00:28] 06SRE, 06serviceops: deployment_server bullseye - mw-cgroup.service: Failed - https://phabricator.wikimedia.org/T363957#9978281 (10thcipriani) Thanks for documenting this, ran into the same thing in deployment prep (T327742), reboot also fixed it there. [21:06:07] (03CR) 10Brennen Bearnes: [C:03+1] phabricator: remove git.wikimedia.org vhost, rewrites and tests [puppet] - 10https://gerrit.wikimedia.org/r/1006982 (https://phabricator.wikimedia.org/T323073) (owner: 10Dzahn) [21:13:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:14:18] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:46:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T367856)', diff saved to https://phabricator.wikimedia.org/P66415 and previous config saved to /var/cache/conftool/dbconfig/20240712-214642-marostegui.json [21:46:46] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [21:57:54] (03CR) 10Cwhite: [C:03+2] 
opensearch: tune watermark settings to node disktype [puppet] - 10https://gerrit.wikimedia.org/r/1053686 (https://phabricator.wikimedia.org/T368168) (owner: 10Cwhite) [21:57:58] (03CR) 10Cwhite: [C:03+2] opensearch: add watermarks to instance params [puppet] - 10https://gerrit.wikimedia.org/r/1053682 (https://phabricator.wikimedia.org/T368168) (owner: 10Cwhite) [21:59:59] (03PS1) 10Thcipriani: Beta: update deployment-deploy04 IP [puppet] - 10https://gerrit.wikimedia.org/r/1053995 [22:01:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P66416 and previous config saved to /var/cache/conftool/dbconfig/20240712-220149-marostegui.json [22:09:18] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:12:52] (03CR) 10Clare Ming: [C:04-1] "Thank you @cgoubert@wikimedia.org for providing some clear next steps! 
I will work on implementing your suggestions here soon (I'm ooo for" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [22:16:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P66417 and previous config saved to /var/cache/conftool/dbconfig/20240712-221656-marostegui.json [22:21:32] !log removing 1 file for legal compliance [22:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T367856)', diff saved to https://phabricator.wikimedia.org/P66418 and previous config saved to /var/cache/conftool/dbconfig/20240712-223204-marostegui.json [22:32:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance [22:32:08] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [22:32:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance [22:32:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1223 (T367856)', diff saved to https://phabricator.wikimedia.org/P66419 and previous config saved to /var/cache/conftool/dbconfig/20240712-223226-marostegui.json [22:34:02] !log removing 1 file for legal compliance [22:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:30] FIRING: VideoscalerPHPBusyWorkers: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DVideoscalerPHPBusyWorkers [22:58:56] 10ops-codfw, 06SRE, 
06DC-Ops, 06Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#9978694 (10wiki_willy) Thanks for testing this out @Papaul. Since it appears that upgrading the WMF environment to PXELINUX version 6.04 may fix this i... [23:04:08] (03CR) 10CDobbins: [C:03+2] taskgen: Ignore ncredir domain typos [puppet] - 10https://gerrit.wikimedia.org/r/1053426 (owner: 10BCornwall) [23:12:30] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [23:18:02] (03PS10) 10Dzahn: mailman3: defined type to sync list members, create timers for each list [puppet] - 10https://gerrit.wikimedia.org/r/1053399 (https://phabricator.wikimedia.org/T351202) [23:18:03] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1053399/3218/lists1004.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1053399 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [23:18:48] FIRING: [3x] KubernetesCalicoDown: mw1349.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:19:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T367856)', diff saved to https://phabricator.wikimedia.org/P66420 and previous config saved to /var/cache/conftool/dbconfig/20240712-231912-marostegui.json [23:19:16] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [23:20:05] 06SRE, 06collaboration-services, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists, 13Patch-For-Review: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: 
Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#9978767 (10Dzahn) @Urbanecm https://... [23:24:00] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:24:02] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:25:36] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:27:00] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1437.eqiad.wmnet are marked down but pooled: jobrunner_443: Servers mw1437.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:27:02] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1437.eqiad.wmnet are marked down but pooled: jobrunner_443: Servers mw1437.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:30:36] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:30:40] FIRING: SystemdUnitFailed: php7.4-fpm_check_restart.service on mw1437:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:33:53] (03CR) 10RLazarus: switchdc: prepare mediawiki cache warmup for bare-metal turndown (031 comment) [cookbooks] - 
10https://gerrit.wikimedia.org/r/1053823 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [23:34:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P66421 and previous config saved to /var/cache/conftool/dbconfig/20240712-233419-marostegui.json [23:34:32] (03CR) 10Dzahn: [C:03+2] phabricator weekly changes email: Include EditEngine Form changes [puppet] - 10https://gerrit.wikimedia.org/r/1052953 (https://phabricator.wikimedia.org/T369548) (owner: 10Aklapper) [23:37:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:38:37] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1054000 [23:38:37] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1054000 (owner: 10TrainBranchBot) [23:39:47] (03CR) 10Dzahn: [C:03+2] "no problem here. the result of the query is an empty set though as of now" [puppet] - 10https://gerrit.wikimedia.org/r/1052953 (https://phabricator.wikimedia.org/T369548) (owner: 10Aklapper) [23:49:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P66422 and previous config saved to /var/cache/conftool/dbconfig/20240712-234926-marostegui.json