[00:00:35] !log removing 1 file for legal compliance
[00:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:04:29] SRE, LDAP-Access-Requests, WMF-Legal: Grant Access to wmf for Sspalding - https://phabricator.wikimedia.org/T380820 (SSpalding-WMF) NEW As the #WMF-Legal project tag was added to this task, some general information to avoid wrong expectations: Please note that public tasks in Wikimedia Phabricato...
[00:10:25] (PS1) Bking: wdqs102[567]: install OS [puppet] - https://gerrit.wikimedia.org/r/1097564 (https://phabricator.wikimedia.org/T378030)
[00:12:36] (PS2) Bking: wdqs102[567]: install OS [puppet] - https://gerrit.wikimedia.org/r/1097564 (https://phabricator.wikimedia.org/T378030)
[00:13:44] PROBLEM - PyBal backends health check on lvs7003 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb_80: Servers cp7014.magru.wmnet, cp7016.magru.wmnet, cp7010.magru.wmnet are marked down but pooled: testlb_80: Servers cp7004.magru.wmnet, cp7002.magru.wmnet are marked down but pooled: testlb_443: Servers cp7004.magru.wmnet, cp7002.magru.wmnet are marked down but pooled: uploadlb_443: Servers cp7014.magru.wmnet, cp7016.magru.wmnet, cp7010.ma
[00:13:44] t are marked down but pooled: textlb_80: Servers cp7004.magru.wmnet, cp7002.magru.wmnet are marked down but pooled: textlb_443: Servers cp7004.magru.wmnet, cp7002.magru.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:16:17] !log removing 6 files for legal compliance
[00:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:03] (CR) Stoyofuku-wmf: [C:+1] "Looks good!
The file's getting small 🎉" [mediawiki-config] - https://gerrit.wikimedia.org/r/1097484 (https://phabricator.wikimedia.org/T379799) (owner: Jdlrobson)
[00:28:03] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[00:28:07] T376150: Prepare 5 codfw hosts to serve wdqs-internal from main graph - https://phabricator.wikimedia.org/T376150
[00:29:32] SRE, collaboration-services: gitlab runners don't have the apt.wikimedia.org key - https://phabricator.wikimedia.org/T380164#10355941 (Dzahn) >>! In T380164#10335127, @MatthewVernon wrote: > Hm, yes, I took the path from a production host, where the key is installed into `/etc/apt/keyrings` by puppet...
[00:36:47] (CR) Jforrester: [C:+2] "Getting this onto the deployment server now so that the scap build for the train doesn't break." [mediawiki-config] - https://gerrit.wikimedia.org/r/1078105 (https://phabricator.wikimedia.org/T371592) (owner: Jforrester)
[00:37:30] (Merged) jenkins-bot: wikitech: Stop loading the i18n for LdapAuthentication, no longer used [mediawiki-config] - https://gerrit.wikimedia.org/r/1078105 (https://phabricator.wikimedia.org/T371592) (owner: Jforrester)
[00:38:20] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1097567
[00:38:20] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1097567 (owner: TrainBranchBot)
[00:40:41] (PS2) Jforrester: build: Upgrade mediawiki/mediawiki-codesniffer from v43.0.0 to v45.0.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1091320 (https://phabricator.wikimedia.org/T379955)
[00:40:47] (CR) Jforrester: [C:+2] build: Upgrade mediawiki/mediawiki-codesniffer from
v43.0.0 to v45.0.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1091320 (https://phabricator.wikimedia.org/T379955) (owner: Jforrester)
[00:41:33] (Merged) jenkins-bot: build: Upgrade mediawiki/mediawiki-codesniffer from v43.0.0 to v45.0.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1091320 (https://phabricator.wikimedia.org/T379955) (owner: Jforrester)
[00:49:34] (CR) Zabe: [C:+1] Introduce preinstall.dblist for wikis that haven't been installed yet [mediawiki-config] - https://gerrit.wikimedia.org/r/1096839 (https://phabricator.wikimedia.org/T352113) (owner: Tim Starling)
[00:52:24] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1097484 (https://phabricator.wikimedia.org/T379799) (owner: Jdlrobson)
[00:55:28] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp7015.magru.wmnet with OS bullseye
[00:55:34] ops-magru, DC-Ops, Infrastructure-Foundations, Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10355992 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp7015.magru.wmnet with OS bu...
[00:56:49] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1097567 (owner: TrainBranchBot)
[00:57:56] PROBLEM - Host lvs7003 is DOWN: PING CRITICAL - Packet loss = 100%
[01:01:26] RECOVERY - Host lvs7003 is UP: PING OK - Packet loss = 0%, RTA = 115.15 ms
[01:01:30] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wdqs[2026-2027].codfw.wmnet with reason: T376150
[01:01:35] T376150: Prepare 5 codfw hosts to serve wdqs-internal from main graph - https://phabricator.wikimedia.org/T376150
[01:01:44] PROBLEM - PyBal backends health check on lvs7003 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[01:01:48] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wdqs[2026-2027].codfw.wmnet with reason: T376150
[01:02:31] FIRING: Not accepting/receiving prefixes from anycast BGP peer: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[01:03:46] PROBLEM - pybal on lvs7003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[01:03:56] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2019.codfw.wmnet, repooling source-only afterwards
[01:04:10] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp7015.magru.wmnet with OS bullseye
[01:04:13] ops-magru, DC-Ops, Infrastructure-Foundations, Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10355998 (ops-monitoring-bot)
Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp7015.magru.wmnet with OS bullse...
[01:04:17] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2027.codfw.wmnet, repooling source-only afterwards
[01:04:54] PROBLEM - PyBal connections to etcd on lvs7003 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal
[01:06:17] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 10:00:00 on lvs[7001-7003].magru.wmnet with reason: site is depooled, maintenance
[01:06:33] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on lvs[7001-7003].magru.wmnet with reason: site is depooled, maintenance
[01:08:19] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1097575
[01:08:19] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1097575 (owner: TrainBranchBot)
[01:08:44] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2019.codfw.wmnet, repooling source-only afterwards
[01:08:48] T376150: Prepare 5 codfw hosts to serve wdqs-internal from main graph - https://phabricator.wikimedia.org/T376150
[01:21:46] RECOVERY - pybal on lvs7003 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[01:21:46] RECOVERY - PyBal backends health check on lvs7003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:24:54] RECOVERY - PyBal connections to etcd on lvs7003 is OK: OK: 16 connections established with conf1009.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal
[01:27:57] (Merged)
jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1097575 (owner: TrainBranchBot)
[01:29:03] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host dns7001.wikimedia.org with OS bookworm
[01:29:13] ops-magru, DC-Ops, Infrastructure-Foundations, Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10356021 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS...
[01:44:26] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:47:49] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2027.codfw.wmnet, repooling source-only afterwards
[01:48:00] T376150: Prepare 5 codfw hosts to serve wdqs-internal from main graph - https://phabricator.wikimedia.org/T376150
[01:51:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job haproxy in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:59:29] (CR) Andrew Bogott: [C:+2] Neutron: remove linuxbridge from mechanism_drivers [puppet] - https://gerrit.wikimedia.org/r/1092425 (https://phabricator.wikimedia.org/T326373) (owner: Andrew Bogott)
[02:08:16] (PS1) TrainBranchBot: Branch commit for wmf/1.44.0-wmf.5 [core] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1097581 (https://phabricator.wikimedia.org/T375664)
[02:08:17] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.44.0-wmf.5
[core] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1097581 (https://phabricator.wikimedia.org/T375664) (owner: TrainBranchBot)
[02:15:18] SRE, collaboration-services: gitlab runners don't have the apt.wikimedia.org key - https://phabricator.wikimedia.org/T380164#10356064 (BCornwall) @Dzahn This is probably best to be opened as another ticket - wmf-debci isn't handling the placement of those files, it's using the `docker-registry.wikime...
[02:16:51] (CR) Andrew Bogott: [C:+2] neutron.conf: remove [experimental] linuxbridge section [puppet] - https://gerrit.wikimedia.org/r/1094471 (https://phabricator.wikimedia.org/T326373) (owner: Andrew Bogott)
[02:22:06] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:24:05] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2020.codfw.wmnet, repooling source-only afterwards
[02:24:41] (Merged) jenkins-bot: Branch commit for wmf/1.44.0-wmf.5 [core] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1097581 (https://phabricator.wikimedia.org/T375664) (owner: TrainBranchBot)
[02:28:17] FIRING: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[02:28:51] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs2021.codfw.wmnet ->
wdqs2020.codfw.wmnet, repooling source-only afterwards
[02:29:02] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wdqs[2026-2027].codfw.wmnet with reason: T376150
[02:29:19] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wdqs[2026-2027].codfw.wmnet with reason: T376150
[02:31:35] ops-magru, DC-Ops, Infrastructure-Foundations, Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10356110 (BCornwall)
[02:32:02] T376150: Prepare 5 codfw hosts to serve wdqs-internal from main graph - https://phabricator.wikimedia.org/T376150
[02:34:15] !log Import libvmod-netmapper 1.9.1-1 into varnish-staging apt component
[02:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:36:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[02:41:45] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2201.codfw.wmnet with reason: Maintenance
[02:41:59] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2201.codfw.wmnet with reason: Maintenance
[03:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241126T0300)
[03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad -
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:06:40] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2019.codfw.wmnet, repooling neither afterwards
[03:07:25] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2019.codfw.wmnet, repooling neither afterwards
[03:07:33] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling neither afterwards
[03:08:17] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T376150, initialize wdqs internal scholarly tier) xfer scholarly_articles from wdqs2024.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling neither afterwards
[03:08:58] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 20:00:00 on wdqs[2018-2020,2026-2027].codfw.wmnet with reason: T376150 non-prod hosts
[03:09:20] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 20:00:00 on wdqs[2018-2020,2026-2027].codfw.wmnet with reason: T376150 non-prod hosts
[03:11:01] T376150: Prepare 5 codfw hosts to serve wdqs-internal from main graph - https://phabricator.wikimedia.org/T376150
[03:23:07] (PS1) MusikAnimal: Add BetaFeature for CodeMirror 6 [extensions/CodeMirror] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1097591 (https://phabricator.wikimedia.org/T376735)
[03:40:19] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2215.codfw.wmnet with reason: Maintenance
[03:40:33] !log ladsgroup@cumin1002 END (PASS) - Cookbook
sre.hosts.downtime (exit_code=0) for 6:00:00 on db2215.codfw.wmnet with reason: Maintenance
[03:40:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2215 (T380449)', diff saved to https://phabricator.wikimedia.org/P71163 and previous config saved to /var/cache/conftool/dbconfig/20241126-034040-ladsgroup.json
[03:41:19] T380449: Optimize two echo tables in x1 - https://phabricator.wikimedia.org/T380449
[04:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241126T0400)
[04:03:43] (CR) Vgutierrez: benthos: add benthos for haproxy debug functions (4 comments) [puppet] - https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: Fabfur)
[04:30:09] (CR) Vgutierrez: "I just found one domain that shouldn't be there apparently, besides that I'm wondering if we could group base domains in the same certific" [puppet] - https://gerrit.wikimedia.org/r/1092931 (owner: Ncmonitor)
[04:36:43] (CR) Vgutierrez: [C:+1] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1092930 (owner: Ncmonitor)
[04:42:06] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 26 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/CodeMirror] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1097591 (https://phabricator.wikimedia.org/T376735) (owner: MusikAnimal)
[05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241126T0500)
[05:02:31] FIRING: Not accepting/receiving prefixes from anycast BGP peer: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Not
accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[05:44:26] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:03:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[06:08:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[06:22:06] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:24:08] (PS1) Physikerwelt: Fix: handling of nullary macros [extensions/Math] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1097840 (https://phabricator.wikimedia.org/T380184)
[06:28:17] FIRING: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[06:36:47] FIRING: PuppetCertificateAboutToExpire:
Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[06:45:02] (CR) CI reject: [V:-1] Fix: handling of nullary macros [extensions/Math] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1097840 (https://phabricator.wikimedia.org/T380184) (owner: Physikerwelt)
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241126T0700)
[07:00:05] marostegui, Amir1, and arnaudb: Time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241126T0700).
[07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:32:25] (PS1) Giuseppe Lavagetto: UI bugfixes [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1097910
[07:32:39] !log joal@deploy2002 Started deploy [analytics/refinery@f48b8de]: Regular analytics weekly train [analytics/refinery@f48b8de2]
[07:32:41] (CR) Giuseppe Lavagetto: [V:+2 C:+2] UI bugfixes [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1097910 (owner: Giuseppe Lavagetto)
[07:33:10] !log oblivian@cumin1002 START
- Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "UI bugfixes - oblivian@cumin1002"
[07:33:12] !log oblivian@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: UI bugfixes - oblivian@cumin1002
[07:33:42] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: UI bugfixes - oblivian@cumin1002
[07:33:44] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "UI bugfixes - oblivian@cumin1002"
[07:34:42] !log joal@deploy2002 Finished deploy [analytics/refinery@f48b8de]: Regular analytics weekly train [analytics/refinery@f48b8de2] (duration: 02m 03s)
[07:34:52] !log joal@deploy2002 Started deploy [analytics/refinery@f48b8de] (thin): Regular analytics weekly train THIN [analytics/refinery@f48b8de2]
[07:35:28] !log joal@deploy2002 Finished deploy [analytics/refinery@f48b8de] (thin): Regular analytics weekly train THIN [analytics/refinery@f48b8de2] (duration: 00m 35s)
[07:35:53] !log joal@deploy2002 Started deploy [analytics/refinery@f48b8de] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@f48b8de2]
[07:36:23] !log joal@deploy2002 Finished deploy [analytics/refinery@f48b8de] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@f48b8de2] (duration: 00m 29s)
[07:36:27] (PS1) Muehlenhoff: Readd ganeti role to ganeti7003/7004 [puppet] - https://gerrit.wikimedia.org/r/1097911 (https://phabricator.wikimedia.org/T376737)
[07:41:57] (CR) Muehlenhoff: [C:+2] Readd ganeti role to ganeti7003/7004 [puppet] - https://gerrit.wikimedia.org/r/1097911 (https://phabricator.wikimedia.org/T376737) (owner: Muehlenhoff)
[07:54:15] !log arnaudb@cumin1002 START - Cookbook sre.mysql.depool db1233 - clone on db1246
[07:54:19] !log arnaudb@cumin1002 END (FAIL) - Cookbook
sre.mysql.depool (exit_code=99) db1233 - clone on db1246
[07:54:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'manual depool commit', diff saved to https://phabricator.wikimedia.org/P71164 and previous config saved to /var/cache/conftool/dbconfig/20241126-075433-arnaudb.json
[07:56:31] (CR) Daniel Kinzler: rest-gateway: order mw-api-int paths strictly (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1087528 (https://phabricator.wikimedia.org/T379097) (owner: Hnowlan)
[08:02:35] Amir1, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241126T0800).
[08:02:35] musikanimal: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:02:45] o/
[08:05:43] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti7003
[08:05:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti7003
[08:06:35] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti7004
[08:06:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti7004
[08:07:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7003.magru.wmnet
[08:08:48] any deployers around? I think this one's easy… just need to get a patch in wmf.5 before it deploys tomorrow
[08:11:56] this comment is concerning https://phabricator.wikimedia.org/T375664#10356362
[08:12:17] is wmf.5 not going to be deployed?
😢
[08:16:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7004.magru.wmnet
[08:17:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7003.magru.wmnet
[08:25:24] (PS1) Slyngshede: P:idp Add blackbox probe to production IDP [puppet] - https://gerrit.wikimedia.org/r/1097935 (https://phabricator.wikimedia.org/T380402)
[08:25:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti7003.magru.wmnet to cluster magru01 and group B3
[08:25:38] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7003.magru.wmnet to cluster magru01 and group B3
[08:28:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7004.magru.wmnet
[08:37:57] PROBLEM - MariaDB Replica SQL: s3 on db1150 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: arzwiki.
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:38:25] on it
[08:38:39] (PS2) Slyngshede: P:idp Add blackbox probe to production IDP [puppet] - https://gerrit.wikimedia.org/r/1097935 (https://phabricator.wikimedia.org/T380402)
[08:40:57] RECOVERY - MariaDB Replica SQL: s3 on db1150 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:42:03] (PS3) Slyngshede: P:idp Add blackbox probe to production IDP [puppet] - https://gerrit.wikimedia.org/r/1097935 (https://phabricator.wikimedia.org/T380402)
[08:42:52] (CR) Slyngshede: [V:+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4593/console" [puppet] - https://gerrit.wikimedia.org/r/1097935 (https://phabricator.wikimedia.org/T380402) (owner: Slyngshede)
[08:43:45] (PS1) Arnaudb: mariadb: pool back db1246 after maintenance [puppet] - https://gerrit.wikimedia.org/r/1097943 (https://phabricator.wikimedia.org/T374215)
[08:46:24] (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4594/console" [puppet] - https://gerrit.wikimedia.org/r/1097935 (https://phabricator.wikimedia.org/T380402) (owner: Slyngshede)
[08:46:39] !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[2005-2006,2015-2016].codfw.wmnet
[08:46:45] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[2005-2006,2015-2016].codfw.wmnet
[08:48:35] !log dcausse@deploy2002 Started deploy [airflow-dags/search@f969d75]: search: swift_upload.py moved to refinery/bin/
[08:48:51] !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[1005-1006,1015-1016].eqiad.wmnet
[08:49:00] !log jayme@cumin2002
END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[1005-1006,1015-1016].eqiad.wmnet
[08:49:03] !log dcausse@deploy2002 Finished deploy [airflow-dags/search@f969d75]: search: swift_upload.py moved to refinery/bin/ (duration: 00m 27s)
[08:49:46] (CR) Slyngshede: [V:+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4595/console" [puppet] - https://gerrit.wikimedia.org/r/1097935 (https://phabricator.wikimedia.org/T380402) (owner: Slyngshede)
[08:52:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti7004.magru.wmnet to cluster magru02 and group B4
[08:52:14] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7004.magru.wmnet to cluster magru02 and group B4
[08:52:30] (CR) Slyngshede: [V:+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4596/console" [puppet] - https://gerrit.wikimedia.org/r/1097935 (https://phabricator.wikimedia.org/T380402) (owner: Slyngshede)
[08:52:48] !log jayme@cumin2002 START - Cookbook sre.hosts.decommission for hosts kubernetes[2005-2006,2015-2016].codfw.wmnet,kubernetes[1005-1006,1015-1016].eqiad.wmnet
[08:53:21] (CR) JMeybohm: [C:+2] Decom kubernetes[12]0[01][56] dedicates sessionstore nodes [puppet] - https://gerrit.wikimedia.org/r/1097442 (https://phabricator.wikimedia.org/T379599) (owner: JMeybohm)
[08:55:37] (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4597/console" [puppet] - https://gerrit.wikimedia.org/r/1097935 (https://phabricator.wikimedia.org/T380402) (owner: Slyngshede)
[08:57:54] (CR) MVernon: [C:+1] mariadb: pool back db1246 after maintenance [puppet] - https://gerrit.wikimedia.org/r/1097943
(https://phabricator.wikimedia.org/T374215) (owner: Arnaudb)
[08:58:09] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:58:16] this is me
[08:58:24] (CR) Arnaudb: [C:+2] mariadb: pool back db1246 after maintenance [puppet] - https://gerrit.wikimedia.org/r/1097943 (https://phabricator.wikimedia.org/T374215) (owner: Arnaudb)
[08:58:57] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:59:32] (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4598/console" [puppet] - https://gerrit.wikimedia.org/r/1097935 (https://phabricator.wikimedia.org/T380402) (owner: Slyngshede)
[09:00:15] hashar and andre: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241126T0900)
[09:02:31] FIRING: Not accepting/receiving prefixes from anycast BGP peer: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[09:03:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[09:03:29] !log arnaudb@cumin1002 START - Cookbook
sre.mysql.clone of db1233.eqiad.wmnet onto db1246.eqiad.wmnet [09:03:32] (03PS4) 10Slyngshede: P:idp Add blackbox probe to production IDP [puppet] - 10https://gerrit.wikimedia.org/r/1097935 (https://phabricator.wikimedia.org/T380402) [09:03:44] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:03:48] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:04:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [09:05:28] PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp2027 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:06:30] RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp2027 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:08:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [09:09:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: 
MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [09:11:32] !log jmm@cumin2002 START - Cookbook sre.hosts.provision for host ganeti7003.mgmt.magru.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:11:45] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [09:12:33] (03CR) 10Elukey: [C:03+2] sre.hosts.{dhcp,reimage}: force tftp as default option [cookbooks] - 10https://gerrit.wikimedia.org/r/1092802 (https://phabricator.wikimedia.org/T363576) (owner: 10Elukey) [09:13:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [09:13:50] (03CR) 10Jelto: [C:03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1097367 (owner: 10JMeybohm) [09:14:00] PROBLEM - Host ganeti7003 is DOWN: PING CRITICAL - Packet loss = 100% [09:14:09] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Requesting access to analytics-privatedata-users group, sql_lab role, Kerberos Principal for Khantstop - https://phabricator.wikimedia.org/T379303#10356459 (10Gehel) p:05Triage→03High [09:14:13] (03CR) 10JMeybohm: [C:03+2] k8s.reboot-nodes: Allow to filter nodes by --query [cookbooks] - 10https://gerrit.wikimedia.org/r/1097367 (owner: 10JMeybohm) [09:16:12] hashar: hi, I checked the error from the nightly patches check job, it seems to be failing due to the same failure that made the presync fail. 
We are getting sporadic failures from the train-blockers app [09:16:32] https://www.irccloud.com/pastebin/NoUkT25o/ [09:19:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti7003.mgmt.magru.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:20:32] RECOVERY - Host ganeti7003 is UP: PING OK - Packet loss = 0%, RTA = 115.25 ms [09:20:44] (03Merged) 10jenkins-bot: k8s.reboot-nodes: Allow to filter nodes by --query [cookbooks] - 10https://gerrit.wikimedia.org/r/1097367 (owner: 10JMeybohm) [09:21:07] !log jayme@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubernetes[2005-2006,2015-2016].codfw.wmnet,kubernetes[1005-1006,1015-1016].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin2002" [09:21:49] !log jayme@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubernetes[2005-2006,2015-2016].codfw.wmnet,kubernetes[1005-1006,1015-1016].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin2002" [09:21:50] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:21:51] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kubernetes[2005-2006,2015-2016].codfw.wmnet,kubernetes[1005-1006,1015-1016].eqiad.wmnet [09:22:04] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 238, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:23:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [09:23:45] !log jmm@cumin2002 START - Cookbook sre.hosts.provision for host ganeti7004.mgmt.magru.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:25:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti7003.magru.wmnet to cluster magru01 and group B3 [09:25:14] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7003.magru.wmnet to cluster magru01 and group B3 [09:26:58] PROBLEM - Host ganeti7004 is DOWN: PING CRITICAL - Packet loss = 100% [09:27:10] (03PS1) 10Muehlenhoff: Temporarily remove ferm node check [cookbooks] - 10https://gerrit.wikimedia.org/r/1097955 [09:31:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti7004.mgmt.magru.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:31:30] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: SmartNotHealthy (instance stat1011:9100) - https://phabricator.wikimedia.org/T380835 (10LSobanski) 03NEW [09:32:27] RECOVERY - Host ganeti7004 is UP: PING OK - Packet loss = 0%, RTA = 115.22 ms [09:32:40] (03CR) 10CI reject: [V:04-1] Temporarily remove ferm node check [cookbooks] - 10https://gerrit.wikimedia.org/r/1097955 (owner: 10Muehlenhoff) [09:33:43] (03PS2) 10Muehlenhoff: Temporarily remove ferm node check [cookbooks] - 10https://gerrit.wikimedia.org/r/1097955 [09:38:19] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 322, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:39:33] (03CR) 10CI reject: [V:04-1] Temporarily remove ferm node check [cookbooks] - 10https://gerrit.wikimedia.org/r/1097955 (owner: 10Muehlenhoff) [09:40:38] (03PS2) 10Fabfur: hiera: add log ring to cp4039 [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) [09:40:51] (03CR) 
10Fabfur: benthos: add benthos for haproxy debug functions (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [09:41:07] (03CR) 10Fabfur: benthos: add benthos for haproxy debug functions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [09:43:33] (03PS3) 10Muehlenhoff: Temporarily remove ferm node check [cookbooks] - 10https://gerrit.wikimedia.org/r/1097955 [09:44:26] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:50:19] (03CR) 10Muehlenhoff: [C:03+2] Temporarily remove ferm node check [cookbooks] - 10https://gerrit.wikimedia.org/r/1097955 (owner: 10Muehlenhoff) [09:52:55] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti7003.magru.wmnet to cluster magru01 and group B3 [09:53:56] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7003.magru.wmnet to cluster magru01 and group B3 [09:54:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti7004.magru.wmnet to cluster magru02 and group B4 [09:56:01] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7004.magru.wmnet to cluster magru02 and group B4 [09:56:38] jnuche: sorry, your message got caught in between other notifications :) [09:57:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to drbd [09:58:02] hashar: np, it seems to be a name resolution issue: `{"error":"php_network_getaddresses: getaddrinfo for tools.db.svc.eqiad.wmflabs failed: Temporary failure in name resolution"}` [09:58:50] looks like toolforge has some issues [10:02:22] !log
jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7002.magru.wmnet to drbd [10:04:42] FIRING: JobUnavailable: Reduced availability for job squid in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:09:42] FIRING: [2x] JobUnavailable: Reduced availability for job benthos in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:11:54] 07sre-alert-triage, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Alert in need of triage: SmartNotHealthy (instance stat1011:9100) - https://phabricator.wikimedia.org/T380835#10356671 (10BTullis) [10:12:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to drbd [10:12:43] PROBLEM - Host install7001 is DOWN: PING CRITICAL - Packet loss = 100% [10:12:57] RECOVERY - Host install7001 is UP: PING OK - Packet loss = 0%, RTA = 115.67 ms [10:12:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7002.magru.wmnet to drbd [10:13:05] PROBLEM - Host ncredir7002 is DOWN: PING CRITICAL - Packet loss = 100% [10:13:17] RECOVERY - Host ncredir7002 is UP: PING OK - Packet loss = 0%, RTA = 115.63 ms [10:14:11] FIRING: [14x] ProbeDown: Service install7001:8080 has failed probes (http_squid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:14:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job benthos in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:15:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh7001.wikimedia.org to drbd [10:15:52] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum7002.magru.wmnet to drbd [10:18:35] PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:18:35] PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:18:53] PROBLEM - BFD status on asw1-b3-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:18:59] PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:19:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1095082 (https://phabricator.wikimedia.org/T380575) (owner: 10Gergő Tisza) [10:22:12] FIRING: [3x] JobUnavailable: Reduced availability for job benthos in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:26:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to drbd [10:26:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7002.magru.wmnet to drbd [10:26:07] PROBLEM - 
Host doh7001 is DOWN: PING CRITICAL - Packet loss = 100% [10:26:21] PROBLEM - Host durum7002 is DOWN: PING CRITICAL - Packet loss = 100% [10:26:25] RECOVERY - Host doh7001 is UP: PING OK - Packet loss = 0%, RTA = 115.61 ms [10:26:29] RECOVERY - Host durum7002 is UP: PING OK - Packet loss = 0%, RTA = 115.75 ms [10:27:19] RESOLVED: [4x] JobUnavailable: Reduced availability for job benthos in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:27:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to drbd [10:28:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh7002.wikimedia.org to drbd [10:28:17] FIRING: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:31:21] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1233.eqiad.wmnet onto db1246.eqiad.wmnet [10:34:42] FIRING: JobUnavailable: Reduced availability for job bird in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:36:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:38:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk 
type of durum7001.magru.wmnet to drbd [10:38:27] PROBLEM - Host doh7002 is DOWN: PING CRITICAL - Packet loss = 100% [10:38:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7002.wikimedia.org to drbd [10:38:29] PROBLEM - Host durum7001 is DOWN: PING CRITICAL - Packet loss = 100% [10:38:33] RECOVERY - Host durum7001 is UP: PING OK - Packet loss = 0%, RTA = 115.59 ms [10:38:40] FIRING: KubernetesRsyslogDown: rsyslog on kubernetes2013:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2013 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:38:59] RECOVERY - Host doh7002 is UP: PING OK - Packet loss = 0%, RTA = 115.55 ms [10:39:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job bird in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:40:21] PROBLEM - Bird Internet Routing Daemon on durum7001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [10:40:21] PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum7001 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [10:40:55] RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:41:21] RECOVERY - Bird Internet Routing Daemon on durum7001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [10:41:21] RECOVERY - 
Check if anycast-healthchecker and all configured threads are running on durum7001 is OK: OK: UP (pid=2358) and all threads (8) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [10:41:35] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:41:35] RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:42:01] RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:42:18] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:cp-text_eqiad and A:cp for 9.2.6-1wm2 [10:42:25] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:cp-upload_eqiad and A:cp for 9.2.6-1wm2 [10:42:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to drbd [10:42:47] jouncebot: nowandnext [10:42:47] For the next 0 hour(s) and 17 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241126T0900) [10:42:47] In 0 hour(s) and 17 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241126T1100) [10:43:30] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of bast7001.wikimedia.org to drbd [10:43:40] RESOLVED: KubernetesRsyslogDown: rsyslog on kubernetes2013:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2013 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:44:15] I am running the 
train now (ping andre) [10:44:29] it got delayed due to some DNS / WMCS issue this morning [10:45:08] thanks [10:45:11] ah [10:45:11] (03PS1) 10TrainBranchBot: testwikis to 1.44.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097980 (https://phabricator.wikimedia.org/T375664) [10:45:13] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.44.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097980 (https://phabricator.wikimedia.org/T375664) (owner: 10TrainBranchBot) [10:45:16] it did not even reach the testwikis :/ [10:45:32] ok then I'll wait for mcrouter [10:45:34] claime: if you had something in mind for the infra, feel free to push it I think [10:45:41] cause I am just doing the testwikis now [10:45:55] or is that some mediawiki-config change? [10:45:56] (03Merged) 10jenkins-bot: testwikis to 1.44.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097980 (https://phabricator.wikimedia.org/T375664) (owner: 10TrainBranchBot) [10:46:23] (03CR) 10Gmodena: [C:03+2] config: remove eventbus instrumentation setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062430 (https://phabricator.wikimedia.org/T363587) (owner: 10Gmodena) [10:46:34] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [10:46:51] hashar: nah, it's mcrouter, but e.ffie's taking care of it [10:46:55] it'll be fine [10:47:07] okish! 
[10:47:12] (03Merged) 10jenkins-bot: config: remove eventbus instrumentation setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062430 (https://phabricator.wikimedia.org/T363587) (owner: 10Gmodena) [10:47:37] !log hashar@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.5 refs T375664 [10:47:41] T375664: 1.44.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T375664 [10:47:41] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:48:17] RESOLVED: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:48:31] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:49:12] FIRING: [3x] JobUnavailable: Reduced availability for job benthos in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:50:43] (03PS1) 10Clément Goubert: Revert^2 "wikikube: Add wikikube-worker13[13-28]" [puppet] - 10https://gerrit.wikimedia.org/r/1097981 [10:52:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:52:32] well the train is only building the image currently [10:52:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to 
drbd [10:52:53] hashar: I am causing a few mediawiki memcached errors, I will prolly be done by the time your scap starts rolling out [10:52:57] PROBLEM - Host ncredir7001 is DOWN: PING CRITICAL - Packet loss = 100% [10:53:03] RECOVERY - Host ncredir7001 is UP: PING OK - Packet loss = 0%, RTA = 115.63 ms [10:53:26] effie: sounds good :) thanks ! [10:53:43] 06SRE, 10LDAP-Access-Requests, 06WMF-Legal: Grant Access to wmf for Sspalding - https://phabricator.wikimedia.org/T380820#10356875 (10Aklapper) [10:54:12] RESOLVED: [3x] JobUnavailable: Reduced availability for job benthos in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:54:51] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Sspalding - https://phabricator.wikimedia.org/T380820#10356878 (10Aklapper) Not sure why this was tagged with #WMF-Legal? [10:55:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 1%: repool', diff saved to https://phabricator.wikimedia.org/P71168 and previous config saved to /var/cache/conftool/dbconfig/20241126-105531-arnaudb.json [10:55:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 5%: repool', diff saved to https://phabricator.wikimedia.org/P71169 and previous config saved to /var/cache/conftool/dbconfig/20241126-105550-arnaudb.json [10:56:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to drbd [10:56:51] (03PS3) 10Fabfur: hiera: add log ring to cp4039 [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) [10:57:05] (03PS1) 10Matthias Mullie: Fix incorrect 'this' [extensions/UploadWizard] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1097983 [10:57:38] (03CR) 10Fabfur: benthos: add benthos for haproxy debug functions (031 comment) [puppet] -
10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [10:58:44] the image got built; it is publishing to the registry [10:59:45] (03CR) 10Matthias Mullie: [C:03+1] Fix incorrect 'this' [extensions/UploadWizard] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1097983 (owner: 10Matthias Mullie) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241126T1100) [11:00:17] FIRING: [2x] HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:01:10] Lucas_WMDE: if & when you have a moment I have a quick question at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseLexemeCirrusSearch/+/1097422/4/src/LexemeFieldDefinitions.php#63 regarding the ordering stability of Lexeme::getForms()->toArrayUnordered() [11:02:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of bast7001.wikimedia.org to drbd [11:03:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of prometheus7001.magru.wmnet to drbd [11:03:42] FIRING: [2x] JobUnavailable: Reduced availability for job fastnetmon in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:05:32] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [11:06:20] dcausse: replied [11:06:21] PROBLEM - PyBal backends health check on lvs7003 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb_80: Servers cp7014.magru.wmnet, cp7016.magru.wmnet,
cp7010.magru.wmnet are marked down but pooled: testlb_80: Servers cp7004.magru.wmnet, cp7002.magru.wmnet are marked down but pooled: testlb_443: Servers cp7004.magru.wmnet, cp7002.magru.wmnet are marked down but pooled: uploadlb_443: Servers cp7014.magru.wmnet, cp7016.magru.wmnet, cp7010.ma [11:06:21] t are marked down but pooled: textlb_80: Servers cp7004.magru.wmnet, cp7002.magru.wmnet are marked down but pooled: textlb_443: Servers cp7004.magru.wmnet, cp7002.magru.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:06:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to drbd [11:06:45] PROBLEM - Host netflow7001 is DOWN: PING CRITICAL - Packet loss = 100% [11:07:10] uh is that lvs7003 alert real? looks like a recent reimage [11:07:21] RECOVERY - Host netflow7001 is UP: PING OK - Packet loss = 0%, RTA = 115.48 ms [11:07:22] magru is depooled [11:07:28] ah [11:07:37] and I think fabfur is currently reimaging it [11:07:49] and lvs alerts should be also silenced [11:08:31] I'll downtime again for a few hours, if all goes well today we should reimage all hosts on magru [11:08:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job fastnetmon in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:09:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain [11:09:24] ack, thanks! 
[11:09:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain [11:10:02] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on lvs7001.magru.wmnet with reason: T376737 [11:10:16] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on lvs7001.magru.wmnet with reason: T376737 [11:10:17] RESOLVED: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:10:20] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on lvs7002.magru.wmnet with reason: T376737 [11:10:34] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on lvs7002.magru.wmnet with reason: T376737 [11:10:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 2%: repool', diff saved to https://phabricator.wikimedia.org/P71170 and previous config saved to /var/cache/conftool/dbconfig/20241126-111036-arnaudb.json [11:10:40] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on lvs7003.magru.wmnet with reason: T376737 [11:10:55] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on lvs7003.magru.wmnet with reason: T376737 [11:10:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 10%: repool', diff saved to https://phabricator.wikimedia.org/P71171 and previous config saved to /var/cache/conftool/dbconfig/20241126-111056-arnaudb.json [11:11:42] (03CR) 10Elukey: "Left some nits but LGTM, I'd test this on minikube with our current version of k8s and istio though.." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1097345 (https://phabricator.wikimedia.org/T380723) (owner: 10Klausman) [11:12:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:12:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain [11:13:09] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [11:16:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain [11:20:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain [11:21:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain [11:22:04] pff [11:22:16] https://boardgovcom.wikimedia.org/wiki/Main_Page (/srv/deployment/httpbb-tests/appserver/test_remnant.yaml:43) [11:22:16] Status code: expected 200, got 503. 
[11:22:16] Body: expected to contain 'Board Governance Committee', got '\n\n gmodena: ottomata: ^ I am reverting your patch [11:25:59] cause the train is ongoing [11:26:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 15%: repool', diff saved to https://phabricator.wikimedia.org/P71173 and previous config saved to /var/cache/conftool/dbconfig/20241126-112601-arnaudb.json [11:26:02] and I don't want to deploy that one [11:26:07] at least not now [11:26:13] PROBLEM - Bird Internet Routing Daemon on doh7001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [11:26:29] (03CR) 10Hashar: [C:03+2] Revert "config: remove eventbus instrumentation setting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097995 (owner: 10Hashar) [11:26:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain [11:26:53] (03PS1) 10David Caro: cloudcepmon1004: move to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1097997 [11:27:13] RECOVERY - Bird Internet Routing Daemon on doh7001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [11:27:16] (03Merged) 10jenkins-bot: Revert "config: remove eventbus instrumentation setting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097995 (owner: 10Hashar) [11:27:35] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:27:36] (03CR) 10David Caro: [C:03+2] cloudcepmon1004: move to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1097997 (owner: 10David Caro) [11:28:01] RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:28:25] !log failover Ganeti master in magru01 to 
ganeti7003 [11:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:52] !log hashar@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.5 refs T375664 [11:28:58] !log dcaro@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bullseye [11:29:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10357108 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1002 for host cloudcephmon1004.eq... [11:30:21] PROBLEM - ganeti-wconfd running on ganeti7001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [11:31:33] !log remove ganeti7001 from active Ganeti nodes in magru01 [11:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:21] PROBLEM - ganeti-confd running on ganeti7001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [11:34:22] PROBLEM - ganeti-noded running on ganeti7001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [11:35:47] (03Abandoned) 10Hnowlan: admin_ng: set a very high quota for shellbox-video [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085579 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [11:36:30] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:38:39] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on A:cp-upload_eqiad and A:cp for 9.2.6-1wm2 [11:38:41] !log fabfur@cumin1002 END (PASS) - Cookbook 
sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on A:cp-text_eqiad and A:cp for 9.2.6-1wm2 [11:40:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 10%: repool', diff saved to https://phabricator.wikimedia.org/P71174 and previous config saved to /var/cache/conftool/dbconfig/20241126-114047-arnaudb.json [11:41:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 25%: repool', diff saved to https://phabricator.wikimedia.org/P71175 and previous config saved to /var/cache/conftool/dbconfig/20241126-114106-arnaudb.json [11:43:38] (03PS1) 10Ladsgroup: Bump ratio of new parsercache key spec to 4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098006 (https://phabricator.wikimedia.org/T373037) [11:44:53] jouncebot: nowandnext [11:44:53] For the next 0 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241126T1100) [11:44:53] In 1 hour(s) and 15 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241126T1300) [11:45:18] (03CR) 10Ladsgroup: [C:03+2] Bump ratio of new parsercache key spec to 4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098006 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup) [11:45:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098006 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup) [11:45:51] Amir1: the train is still ongoing [11:46:05] ah okay [11:46:05] (03Merged) 10jenkins-bot: Bump ratio of new parsercache key spec to 4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098006 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup) [11:46:23] didn't see anything in the calendar and scroll up [11:48:44] yeah I am willing to move mediawiki deployments to another channel [11:48:54] 
this one has wayyy too much traffic nowadays [11:52:23] <_joe_> hashar: if the train is ongoing during a mw infra window you need to notify SRE serviceops [11:53:36] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:cp-text_esams and A:cp for 9.2.6-1wm2 [11:53:37] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:cp-upload_esams and A:cp for 9.2.6-1wm2 [11:53:50] _joe_: clément reached out here about it [11:53:56] about some memcached operation [11:54:44] !log hashar@deploy2002 Finished scap sync-world: testwikis to 1.44.0-wmf.5 refs T375664 (duration: 25m 52s) [11:54:48] T375664: 1.44.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T375664 [11:55:01] 25 minutes :( [11:55:07] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1098006|Bump ratio of new parsercache key spec to 4 (T373037)]] [11:55:11] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [11:55:35] ... 
[11:55:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 15%: repool', diff saved to https://phabricator.wikimedia.org/P71176 and previous config saved to /var/cache/conftool/dbconfig/20241126-115552-arnaudb.json [11:56:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 50%: repool', diff saved to https://phabricator.wikimedia.org/P71177 and previous config saved to /var/cache/conftool/dbconfig/20241126-115612-arnaudb.json [11:57:47] Amir1: you could at least have asked [11:57:57] cause now I have to wait for your scap in order to continue on the train [11:58:02] that has been ongoing for 3 hours already [11:58:04] :/ [11:59:14] hashar: it'll be done really quickly [12:00:03] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Sspalding - https://phabricator.wikimedia.org/T380820#10357162 (10Joe) [12:00:33] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:01:15] (btw I didn't intentionally start the scap, it was just holding the lock and when it got freed, it continued) [12:01:58] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1098006|Bump ratio of new parsercache key spec to 4 (T373037)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:02:00] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [12:02:02] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [12:02:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of prometheus7001.magru.wmnet to drbd [12:03:29] RECOVERY - Host prometheus7001 is UP: PING OK - Packet loss = 0%, RTA = 115.77 ms [12:03:52] Amir1: yeah it is magically continuing [12:04:05] and sorry I am upset with how long scap takes nowadays :b [12:04:58] wonder how we can make this better. 
For me, CI of the backports are now the biggest pain [12:05:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7002.magru.wmnet to plain [12:05:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7002.magru.wmnet to plain [12:07:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of bast7001.wikimedia.org to plain [12:07:06] FIRING: [13x] ProbeDown: Service ganeti7001:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:07:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of bast7001.wikimedia.org to plain [12:09:11] FIRING: [13x] ProbeDown: Service ganeti7001:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:09:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum7002.magru.wmnet to plain [12:10:29] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098006|Bump ratio of new parsercache key spec to 4 (T373037)]] (duration: 15m 21s) [12:10:33] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [12:10:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7002.magru.wmnet to plain [12:10:47] ok lets go to group0 now [12:10:58] if your patch is working :) [12:10:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 20%: repool', diff saved to https://phabricator.wikimedia.org/P71178 and previous config saved to /var/cache/conftool/dbconfig/20241126-121058-arnaudb.json [12:11:18] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 75%: repool', diff saved to https://phabricator.wikimedia.org/P71179 and previous config saved to /var/cache/conftool/dbconfig/20241126-121117-arnaudb.json [12:11:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh7002.wikimedia.org to plain [12:12:23] PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:12:27] PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:13:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7002.wikimedia.org to plain [12:13:21] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098011 (https://phabricator.wikimedia.org/T375664) [12:13:22] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098011 (https://phabricator.wikimedia.org/T375664) (owner: 10TrainBranchBot) [12:13:27] PROBLEM - Bird Internet Routing Daemon on durum7002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [12:13:27] PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum7002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [12:14:10] (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098011 (https://phabricator.wikimedia.org/T375664) (owner: 10TrainBranchBot) [12:14:27] RECOVERY - Bird Internet Routing Daemon on durum7002 is OK: 
PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [12:14:27] RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum7002 is OK: OK: UP (pid=2433) and all threads (8) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [12:15:42] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of prometheus7001.magru.wmnet to plain [12:16:23] RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:16:27] RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:16:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of prometheus7001.magru.wmnet to plain [12:19:37] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp7015 [12:19:51] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp7015 [12:20:08] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs7003 [12:20:24] !log failover Ganeti master in magru02 to ganeti7004 [12:20:25] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs7003 [12:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:30] (03PS1) 10Btullis: Revert "Failover analytics-hive to standby coordinator" [dns] - 10https://gerrit.wikimedia.org/r/1098012 [12:20:37] (03PS2) 10Btullis: Revert "Failover analytics-hive to standby coordinator" [dns] - 10https://gerrit.wikimedia.org/r/1098012 [12:20:51] !log robh@cumin2002 START - Cookbook sre.dns.netbox [12:22:27] PROBLEM - ganeti-wconfd running on ganeti7002 is CRITICAL: PROCS CRITICAL: 0 processes 
with UID = 111 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [12:23:04] (03CR) 10Btullis: [C:03+2] Revert "Failover analytics-hive to standby coordinator" [dns] - 10https://gerrit.wikimedia.org/r/1098012 (owner: 10Btullis) [12:23:09] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:25:47] hashar: which scap step in particular is long? because looking at sync-prod-k8s event duration in logstash over the past two weeks, median time is under 6 minutes and max time is 7.5 minutes [12:25:58] no clue [12:26:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 25%: repool', diff saved to https://phabricator.wikimedia.org/P71180 and previous config saved to /var/cache/conftool/dbconfig/20241126-122603-arnaudb.json [12:26:12] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.5 refs T375664 [12:26:16] T375664: 1.44.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T375664 [12:26:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 100%: repool', diff saved to https://phabricator.wikimedia.org/P71181 and previous config saved to /var/cache/conftool/dbconfig/20241126-122622-arnaudb.json [12:26:27] 06SRE, 06Infrastructure-Foundations: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731#10357268 (10MoritzMuehlenhoff) [12:26:51] the first run of the week is always super slow [12:27:02] but that is usually done via a cron job, which did not happen today [12:27:03] :) [12:27:22] I took a copy of the slowish 25 minute run from this morning, I will check [12:28:12] and now that group0 promotion took just 13 minutes [12:28:17] so it is random() [12:30:26] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host cp7015.magru.wmnet with OS bullseye [12:30:34] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - 
https://phabricator.wikimedia.org/T380307#10357281 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host cp7015.magru.wmnet with OS bul... [12:31:43] build-and-push-container-images (duration: 12m 16s) for the 11:00 utc run [12:33:37] 10:52:51 [mediawiki-publish] Running sudo /usr/local/bin/docker-pusher -q docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2024-11-26-104959-publish [12:33:39] 11:01:59 [mediawiki-publish-81] docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2024-11-26-104959-publish-81 [12:34:29] so in particular it's pushing the mediawiki image to the registry that took a long time [12:34:48] keep in mind because of the php upgrade we're building and pushing 2x the images [12:36:08] I think it is an aftermath of the first scap that did not fully complete [12:36:31] cause of the smoke test failure [12:36:32] and the 8.1 image was probably fully rebuilt because the image was updated yesterday [12:37:03] and when I ran it again, it merely resumed and still had to sync all the hosts that come after the canary [12:38:03] (03CR) 10Hnowlan: rest-gateway: order mw-api-int paths strictly (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087528 (https://phabricator.wikimedia.org/T379097) (owner: 10Hnowlan) [12:38:22] at least group0 looks fine [12:38:27] PROBLEM - ganeti-confd running on ganeti7002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [12:38:27] PROBLEM - ganeti-noded running on ganeti7002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [12:39:11] FIRING: [13x] ProbeDown: Service ganeti7002:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:41:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 50%: repool', diff saved to https://phabricator.wikimedia.org/P71182 and previous config saved to /var/cache/conftool/dbconfig/20241126-124109-arnaudb.json [12:46:05] PROBLEM - Host ganeti7001 is DOWN: PING CRITICAL - Packet loss = 100% [12:46:49] PROBLEM - Host cp7003 is DOWN: PING CRITICAL - Packet loss = 100% [12:47:26] (03PS1) 10Muehlenhoff: Remove ganeti role from ganeti7001/7002 [puppet] - 10https://gerrit.wikimedia.org/r/1098018 [12:47:29] PROBLEM - Host cp7002 is DOWN: PING CRITICAL - Packet loss = 100% [12:47:49] PROBLEM - Host cp7010 is DOWN: PING CRITICAL - Packet loss = 100% [12:47:49] PROBLEM - Host cp7004 is DOWN: PING CRITICAL - Packet loss = 100% [12:47:55] PROBLEM - Host dns7002 is DOWN: PING CRITICAL - Packet loss = 100% [12:48:27] PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:48:34] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp7015.magru.wmnet with OS bullseye [12:48:37] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10357361 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host cp7015.magru.wmnet with OS bullsey... 
[12:48:39] PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:48:57] FIRING: [7x] ProbeDown: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:49:09] PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:49:25] (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti role from ganeti7001/7002 [puppet] - 10https://gerrit.wikimedia.org/r/1098018 (owner: 10Muehlenhoff) [12:51:26] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on A:cp-upload_esams and A:cp for 9.2.6-1wm2 [12:52:08] assuming that the page for ncredir being down in magru is part of the work [12:52:42] FIRING: [3x] JobUnavailable: Reduced availability for job haproxy in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:53:02] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10357367 (10RobH) [12:53:21] fabfur: am I okay in that ^ assumption? 
[12:53:38] !log dcaro@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon1004.eqiad.wmnet with reason: host reimage [12:53:48] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on A:cp-text_esams and A:cp for 9.2.6-1wm2 [12:53:54] hnowlan: yes, the downtime is expired [12:53:57] FIRING: [10x] ProbeDown: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:54:01] PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100% [12:54:14] uh? [12:54:17] oh.. magru :) [12:54:37] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10357380 (10RobH) [12:54:52] * kamila_ is here but in a therapy session, since it doesn't look serious I'll ignore for now, please ping me if you need me to not ignore [12:55:59] (03PS1) 10Muehlenhoff: Remove LDAP access for wquarshie [puppet] - 10https://gerrit.wikimedia.org/r/1098020 [12:56:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 75%: repool', diff saved to https://phabricator.wikimedia.org/P71183 and previous config saved to /var/cache/conftool/dbconfig/20241126-125614-arnaudb.json [12:57:02] !incidents [12:57:02] 5475 (ACKED) [7x] ProbeDown sre (probes/service magru) [12:57:13] (03PS2) 10NMW03: Updated wordmark for Azerbaijani Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 [12:57:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (owner: 10NMW03) [12:57:42] !log dcaro@cumin1002 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon1004.eqiad.wmnet with reason: host reimage [12:58:25] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host lvs7003.magru.wmnet with OS bullseye [12:58:29] jouncebot: next [12:58:29] In 0 hour(s) and 1 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241126T1300) [12:58:31] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10357383 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host lvs7003.magru.wmnet with OS... [13:00:03] (03CR) 10Daniel Kinzler: rest-gateway: order mw-api-int paths strictly (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087528 (https://phabricator.wikimedia.org/T379097) (owner: 10Hnowlan) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241126T1300) [13:01:30] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1098020 (owner: 10Muehlenhoff) [13:01:40] (03PS1) 10Muehlenhoff: Remove site.pp entry for legacy irc.w.o nodes [puppet] - 10https://gerrit.wikimedia.org/r/1098022 [13:02:51] (03PS1) 10Slyngshede: P:idp enable JMX exporter [puppet] - 10https://gerrit.wikimedia.org/r/1098023 (https://phabricator.wikimedia.org/T380402) [13:03:10] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host dns7001.wikimedia.org with OS bookworm [13:03:12] (03CR) 10Muehlenhoff: [C:03+2] Remove site.pp entry for legacy irc.w.o nodes [puppet] - 10https://gerrit.wikimedia.org/r/1098022 (owner: 10Muehlenhoff) [13:03:19] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10357430 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS... [13:03:34] (03PS1) 10Elukey: admin: add Jimmy Ly's account [puppet] - 10https://gerrit.wikimedia.org/r/1098024 (https://phabricator.wikimedia.org/T380525) [13:03:57] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4599/co" [puppet] - 10https://gerrit.wikimedia.org/r/1098023 (https://phabricator.wikimedia.org/T380402) (owner: 10Slyngshede) [13:04:29] (03PS2) 10Elukey: admin: add Jimmy Ly's account [puppet] - 10https://gerrit.wikimedia.org/r/1098024 (https://phabricator.wikimedia.org/T380525) [13:06:32] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment & stats private data access for jly - https://phabricator.wikimedia.org/T380525#10357449 (10elukey) [13:07:25] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns7001.wikimedia.org with OS bookworm [13:07:28] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10357451 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS boo... [13:08:30] (03PS14) 10Hnowlan: mediawiki: add mercurius features [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) [13:08:41] (03CR) 10Hnowlan: mediawiki: add mercurius features (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [13:09:00] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment & stats private data access for jly - https://phabricator.wikimedia.org/T380525#10357455 (10elukey) @thcipriani Hi! 
I'd need your review to grant access to the Deployment group, lemme know your thoughts :) No approval required... [13:09:31] 06SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for Kgraessle - https://phabricator.wikimedia.org/T379173#10357459 (10elukey) [13:10:41] RECOVERY - Host cp7003 is UP: PING OK - Packet loss = 0%, RTA = 115.10 ms [13:10:45] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:10:47] PROBLEM - Ensure traffic_server is running for instance backend on cp7003 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:11:11] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:11:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 100%: repool', diff saved to https://phabricator.wikimedia.org/P71185 and previous config saved to /var/cache/conftool/dbconfig/20241126-131120-arnaudb.json [13:11:46] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp7015.magru.wmnet with OS bullseye [13:11:53] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10357465 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp7015.magru.wmnet with OS b... [13:12:26] (03CR) 10Elukey: [C:04-1] "The ssh key is being reused, stalling for the moment." 
[puppet] - 10https://gerrit.wikimedia.org/r/1098024 (https://phabricator.wikimedia.org/T380525) (owner: 10Elukey) [13:13:35] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:13:38] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment & stats private data access for jly - https://phabricator.wikimedia.org/T380525#10357467 (10elukey) @Jly Hi! You are currently using the same SSH key for both production (this request) and WMCS, so I'd ask you to create a new o... [13:13:57] RESOLVED: [10x] ProbeDown: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:14:11] FIRING: [23x] ProbeDown: Service ganeti7002:1811 has failed probes (tcp_ganeti_noded_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:15:01] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon1004.eqiad.wmnet with OS bullseye [13:15:13] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10357471 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1002 for host cloudcephmon1004.eqiad.... [13:17:06] FIRING: [23x] ProbeDown: Service ganeti7002:1811 has failed probes (tcp_ganeti_noded_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:17:12] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10357474 (10aborrero) p:05Triage→03Medium hey @Jhancock.wm @Jclark-ctr Do you know if this is concerning, and if we should be taking proactive acti... 
[13:17:37] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10357479 (10aborrero) a:03Jhancock.wm [13:17:42] FIRING: [3x] JobUnavailable: Reduced availability for job haproxy in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:18:01] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs7003.magru.wmnet with reason: host reimage [13:20:59] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs7003.magru.wmnet with reason: host reimage [13:21:28] !log cmooney@cumin1002 START - Cookbook sre.hosts.dhcp for host dns7001.wikimedia.org [13:21:53] RECOVERY - Host ganeti7001 is UP: PING OK - Packet loss = 0%, RTA = 115.22 ms [13:24:31] RECOVERY - Host cp7004 is UP: PING OK - Packet loss = 0%, RTA = 115.10 ms [13:24:49] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:24:55] PROBLEM - Ensure traffic_server is running for instance backend on cp7004 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:24:57] RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 115.30 ms [13:25:11] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:25:45] PROBLEM - ganeti-confd running on ganeti7002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-confd), command name ganeti-confd 
https://wikitech.wikimedia.org/wiki/Ganeti [13:25:49] PROBLEM - ganeti-noded running on ganeti7002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [13:26:32] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on cp7001.magru.wmnet with reason: T376737 [13:26:46] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cp7001.magru.wmnet with reason: T376737 [13:26:53] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on cp7006.magru.wmnet with reason: T376737 [13:27:06] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cp7006.magru.wmnet with reason: T376737 [13:27:13] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on cp7008.magru.wmnet with reason: T376737 [13:27:26] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cp7008.magru.wmnet with reason: T376737 [13:28:07] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on cp7002.magru.wmnet with reason: T376737 [13:28:21] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cp7002.magru.wmnet with reason: T376737 [13:28:32] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp7002.magru.wmnet with reason: T376737 [13:28:35] claime: about the lengthy scap run, the issue is the automatic presync did not run during the night and my first scap had to build everything (cdb, the two php images as you mentioned) [13:28:35] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp7002.magru.wmnet with reason: T376737 [13:28:39] 06SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for Kgraessle - https://phabricator.wikimedia.org/T379173#10357523 (10elukey) I followed up with @Kgraessle and the 
`analytics-privatedata-users` group seems to be the best one to access the Mariadb prod replicas (according to [[ https:... [13:28:47] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp7003.magru.wmnet with reason: T376737 [13:28:54] + some bug in scap caused it to auto-abort and forced me to restart the train sync, which added more overhead [13:29:01] so I don't think anything was slower than usual [13:29:10] it is just that all the build time normally happens from a timer [13:29:11] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp7003.magru.wmnet with reason: T376737 [13:29:19] :) [13:29:59] !log swift delete wikipedia-commons-local-public.bf b/bf/Schuur_-_Nieuwerbrug_-_20164513_-_RCE.jpg T380738 [13:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:04] T380738: Schuur - Nieuwerbrug - 20164513 - RCE.jpg inconsistent, needs new upload - https://phabricator.wikimedia.org/T380738 [13:32:07] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7015.magru.wmnet with reason: host reimage [13:32:59] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment & stats private data access for jly - https://phabricator.wikimedia.org/T380525#10357542 (10Jly) [13:33:17] (03PS1) 10Elukey: admins: add ssh access for user kgraessle [puppet] - 10https://gerrit.wikimedia.org/r/1098033 (https://phabricator.wikimedia.org/T379173) [13:33:39] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10357549 (10dcaro) 05Open→03Resolved [13:33:53] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment & stats private data access for jly - https://phabricator.wikimedia.org/T380525#10357555 (10Jly) @elukey Got it, I have updated the key now, please see [13:34:25] !log fabfur@cumin1002 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on cp7003.magru.wmnet with reason: T376737 [13:34:27] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp7003.magru.wmnet with reason: T376737 [13:34:34] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp7010.magru.wmnet with reason: T376737 [13:34:47] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp7010.magru.wmnet with reason: T376737 [13:35:19] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7015.magru.wmnet with reason: host reimage [13:35:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10357545 (10dcaro) 05Resolved→03Open Node up and running [13:38:20] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp7004.magru.wmnet with reason: T376737 [13:38:34] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp7004.magru.wmnet with reason: T376737 [13:38:59] 06SRE, 10SRE-swift-storage, 06Commons: Schuur - Nieuwerbrug - 20164513 - RCE.jpg inconsistent, needs new upload - https://phabricator.wikimedia.org/T380738#10357584 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon Fixed (thanks to @Ladsgroup for doing the necessary re-uploads once I'd deleted... 
[13:39:53] RECOVERY - PyBal backends health check on lvs7003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:40:17] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host dns7001.wikimedia.org [13:40:28] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10357597 (10Ladsgroup) Now 17 different deletion scripts are ongoing: https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&var-site=codfw&viewPanel=7&from=now-24h&to=now-1m {F57749... [13:40:59] RECOVERY - Host dns7002 is UP: PING OK - Packet loss = 0%, RTA = 115.17 ms [13:41:57] PROBLEM - Check if anycast-healthchecker and all configured threads are running on dns7002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:41:57] PROBLEM - Bird Internet Routing Daemon on dns7002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:41:59] (03PS1) 10Btullis: Add a thirdparty/bigtop33 component to reprepro [puppet] - 10https://gerrit.wikimedia.org/r/1098037 (https://phabricator.wikimedia.org/T380866) [13:42:03] RECOVERY - Host cp7002 is UP: PING OK - Packet loss = 0%, RTA = 115.13 ms [13:42:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job haproxy in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:43:25] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host dns7001.wikimedia.org with OS bullseye [13:43:29] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - 
https://phabricator.wikimedia.org/T380307#10357611 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host dns7001.wikimedia.org with O... [13:44:26] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:44:35] RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:44:57] RECOVERY - Bird Internet Routing Daemon on dns7002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:44:57] RECOVERY - Check if anycast-healthchecker and all configured threads are running on dns7002 is OK: OK: UP (pid=4735) and all threads (3) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:44:57] RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:45:15] PROBLEM - NTP peers and stratum check on dns7002 is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown, stratum=-1 (CRITICAL) https://wikitech.wikimedia.org/wiki/NTP [13:46:47] !log fabfur@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1002" [13:47:06] FIRING: [13x] ProbeDown: Service ganeti7002:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:48:31] (03CR) 10Anzx: [C:04-1] "Please follow the instructions in 
https://gerrit.wikimedia.org/g/operations/mediawiki-config/%2B/refs/heads/master/logos/ for generating n" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (owner: 10NMW03) [13:49:15] !log ladsgroup@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1020.eqiad.wmnet,service=s8 [13:49:34] 06SRE, 10SRE-swift-storage, 06Commons: Schuur - Nieuwerbrug - 20164513 - RCE.jpg inconsistent, needs new upload - https://phabricator.wikimedia.org/T380738#10357646 (10Jeff_G) My version of 13:43 seems to have less compression, but Ladsgroup's of 13:33 would be fine, too. [13:49:36] (03CR) 10Elukey: [C:03+1] Add komla to wmcs-roots [puppet] - 10https://gerrit.wikimedia.org/r/1087919 (https://phabricator.wikimedia.org/T379159) (owner: 10FNegri) [13:49:45] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:49:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:49:54] !log fabfur@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1002" [13:49:55] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs7003.magru.wmnet with OS bullseye [13:50:00] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10357660 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host lvs7003.magru.wmnet with OS bull... 
[13:50:29] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [13:50:37] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for dbrant - https://phabricator.wikimedia.org/T379678#10357644 (10elukey) Reached out on Slack to verify the ssh key. [13:51:41] (03CR) 10Dreamrimmer: [C:03+1] enwiki: add "mergehistory" to "import" user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097402 (https://phabricator.wikimedia.org/T380753) (owner: 10Novem Linguae) [13:51:41] RECOVERY - Host mr1-esams.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 96.13 ms [13:52:24] 06SRE, 10SRE-Access-Requests, 10cloud-services-team (FY2024/2025-Q1-Q2), 13Patch-For-Review: Add permissions for Komla to run WMCS cookbooks - https://phabricator.wikimedia.org/T379159#10357672 (10elukey) Reached out to Joanna to confirm the user to the group, but it LGTM. [13:54:17] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb1020.eqiad.wmnet with reason: Reclone (T379724) [13:54:21] T379724: s8 replication on an-redacteddb1001 is broken - https://phabricator.wikimedia.org/T379724 [13:54:30] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1020.eqiad.wmnet with reason: Reclone (T379724) [13:54:40] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet with reason: Reclone (T379724) [13:54:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:54:54] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet with reason: Reclone (T379724) [13:55:15] RECOVERY - NTP peers and stratum check on dns7002 is OK: NTP OK: Offset 
-2.8024e-05 secs, stratum=1 https://wikitech.wikimedia.org/wiki/NTP [13:55:19] (03CR) 10FNegri: [C:04-1] "Thanks Elukey for the +1, I'm adding a -1 because Jobo would prefer more granular permission levels. We have a meeting scheduled for next " [puppet] - 10https://gerrit.wikimedia.org/r/1087919 (https://phabricator.wikimedia.org/T379159) (owner: 10FNegri) [13:55:31] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 227.76 ms [13:56:17] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for dbrant - https://phabricator.wikimedia.org/T379678#10357699 (10elukey) >>! In T379678#10357644, @elukey wrote: > Reached out on Slack to verify the ssh key. Aaand got a confirmation, so we can proceed. [13:56:40] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for dbrant - https://phabricator.wikimedia.org/T379678#10357703 (10elukey) [13:56:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:58:28] (03PS1) 10Elukey: admin: add dbrant to deployment [puppet] - 10https://gerrit.wikimedia.org/r/1098041 (https://phabricator.wikimedia.org/T379678) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241126T1400). [14:00:05] matthiasmullie, tgr, and Nemoralis: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:11] o/ [14:00:13] I don’t think I can deploy today, sorry [14:00:15] o/ [14:01:15] !log fabfur@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1002" [14:01:34] !log fabfur@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1002" [14:01:35] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7015.magru.wmnet with OS bullseye [14:01:39] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10357733 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp7015.magru.wmnet with OS bulls... [14:01:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:02:16] 06SRE, 10SRE-Access-Requests, 10cloud-services-team (FY2024/2025-Q1-Q2), 13Patch-For-Review: Add permissions for Komla to run WMCS cookbooks - https://phabricator.wikimedia.org/T379159#10357729 (10fnegri) 05Open→03Stalled Joanna is out sick, but I discussed this with her and we have a team-wide meeting... [14:02:28] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10357735 (10Jclark-ctr) @aborrero i have updated Idrac firmware. I assume Dell will want me to update bios firmware which will require rebo... 
[14:02:32] o/ [14:05:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [14:05:58] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns7001.wikimedia.org with reason: host reimage [14:06:59] Well, I guess I can start deploying my patch myself. Anyone around for the others in this slot? [14:07:11] yes please do :) [14:07:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mlitn@deploy2002 using scap backport" [extensions/UploadWizard] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1097983 (owner: 10Matthias Mullie) [14:07:20] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [14:07:25] I suppose I can deploy the other two [14:07:38] Great. I'll ping you when I'm done! [14:07:42] mine is just a svg update [14:08:13] Nemoralis: your change had a CodeReview -1 asking for some change [14:08:26] Nemoralis: I guess you need to run a script [14:08:33] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [14:08:35] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [14:08:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [14:09:06] hashar: thanks! 
I didn't know I had to run a script [14:09:15] and thanks anoop if you are here [14:09:23] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns7001.wikimedia.org with reason: host reimage [14:09:36] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [14:09:44] Nemoralis: you can also remove the link to imgur.com since the screenshot will eventually disappear at some point in the future ;) [14:10:19] yeah sure [14:12:57] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1169 - https://phabricator.wikimedia.org/T379856#10357802 (10Jclark-ctr) a:03Jclark-ctr Confirmed: Service Request 201596930 was successfully submitted. [14:13:28] (03PS1) 10Giuseppe Lavagetto: Grid view for object lists [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1098044 [14:13:40] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Grid view for object lists [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1098044 (owner: 10Giuseppe Lavagetto) [14:13:57] !log oblivian@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Add grid view - oblivian@cumin1002" [14:13:59] !log oblivian@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Add grid view - oblivian@cumin1002 [14:14:16] Nemoralis FWIW you can upload patch-related files at https://phabricator.wikimedia.org/file/upload/ and then you don't have to worry about hosting policies [14:14:25] PROBLEM - Recursive DNS on 195.200.68.4 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [14:14:29] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Add grid view - oblivian@cumin1002 [14:14:30] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma 
deployment to the alerting hosts with reason: "Add grid view - oblivian@cumin1002" [14:14:42] tgr|away: thanks [14:15:56] (03PS1) 10Joely Rooke WMDE: Remove feature flag which controls wikibase item link location [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098045 (https://phabricator.wikimedia.org/T377809) [14:18:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [14:18:55] (03Merged) 10jenkins-bot: Fix incorrect 'this' [extensions/UploadWizard] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1097983 (owner: 10Matthias Mullie) [14:19:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [14:19:17] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wdqs1026.eqiad.wmnet with OS bullseye [14:19:22] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye [14:19:22] !log mlitn@deploy2002 Started scap sync-world: Backport for [[gerrit:1097983|Fix incorrect 'this']] [14:19:23] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wdqs1027.eqiad.wmnet with OS bullseye [14:19:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 5 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10357819 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wdqs1026.eqiad.wmnet with OS bullseye [14:19:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 5 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10357820 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wdqs1025.eqiad.wmnet with OS bullseye [14:19:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 5 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10357822 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wdqs1027.eqiad.wmnet with OS bullseye [14:19:53] tgr|away: you can ignore my patch, I can't update the patch now [14:20:14] I have to go somewhere, sorry [14:20:30] ack [14:21:05] PROBLEM - Recursive DNS on 2a02:ec80:700:1:195:200:68:4 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [14:21:11] I will update it by the late backport window [14:21:42] FIRING: [2x] JobUnavailable: Reduced availability for job haproxy in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:25:05] !log mlitn@deploy2002 mlitn: Backport for [[gerrit:1097983|Fix incorrect 'this']] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:25:08] !log mlitn@deploy2002 mlitn: Continuing with sync [14:26:42] FIRING: [2x] JobUnavailable: Reduced availability for job haproxy in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:28:03] RECOVERY - Recursive DNS on 2a02:ec80:700:1:195:200:68:4 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [14:29:48] ooh, what is "hiddenparma"? 
/me hopes it's cheese-related [14:31:09] * urandom hopes it doesn't remain hidden [14:31:27] RECOVERY - Recursive DNS on 195.200.68.4 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [14:31:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job haproxy in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:31:43] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to codfw RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [14:31:58] !log mlitn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1097983|Fix incorrect 'this']] (duration: 12m 36s) [14:32:16] inflatador: the repo name (code name?) for requestctl.wikimedia.org [14:34:22] tgr|away: I'm done; all yours! 
[14:34:25] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment & stats private data access for jly - https://phabricator.wikimedia.org/T380525#10357856 (10acooper) Approved [14:34:56] thx [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:43] RESOLVED: [2x] IPv6AnchorUnreachable: ipv6 ping to codfw RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [14:36:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:37:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1095082 (https://phabricator.wikimedia.org/T380575) (owner: 10Gergő Tisza) [14:38:36] (03Merged) 10jenkins-bot: Allow simulating the SUL3 shared domain settings via env var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1095082 (https://phabricator.wikimedia.org/T380575) (owner: 10Gergő Tisza) [14:39:01] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1095082|Allow simulating the SUL3 shared domain settings via env var (T380575)]] [14:39:06] T380575: Make SUL3 authentication domain mode available from CLI - https://phabricator.wikimedia.org/T380575 [14:40:09] !log fabfur@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1002" 
[14:41:24] kamila_ nice, thanks for the info! [14:43:04] !log fabfur@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1002" [14:43:05] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns7001.wikimedia.org with OS bullseye [14:43:09] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10357889 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host dns7001.wikimedia.org with OS bu... [14:43:46] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host dns7001.wikimedia.org with OS bookworm [14:43:54] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10357890 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host dns7001.wikimedia.org with O... 
[14:44:53] !log tgr@deploy2002 tgr: Backport for [[gerrit:1095082|Allow simulating the SUL3 shared domain settings via env var (T380575)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:44:57] T380575: Make SUL3 authentication domain mode available from CLI - https://phabricator.wikimedia.org/T380575 [14:47:03] PROBLEM - Host doh7002 is DOWN: PING CRITICAL - Packet loss = 100% [14:47:03] PROBLEM - Host ncredir7002 is DOWN: PING CRITICAL - Packet loss = 100% [14:47:07] PROBLEM - Host prometheus7001 is DOWN: PING CRITICAL - Packet loss = 100% [14:47:11] PROBLEM - Host bast7001 is DOWN: PING CRITICAL - Packet loss = 100% [14:47:15] PROBLEM - Host ganeti7004 is DOWN: PING CRITICAL - Packet loss = 100% [14:47:15] PROBLEM - Host durum7002 is DOWN: PING CRITICAL - Packet loss = 100% [14:47:35] PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:48:03] PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:51:25] FIRING: SystemdUnitFailed: netbox_ganeti_magru02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:51:39] PROBLEM - Host 2a02:ec80:700:1:195:200:68:4 is DOWN: CRITICAL - Host Unreachable (2a02:ec80:700:1:195:200:68:4) [14:51:58] (03PS5) 10DCausse: wdqs: add graph_name in query logs [puppet] - 10https://gerrit.wikimedia.org/r/1084193 (https://phabricator.wikimedia.org/T376134) [14:52:47] RECOVERY - Host ganeti7004 is UP: PING OK - Packet loss = 0%, RTA = 115.21 ms [14:53:31] PROBLEM - Recursive DNS on 195.200.68.4 is CRITICAL: DNS_QUERY CRITICAL - query timed out 
https://wikitech.wikimedia.org/wiki/DNS [14:55:24] (03CR) 10DCausse: "should be ready to go" [puppet] - 10https://gerrit.wikimedia.org/r/1084193 (https://phabricator.wikimedia.org/T376134) (owner: 10DCausse) [14:55:29] I assume these are due to the magru work? [14:55:33] RECOVERY - Host doh7002 is UP: PING OK - Packet loss = 0%, RTA = 115.80 ms [14:55:33] RECOVERY - Host ncredir7002 is UP: PING OK - Packet loss = 0%, RTA = 115.62 ms [14:55:37] RECOVERY - Host prometheus7001 is UP: PING OK - Packet loss = 0%, RTA = 115.12 ms [14:55:41] RECOVERY - Host bast7001 is UP: PING OK - Packet loss = 0%, RTA = 115.96 ms [14:55:47] RECOVERY - Host durum7002 is UP: PING OK - Packet loss = 0%, RTA = 115.55 ms [14:56:45] !log tgr@deploy2002 tgr: Continuing with sync [14:56:47] PROBLEM - Bird Internet Routing Daemon on durum7002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:56:47] PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum7002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [14:57:47] PROBLEM - Bird Internet Routing Daemon on doh7002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:57:47] PROBLEM - Check if anycast-healthchecker and all configured threads are running on doh7002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [14:58:09] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh[7001-7002].wikimedia.org,durum[7001-7002].magru.wmnet with reason: site is depooled, maintenance [14:58:25] 
!log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh[7001-7002].wikimedia.org,durum[7001-7002].magru.wmnet with reason: site is depooled, maintenance [14:58:32] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on doh[7001-7002].wikimedia.org,durum[7001-7002].magru.wmnet with reason: site is depooled, maintenance [14:58:34] RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:58:37] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on doh[7001-7002].wikimedia.org,durum[7001-7002].magru.wmnet with reason: site is depooled, maintenance [14:58:48] RECOVERY - Bird Internet Routing Daemon on durum7002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:58:48] RECOVERY - Bird Internet Routing Daemon on doh7002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:58:48] RECOVERY - Check if anycast-healthchecker and all configured threads are running on doh7002 is OK: OK: UP (pid=2330) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [14:58:48] RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum7002 is OK: OK: UP (pid=2357) and all threads (8) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [14:59:04] RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:59:05] thank you sukhe <3 [14:59:26] sorry about the noise. but trying to balance the fine line between getting alerts so we know what's going on vs them being spammy. 
[15:03:19] jouncebot: nowandnext [15:03:19] No deployments scheduled for the next 0 hour(s) and 56 minute(s) [15:03:19] In 0 hour(s) and 56 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241126T1600) [15:03:39] (03CR) 10DCausse: [C:03+2] rdf-streaming-updater: produce rdf_change v2 events [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092191 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [15:03:48] dcausse: just finishing scap [15:04:06] tgr|away: ack [15:04:22] (my deploy is unrelated to mw tho) [15:05:00] (03Merged) 10jenkins-bot: rdf-streaming-updater: produce rdf_change v2 events [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092191 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [15:05:25] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1095082|Allow simulating the SUL3 shared domain settings via env var (T380575)]] (duration: 26m 23s) [15:05:30] T380575: Make SUL3 authentication domain mode available from CLI - https://phabricator.wikimedia.org/T380575 [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:46] (03CR) 10Btullis: wdqs102[567]: install OS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1097564 (https://phabricator.wikimedia.org/T378030) (owner: 10Bking) [15:07:54] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns7001.wikimedia.org with reason: host reimage [15:08:24] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:08:43] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:09:32] jouncebot: nowandnext [15:09:32] No deployments scheduled 
for the next 0 hour(s) and 50 minute(s) [15:09:33] In 0 hour(s) and 50 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241126T1600) [15:11:02] !log UTC afternoon deploys done [15:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:06] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns7001.wikimedia.org with reason: host reimage [15:11:41] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1098037 (https://phabricator.wikimedia.org/T380866) (owner: Btullis) [15:12:14] (CR) Clément Goubert: [C:+2] Revert^2 "wikikube: Add wikikube-worker13[13-28]" [puppet] - https://gerrit.wikimedia.org/r/1097981 (owner: Clément Goubert) [15:13:20] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1098041 (https://phabricator.wikimedia.org/T379678) (owner: Elukey) [15:15:23] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1098033 (https://phabricator.wikimedia.org/T379173) (owner: Elukey) [15:15:41] (CR) Muehlenhoff: [C:+2] Remove LDAP access for wquarshie [puppet] - https://gerrit.wikimedia.org/r/1098020 (owner: Muehlenhoff) [15:16:10] PROBLEM - Recursive DNS on 195.200.68.4 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [15:16:44] !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [15:16:59] !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [15:18:26] ^ that's magru as well [15:19:11] (CR) Btullis: [C:+2] Add a thirdparty/bigtop33 component to reprepro [puppet] - https://gerrit.wikimedia.org/r/1098037 (https://phabricator.wikimedia.org/T380866) (owner: Btullis) [15:19:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node 
ganeti1020.eqiad.wmnet [15:20:04] ops-eqiad, SRE, Cloud-Services, DC-Ops, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10357974 (VRiley-WMF) Has this still been performing as expected? If so, are we able to close it? [15:20:06] SRE, Ganeti, Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10357975 (ops-monitoring-bot) Draining ganeti1020.eqiad.wmnet of running VMs [15:21:25] RESOLVED: SystemdUnitFailed: netbox_ganeti_magru02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:21:42] FIRING: JobUnavailable: Reduced availability for job pdnsrec in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:21:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1020.eqiad.wmnet [15:22:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1020.eqiad.wmnet [15:22:49] SRE, Ganeti, Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10357981 (ops-monitoring-bot) Draining ganeti1020.eqiad.wmnet of running VMs [15:24:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1020.eqiad.wmnet [15:25:11] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [15:25:32] sre-alert-triage, Machine-Learning-Team: Alert in need of triage: HelmfileAdminNGPendingChanges (instance 
deploy1003:9100) - https://phabricator.wikimedia.org/T380024#10357986 (isarantopoulos) Open→Resolved [15:26:10] PROBLEM - Recursive DNS on 2a02:ec80:700:1:195:200:68:4 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [15:26:42] FIRING: [2x] JobUnavailable: Reduced availability for job haproxy in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:27:34] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [15:28:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:30:05] ops-eqiad, SRE, Cloud-Services, DC-Ops, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10357993 (dcaro) Looks good on my side 👍 [15:33:30] ops-eqiad, SRE, Cloud-Services, DC-Ops, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10357997 (VRiley-WMF) Open→Resolved [15:33:32] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.pool db2215 gradually with 4 steps - Maint over [15:33:36] !log ladsgroup@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db2215 gradually with 4 steps - Maint over [15:33:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:34:01] (CR) Majavah: [C:+2] dynamicproxy: Bind Redis on IPv6 [puppet] - 
https://gerrit.wikimedia.org/r/1091849 (https://phabricator.wikimedia.org/T379175) (owner: Majavah) [15:34:23] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1098045 (https://phabricator.wikimedia.org/T377809) (owner: Joely Rooke WMDE) [15:34:43] !log installing wireshark security updates [15:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2042.codfw.wmnet [15:37:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:39:31] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1026.eqiad.wmnet with OS bullseye [15:39:35] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1025.eqiad.wmnet with OS bullseye [15:39:41] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1027.eqiad.wmnet with OS bullseye [15:39:51] ops-eqiad, SRE, DC-Ops, Discovery-Search, and 5 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10358021 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wdqs1026.eqiad.wmnet with OS bullseye executed with errors... [15:39:58] ops-eqiad, SRE, DC-Ops, Discovery-Search, and 5 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10358022 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wdqs1025.eqiad.wmnet with OS bullseye executed with errors... 
[15:40:05] (PS1) Majavah: dynamicproxy: Fix listen on old bullseye hosts [puppet] - https://gerrit.wikimedia.org/r/1098066 [15:40:10] RECOVERY - Recursive DNS on 195.200.68.4 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [15:40:10] RECOVERY - Recursive DNS on 2a02:ec80:700:1:195:200:68:4 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [15:40:11] ops-eqiad, SRE, DC-Ops, Discovery-Search, and 5 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10358023 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wdqs1027.eqiad.wmnet with OS bullseye executed with errors... [15:40:42] (CR) CI reject: [V:-1] dynamicproxy: Fix listen on old bullseye hosts [puppet] - https://gerrit.wikimedia.org/r/1098066 (owner: Majavah) [15:41:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2042.codfw.wmnet [15:41:42] RESOLVED: JobUnavailable: Reduced availability for job pdnsrec in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:41:54] (PS2) Majavah: dynamicproxy: Fix listen on old bullseye hosts [puppet] - https://gerrit.wikimedia.org/r/1098066 [15:42:05] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns7001.wikimedia.org with OS bookworm [15:42:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:42:51] (CR) Majavah: [C:+2] dynamicproxy: Fix listen on old bullseye hosts [puppet] - https://gerrit.wikimedia.org/r/1098066 (owner: Majavah) [15:42:52] ops-magru, 
DC-Ops, Infrastructure-Foundations, Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10358029 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host dns7001.wikimedia.org with OS bo... [15:42:53] SRE, Infrastructure-Foundations: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731#10358028 (MoritzMuehlenhoff) [15:45:54] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1321.eqiad.wmnet with OS bookworm [15:46:26] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:46:32] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1322.eqiad.wmnet with OS bookworm [15:46:34] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:46:36] (CR) Muehlenhoff: [C:+2] Deprecate system::role for more Data Engineering roles [puppet] - https://gerrit.wikimedia.org/r/1094448 (owner: Muehlenhoff) [15:47:28] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1323.eqiad.wmnet with OS bookworm [15:47:57] (CR) Muehlenhoff: [C:+2] Deprecate system::role for Openstack roles [puppet] - https://gerrit.wikimedia.org/r/1094434 (owner: Muehlenhoff) [15:48:11] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1324.eqiad.wmnet with OS bookworm [15:48:48] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1325.eqiad.wmnet with OS bookworm [15:49:19] !log cgoubert@cumin1002 START - 
Cookbook sre.hosts.reimage for host wikikube-worker1326.eqiad.wmnet with OS bookworm [15:49:51] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1327.eqiad.wmnet with OS bookworm [15:50:53] SRE, Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10358056 (MoritzMuehlenhoff) [15:52:19] !log installing intel-microcode security updates [15:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:22] (CR) JMeybohm: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4600/co" [puppet] - https://gerrit.wikimedia.org/r/1097333 (https://phabricator.wikimedia.org/T378944) (owner: Elukey) [15:53:05] (CR) Elukey: [C:+2] admins: add ssh access for user kgraessle [puppet] - https://gerrit.wikimedia.org/r/1098033 (https://phabricator.wikimedia.org/T379173) (owner: Elukey) [15:53:13] (CR) Elukey: [C:+2] admin: add dbrant to deployment [puppet] - https://gerrit.wikimedia.org/r/1098041 (https://phabricator.wikimedia.org/T379678) (owner: Elukey) [15:56:58] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment shell access for Kgraessle - https://phabricator.wikimedia.org/T379173#10358074 (elukey) Open→Resolved a:elukey Merged! The new access permissions will be deployed during the next hour by puppet on all the sta... [15:56:59] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment shell access for Kgraessle - https://phabricator.wikimedia.org/T379173#10358077 (elukey) [15:58:26] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for dbrant - https://phabricator.wikimedia.org/T379678#10358064 (elukey) Open→Resolved a:elukey Merged! Puppet needs to run in various hosts to propagate the permissions but in ~1hour we should be good. 
Closin... [15:59:43] (CR) Alexandros Kosiaris: [C:+2] rest-gateway: order mw-api-int paths strictly (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1087528 (https://phabricator.wikimedia.org/T379097) (owner: Hnowlan) [16:00:04] eoghan, jelto, arnoldokoth, and mutante: Time to do the SRE Collaboration Services office hours deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241126T1600). [16:02:16] PROBLEM - Host dns7002 is DOWN: PING CRITICAL - Packet loss = 100% [16:03:14] PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:03:28] (PS1) C. Scott Ananian: Enable ParserMigration compact indicator on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1098076 (https://phabricator.wikimedia.org/T363484) [16:03:40] PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:06:42] FIRING: JobUnavailable: Reduced availability for job pdnsrec in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:07:24] SRE, LDAP-Access-Requests: Grant Access to wmf for Sspalding - https://phabricator.wikimedia.org/T380820#10358123 (elukey) Confirmed it is legit after a chat on Slack :) [16:09:52] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.075e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [16:10:40] PROBLEM - BGP status on asw1-b3-magru.mgmt is 
CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:10:58] FIRING: [10x] ProbeDown: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:11:17] sigh [16:11:23] !incidents [16:11:23] 5477 (ACKED) [10x] ProbeDown sre (probes/service magru) [16:11:23] 5475 (RESOLVED) [7x] ProbeDown sre (probes/service magru) [16:11:41] creating a silence for 6 hours [16:11:42] FIRING: [2x] JobUnavailable: Reduced availability for job haproxy in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:11:43] hello [16:11:43] we can remove it later [16:11:47] rzl: this is expected [16:11:48] rzl: it's me [16:11:48] sorry for the noise [16:12:05] I couldn't resist lol [16:12:09] thanks <3 reading back now [16:13:21] silence created for 8 hours [16:14:17] ops-eqiad, SRE, Cloud-Services, DC-Ops, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10358138 (cmooney) Resolved→Open >>! In T380503#10357974, @VRiley-WMF wrote: > Has this still been performing as... [16:14:41] (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1098079 [16:15:13] SRE, LDAP-Access-Requests: Grant Access to wmf for Sspalding - https://phabricator.wikimedia.org/T380820#10358181 (elukey) ` elukey@mwmaint1002:~$ sudo ldapsearch -x cn=wmf | grep sspalding member: uid=sspalding,ou=people,dc=wikimedia,dc=org ` You should now be able to access https://turnilo.wikimedia.o... 
[16:15:32] RECOVERY - Host cp7010 is UP: PING OK - Packet loss = 0%, RTA = 115.40 ms [16:16:21] (CR) Subramanya Sastry: rest-gateway: order mw-api-int paths strictly (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1087528 (https://phabricator.wikimedia.org/T379097) (owner: Hnowlan) [16:16:42] FIRING: [3x] JobUnavailable: Reduced availability for job haproxy in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:17:42] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:17:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:19:11] FIRING: [22x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:20:41] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1323.eqiad.wmnet with OS bookworm [16:20:57] RESOLVED: [10x] ProbeDown: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:21:25] ops-codfw, ops-eqiad, SRE, SRE-swift-storage, DC-Ops: thanos-be1005 and thanos-be2005 serial console not available over ssh - https://phabricator.wikimedia.org/T380883 (MatthewVernon) NEW [16:21:42] FIRING: [3x] JobUnavailable: Reduced availability for job haproxy in ops@magru - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:22:40] (CR) Ssingh: [C:+2] resolvconf: don't update resolv.conf with 0 nameservers [puppet] - https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: Andrew Bogott) [16:22:40] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1323.eqiad.wmnet with OS bookworm [16:23:40] (CR) Btullis: [C:+1] wdqs-internal: bring graph split into production [puppet] - https://gerrit.wikimedia.org/r/1094074 (https://phabricator.wikimedia.org/T380555) (owner: Ryan Kemper) [16:25:06] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission restbase202[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T380790#10358247 (Papaul) [16:26:51] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1322.eqiad.wmnet with OS bookworm [16:26:59] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission restbase202[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T380790#10358251 (Papaul) Open→Resolved complete [16:27:23] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1324.eqiad.wmnet with OS bookworm [16:27:37] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1325.eqiad.wmnet with OS bookworm [16:27:49] (CR) Brouberol: [C:+1] datahub: add datahub production index prefix [deployment-charts] - https://gerrit.wikimedia.org/r/1097372 (https://phabricator.wikimedia.org/T377814) (owner: Stevemunene) [16:27:50] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1326.eqiad.wmnet with OS bookworm [16:28:04] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage 
(exit_code=99) for host wikikube-worker1327.eqiad.wmnet with OS bookworm [16:28:32] !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1321.eqiad.wmnet with OS bookworm [16:29:10] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1321.eqiad.wmnet with OS bookworm [16:29:33] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1322.eqiad.wmnet with OS bookworm [16:29:53] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1324.eqiad.wmnet with OS bookworm [16:30:11] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1325.eqiad.wmnet with OS bookworm [16:30:29] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1326.eqiad.wmnet with OS bookworm [16:30:47] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1327.eqiad.wmnet with OS bookworm [16:32:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:35:11] (CR) Subramanya Sastry: [C:+1] Enable ParserMigration compact indicator on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1098076 (https://phabricator.wikimedia.org/T363484) (owner: C. Scott Ananian) [16:37:27] ops-codfw, ops-eqiad, SRE, SRE-swift-storage, DC-Ops: thanos-be1005 and thanos-be2005 serial console not available over ssh - https://phabricator.wikimedia.org/T380883#10358327 (elukey) @MatthewVernon it works for me, I think you are missing a `start`: elukey@cumin1002:~$ ssh root@thanos-be2... 
[16:40:43] ops-codfw, ops-eqiad, SRE, SRE-swift-storage, DC-Ops: thanos-be1005 and thanos-be2005 serial console not available over ssh - https://phabricator.wikimedia.org/T380883#10358331 (MatthewVernon) Open→Invalid Oh, bother, yes, my bad, I'd obviously mis-noted-down the rune. Apologies f... [16:40:51] !log `mwscript-k8s -f userOptions.php -- --wiki=enwiki --old=control --delete 'growthexperiments-homepage-variant'` # T379146, T377631 [16:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:02] T379146: Review remaining growthexperiments-homepage-variant rows at enwiki - https://phabricator.wikimedia.org/T379146 [16:41:02] T377631: Add a link (Structured task): Release to a subset of newcomers on English Wikipedia - https://phabricator.wikimedia.org/T377631 [16:41:29] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1323.eqiad.wmnet with reason: host reimage [16:42:37] (CR) JMeybohm: [V:+1 C:+1] "Can't speak to timeout and keepalive values. 
But the change lgtm" [puppet] - https://gerrit.wikimedia.org/r/1097333 (https://phabricator.wikimedia.org/T378944) (owner: Elukey) [16:42:49] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:42:58] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1321.eqiad.wmnet with OS bookworm [16:43:03] (CR) Stevemunene: [C:+1] airflow-wmde: stop managing the airflow instance via puppet [puppet] - https://gerrit.wikimedia.org/r/1097308 (https://phabricator.wikimedia.org/T380622) (owner: Brouberol) [16:45:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:45:14] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1323.eqiad.wmnet with reason: host reimage [16:45:18] ops-codfw, SRE, cloud-services-team, Data-Persistence, and 3 others: Decommission clouddb2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T369308#10358358 (Papaul) [16:46:08] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1321.eqiad.wmnet with OS bookworm [16:48:17] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1322.eqiad.wmnet with reason: host reimage [16:48:51] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1324.eqiad.wmnet with reason: host reimage [16:49:00] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1325.eqiad.wmnet with reason: host reimage [16:49:21] ops-codfw, SRE, cloud-services-team, Data-Persistence, and 3 others: Decommission clouddb2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T369308#10358361 (Papaul) Open→Resolved a:Papaul [16:49:21] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1326.eqiad.wmnet with reason: host reimage [16:49:45] !log cgoubert@cumin1002 START - Cookbook 
sre.hosts.downtime for 2:00:00 on wikikube-worker1327.eqiad.wmnet with reason: host reimage [16:51:02] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1322.eqiad.wmnet with reason: host reimage [16:53:14] (PS1) JMeybohm: k8s.reboot-nodes: Limit allowed aliases to those of the k8s cluster [cookbooks] - https://gerrit.wikimedia.org/r/1098091 [16:53:34] RECOVERY - MariaDB Replica Lag: s8 on an-redacteddb1001 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:53:34] RECOVERY - MariaDB Replica SQL: s8 on an-redacteddb1001 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:54:23] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1327.eqiad.wmnet with reason: host reimage [16:54:52] SRE, Infrastructure-Foundations: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731#10358380 (Andrew) [16:58:07] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1324.eqiad.wmnet with reason: host reimage [16:59:19] !log jayme@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on D{wikikube-ctrl100[1-3].eqiad.wmnet} and (A:wikikube-worker-eqiad or A:wikikube-master-eqiad) [17:00:05] jhathaway and rzl: May I have your attention please! Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241126T1700) [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:00:59] SRE, Infrastructure-Foundations: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731#10358450 (ops-monitoring-bot) Started rebooting nodes in wikikube-eqiad cluster: * wikikube-ctrl[1001-1003].eqiad.wmnet [17:01:48] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1326.eqiad.wmnet with reason: host reimage [17:02:16] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:02:19] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:02:28] that's me rebooting [17:03:47] SRE, collaboration-services: gitlab runners don't have the apt.wikimedia.org key - https://phabricator.wikimedia.org/T380164#10358493 (Dzahn) I was just trying to answer questions why this is the case and which of the 2 options is the newer/right way to handle it. 
[17:03:59] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1323.eqiad.wmnet with OS bookworm [17:04:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1325.eqiad.wmnet with reason: host reimage [17:05:13] !log ladsgroup@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=s8 [17:06:41] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1321.eqiad.wmnet with reason: host reimage [17:09:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1321.eqiad.wmnet with reason: host reimage [17:10:18] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1322.eqiad.wmnet with OS bookworm [17:11:18] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:11:18] PROBLEM - PyBal backends health check on lvs7003 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb_80: Servers cp7012.magru.wmnet, cp7014.magru.wmnet, cp7016.magru.wmnet are marked down but pooled: uploadlb_443: Servers cp7012.magru.wmnet, cp7014.magru.wmnet, cp7016.magru.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:11:30] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:13:42] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1327.eqiad.wmnet with OS bookworm [17:14:16] (PS7) Fabfur: benthos: add benthos for haproxy debug functions [puppet] - https://gerrit.wikimedia.org/r/1093413 
(https://phabricator.wikimedia.org/T329332) [17:16:19] (PS1) Brouberol: airflow: add kerberos-related environment variables [deployment-charts] - https://gerrit.wikimedia.org/r/1098094 (https://phabricator.wikimedia.org/T380765) [17:17:07] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1324.eqiad.wmnet with OS bookworm [17:17:30] (PS4) Fabfur: hiera: add log ring to cp4039 [puppet] - https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) [17:20:09] (PS1) Andrew Bogott: Remove ceph references to cloudcephosd100[1-3] [puppet] - https://gerrit.wikimedia.org/r/1098095 (https://phabricator.wikimedia.org/T380893) [17:20:11] (PS1) Andrew Bogott: Remove refs to cloudcephmon100[1-3] [puppet] - https://gerrit.wikimedia.org/r/1098096 (https://phabricator.wikimedia.org/T380893) [17:20:20] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:20:22] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:20:30] (CR) Fabfur: "Edited parent commit to address changes in the correct CR and rebased this" [puppet] - https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) (owner: Fabfur) [17:20:57] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1326.eqiad.wmnet with OS bookworm [17:23:08] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1325.eqiad.wmnet with OS bookworm [17:25:29] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on D{wikikube-ctrl100[1-3].eqiad.wmnet} and 
(A:wikikube-worker-eqiad or A:wikikube-master-eqiad) [17:28:55] 10ops-magru: PowerSupplyFailure - https://phabricator.wikimedia.org/T380897 (10phaultfinder) 03NEW [17:28:57] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1321.eqiad.wmnet with OS bookworm [17:31:55] !log homer 'lsw1-f7-eqiad*' commit 'T380350' [17:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:00] T380350: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350 [17:32:56] !log homer 'lsw1-e6-eqiad*' commit 'T380350' [17:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:26] !log homer 'lsw1-e5-eqiad*' commit 'T380350' [17:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:00] !log homer 'lsw1-f5-eqiad*' commit 'T380350' [17:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:28] !log homer 'lsw1-f6-eqiad*' commit 'T380350' [17:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:01] !log homer 'lsw1-e7-eqiad*' commit 'T380350' [17:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:30] !log homer 'cr*eqiad*' commit 'T380350' [17:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:08] PROBLEM - Host wikikube-worker2032 is DOWN: PING CRITICAL - Packet loss = 100% [17:38:04] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:38:08] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:38:29] hmm [17:39:16] RECOVERY - Host 
wikikube-worker2032 is UP: PING OK - Packet loss = 0%, RTA = 30.27 ms [17:40:04] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 238, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:40:08] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 322, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:41:41] well it crashed and rebooted, ok [17:42:52] RECOVERY - Host dns7002 is UP: PING OK - Packet loss = 0%, RTA = 115.14 ms [17:42:58] PROBLEM - Bird Internet Routing Daemon on dns7002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [17:43:21] you say "flaky," I say "autonomous self-healing" [17:43:50] PROBLEM - Check if anycast-healthchecker and all configured threads are running on dns7002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [17:43:59] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:44:26] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:44:55] (03PS1) 10Ilias Sarantopoulos: ml-services: increase memory in rec api [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098103 [17:45:16] PROBLEM - NTP peers and stratum check on dns7002 is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown, stratum=-1 (CRITICAL) https://wikitech.wikimedia.org/wiki/NTP [17:45:27] yeah 
[17:45:42] PROBLEM - Check whether ferm is active by checking the default input chain on dns7002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:46:03] there is nothing actionable here so far and nothing to worry about [17:46:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job haproxy in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:47:36] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1313-1327].eqiad.wmnet [17:47:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1313-1327].eqiad.wmnet [17:47:50] 06SRE, 10Parsoid, 10wikitech.wikimedia.org: Parsoid renders "Incident status" (wikitech) incorrectly - https://phabricator.wikimedia.org/T380899 (10fnegri) 03NEW [17:47:55] (03Abandoned) 10Ilias Sarantopoulos: ml-services: increase memory in rec api [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098103 (owner: 10Ilias Sarantopoulos) [17:55:16] RECOVERY - NTP peers and stratum check on dns7002 is OK: NTP OK: Offset -0.000914055 secs, stratum=1 https://wikitech.wikimedia.org/wiki/NTP [17:55:27] (03PS15) 10Hnowlan: mediawiki: add mercurius features [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) [17:55:34] RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:55:35] stratum of 1 is nice.
[17:55:52] RECOVERY - Check if anycast-healthchecker and all configured threads are running on dns7002 is OK: OK: UP (pid=20499) and all threads (3) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [17:55:54] RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:56:00] RECOVERY - Bird Internet Routing Daemon on dns7002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [17:56:12] (03CR) 10Hnowlan: mediawiki: add mercurius features (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [17:56:37] (03CR) 10CI reject: [V:04-1] mediawiki: add mercurius features [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [17:57:54] 06SRE, 06Infrastructure-Foundations: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731#10358768 (10JMeybohm) [17:58:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:58:56] !log jayme@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on D{wikikube-ctrl200[1-3].codfw.wmnet} and (A:wikikube-worker-codfw or A:wikikube-master-codfw) [17:59:51] 06SRE, 06Infrastructure-Foundations: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731#10358779 (10ops-monitoring-bot) Started rebooting nodes in wikikube-codfw cluster: * wikikube-ctrl[2001-2003].codfw.wmnet [18:00:05] Deploy window MediaWiki infrastructure (UTC late) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241126T1800) [18:00:10] PROBLEM - Host 2a02:ec80:700:102:195:200:68:37 is DOWN: CRITICAL - Host Unreachable (2a02:ec80:700:102:195:200:68:37) [18:01:52] 06SRE, 10Parsoid, 10wikitech.wikimedia.org: Parsoid renders "Incident status" (wikitech) incorrectly - https://phabricator.wikimedia.org/T380899#10358783 (10ssastry) This may just be {T356718} which might be resolvable soon. [18:03:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CodeMirror] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1097591 (https://phabricator.wikimedia.org/T376735) (owner: 10MusikAnimal) [18:04:08] PROBLEM - BGP status on lsw1-b7-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:05:08] RECOVERY - BGP status on lsw1-b7-codfw.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:07:54] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: Decommission parse20[01-20] - https://phabricator.wikimedia.org/T380473#10358812 (10Papaul) [18:10:02] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2215.codfw.wmnet with reason: Maintenance [18:10:16] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2215.codfw.wmnet with reason: Maintenance [18:11:00] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: Decommission parse20[01-20] - https://phabricator.wikimedia.org/T380473#10358814 (10Papaul) 05Open→03Resolved a:03Papaul This complete [18:12:07] (03PS3) 10Bking: wdqs102[567]: install OS [puppet] - 10https://gerrit.wikimedia.org/r/1097564 (https://phabricator.wikimedia.org/T378030) 
[18:12:46] PROBLEM - BGP status on lsw1-c7-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:12:54] (03CR) 10Bking: wdqs102[567]: install OS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1097564 (https://phabricator.wikimedia.org/T378030) (owner: 10Bking) [18:14:46] RECOVERY - BGP status on lsw1-c7-codfw.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:15:42] RECOVERY - Check whether ferm is active by checking the default input chain on dns7002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:21:07] (03PS16) 10Hnowlan: mediawiki: add mercurius features [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) [18:21:37] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098111 [18:24:29] (03CR) 10Bking: [C:03+2] wdqs102[567]: install OS [puppet] - 10https://gerrit.wikimedia.org/r/1097564 (https://phabricator.wikimedia.org/T378030) (owner: 10Bking) [18:25:08] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on D{wikikube-ctrl200[1-3].codfw.wmnet} and (A:wikikube-worker-codfw or A:wikikube-master-codfw) [18:25:22] (03CR) 10Bking: [V:03+2 C:03+2] "Self-merging after addressing comment; this should not affect other hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/1097564 (https://phabricator.wikimedia.org/T378030) (owner: 10Bking) [18:30:22] PROBLEM - Host dns7002 is DOWN: PING CRITICAL - Packet loss = 100% [18:30:58] PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:31:36] PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:33:19] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.pool db2215 gradually with 4 steps - Maint over [18:34:17] !log ladsgroup@cumin1002 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db2215 gradually with 4 steps - Maint over [18:34:42] FIRING: JobUnavailable: Reduced availability for job pdnsrec in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:35:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2215 repool', diff saved to https://phabricator.wikimedia.org/P71187 and previous config saved to /var/cache/conftool/dbconfig/20241126-183547-ladsgroup.json [18:35:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2215 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P71188 and previous config saved to /var/cache/conftool/dbconfig/20241126-183556-ladsgroup.json [18:36:26] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [18:36:34] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [18:36:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to 
expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:39:42] FIRING: [2x] JobUnavailable: Reduced availability for job haproxy in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:48:34] ACKNOWLEDGEMENT - MD RAID on cp7004 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T380905 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [18:49:13] 10ops-magru, 06SRE: Degraded RAID on cp7004 - https://phabricator.wikimedia.org/T380905 (10ops-monitoring-bot) 03NEW [18:51:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2215 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P71189 and previous config saved to /var/cache/conftool/dbconfig/20241126-185101-ladsgroup.json [18:55:24] (03CR) 10Krinkle: Introduce preinstall.dblist for wikis that haven't been installed yet (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1096839 (https://phabricator.wikimedia.org/T352113) (owner: 10Tim Starling) [18:55:52] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti7001.magru.wmnet [18:58:31] (03PS1) 10Dzahn: aptrepo: allow gitlab package upgrades up to version 17.5 [puppet] - 10https://gerrit.wikimedia.org/r/1098117 [19:01:12] RECOVERY - Host 2a02:ec80:700:102:195:200:68:37 is UP: PING OK - Packet loss = 0%, RTA = 115.17 ms [19:02:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097402 
(https://phabricator.wikimedia.org/T380753) (owner: 10Novem Linguae) [19:03:06] (03CR) 10Dzahn: [C:03+2] etherpad: Update Envoy firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1092839 (owner: 10Muehlenhoff) [19:03:54] PROBLEM - Recursive DNS on 2a02:ec80:700:102:195:200:68:37 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [19:06:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2215 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P71190 and previous config saved to /var/cache/conftool/dbconfig/20241126-190607-ladsgroup.json [19:06:47] (03CR) 10Dzahn: [C:03+2] "still working fine after deployment" [puppet] - 10https://gerrit.wikimedia.org/r/1092839 (owner: 10Muehlenhoff) [19:07:33] (03CR) 10Dzahn: [C:03+1] "Yea, I think so. With low priority but it does seem best to contact them to make sure." [puppet] - 10https://gerrit.wikimedia.org/r/1092362 (owner: 10BCornwall) [19:08:30] (03CR) 10Dzahn: [C:03+2] doc: Update Envoy firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1092825 (owner: 10Muehlenhoff) [19:11:06] !log robh@cumin2002 START - Cookbook sre.dns.netbox [19:12:14] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye [19:12:17] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wdqs1026.eqiad.wmnet with OS bullseye [19:12:22] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wdqs1027.eqiad.wmnet with OS bullseye [19:12:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10359061 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wdqs1025.eqiad.wmnet with OS bullseye [19:12:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10359062 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wdqs1026.eqiad.wmnet with OS bullseye [19:12:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10359063 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wdqs1027.eqiad.wmnet with OS bullseye [19:16:19] (03PS1) 10Hashar: Avoid exception on mTemplateIds/mTemplate array discrepancy [core] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098120 (https://phabricator.wikimedia.org/T380862) [19:19:07] 06SRE, 10Incident-Reporting-System (Pilot wiki release December 2024), 10Trust and Safety Product Sprint (Sprint Gong (November 18 - December 6)): Allow Extension:ReportIncident to make POST requests to wikimediats.zendesk.com - https://phabricator.wikimedia.org/T380908 (10kostajh) 03NEW [19:20:47] (03CR) 10Dzahn: [C:03+2] "yep, working normal after deployment" [puppet] - 10https://gerrit.wikimedia.org/r/1092825 (owner: 10Muehlenhoff) [19:21:10] (03CR) 10Dzahn: [C:03+2] aptrepo: allow gitlab package upgrades up to version 17.5 [puppet] - 10https://gerrit.wikimedia.org/r/1098117 (owner: 10Dzahn) [19:21:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2215 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P71191 and previous config saved to /var/cache/conftool/dbconfig/20241126-192112-ladsgroup.json [19:23:15] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - robh@cumin2002" [19:23:48] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - 
robh@cumin2002" [19:23:49] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:23:50] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti7001.magru.wmnet [19:23:58] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10359109 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `ganeti7001.magru.wmnet` - ganeti70... [19:26:44] !log mwscript-k8s -f userOptions.php -- --wiki=enwiki --old=oldimpact --delete 'growthexperiments-homepage-variant' # T379146 [19:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:49] T379146: Review remaining growthexperiments-homepage-variant rows at enwiki - https://phabricator.wikimedia.org/T379146 [19:27:49] !log [urbanecm@mwmaint2002 ~]$ foreachwiki userOptions.php --delete-defaults growthexperiments-homepage-variant # T379146, logging to /home/urbanecm/T379146.log [19:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:56] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp7003.magru.wmnet [19:29:53] PROBLEM - Host lvs7001 is DOWN: PING CRITICAL - Packet loss = 100% [19:30:33] PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:30:35] PROBLEM - Host ncredir7002 is DOWN: PING CRITICAL - Packet loss = 100% [19:30:37] PROBLEM - Host prometheus7001 is DOWN: PING CRITICAL - Packet loss = 100% [19:30:43] PROBLEM - Host bast7001 is DOWN: PING CRITICAL - Packet loss = 100% [19:30:43] PROBLEM - Host ganeti7004 is DOWN: PING CRITICAL - Packet loss = 100% [19:30:54] all expected [19:33:29] !log robh@cumin2002 START - Cookbook sre.dns.netbox [19:33:46] thanks for saying that [19:34:39] problem is that the 
situation in magru is quite complicated [19:34:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job haproxy in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:34:44] so we can't downtime all the site [19:34:54] and we cannot selectively downtime hosts as they are being swapped [19:34:59] so sadly, we will have to deal with this [19:36:03] RECOVERY - Host ganeti7004 is UP: PING OK - Packet loss = 0%, RTA = 115.15 ms [19:36:25] FIRING: SystemdUnitFailed: netbox_ganeti_magru02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:36:38] jouncebot: nextandnow [19:36:47] jouncebot: nowandnext [19:36:47] No deployments scheduled for the next 1 hour(s) and 23 minute(s) [19:36:47] In 1 hour(s) and 23 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241126T2100) [19:36:48] ...
[19:37:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098120 (https://phabricator.wikimedia.org/T380862) (owner: 10Hashar) [19:40:29] RECOVERY - Host ncredir7002 is UP: PING OK - Packet loss = 0%, RTA = 116.00 ms [19:40:35] RECOVERY - Host prometheus7001 is UP: PING OK - Packet loss = 0%, RTA = 115.43 ms [19:40:38] ok, last pair checks out [19:40:43] RECOVERY - Host bast7001 is UP: PING OK - Packet loss = 0%, RTA = 115.45 ms [19:42:53] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - robh@cumin2002" [19:42:57] FIRING: [10x] ProbeDown: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:43:00] sigh [19:43:03] !incidents [19:43:04] 5478 (UNACKED) [10x] ProbeDown sre (probes/service magru) [19:43:04] 5477 (RESOLVED) [10x] ProbeDown sre (probes/service magru) [19:43:04] 5475 (RESOLVED) [7x] ProbeDown sre (probes/service magru) [19:43:07] !ack 5478 [19:43:08] 5478 (ACKED) [10x] ProbeDown sre (probes/service magru) [19:43:12] I promise I downtimed this guys [19:43:22] :( [19:43:22] haha no worries [19:43:29] karma always gets to me [19:43:30] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - robh@cumin2002" [19:43:30] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:43:31] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cp7003.magru.wmnet [19:43:38] 10ops-magru, 06DC-Ops, 
06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10359240 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp7003.magru.wmnet` - cp7003.magru... [19:43:41] > rt was acknowledged using karma on Tue, 26 Nov 2024 16:12:48 GMT — sukhe Expires in 5 hours [19:43:48] expires in five hours and still paged [19:45:11] jouncebot: nowandnext [19:45:12] No deployments scheduled for the next 1 hour(s) and 14 minute(s) [19:45:12] In 1 hour(s) and 14 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241126T2100) [19:45:15] ok, it seemed like I only did port 80 so did 443 as well. sorry about that. [19:46:06] hashar: were you planning to deploy something? I see you asking jouncebot the same question in the backscroll :) [19:46:42] FIRING: [2x] JobUnavailable: Reduced availability for job pdnsrec in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:46:56] yeah I am pushing up a fix to some PHP notices happening on 1.44.0-wmf.5 [19:46:59] !log robh@cumin2002 START - Cookbook sre.dns.netbox [19:47:02] 07sre-alert-triage, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Alert in need of triage: SmartNotHealthy (instance stat1011:9100) - https://phabricator.wikimedia.org/T380835#10359265 (10bking) This is an interesting one...`zram0` is a compressed RAMdisk, so it should not be in scope for any SMART (hard driv... [19:47:26] swfrench-wmf: ahead of the late window, which is well... 
too late [19:47:50] https://integration.wikimedia.org/zuul/#q=1098120 [19:47:57] CI ETA 9 minutes [19:48:08] 06SRE, 06Infrastructure-Foundations: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731#10359276 (10JMeybohm) [19:49:09] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp7003 [19:49:25] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp7003 [19:49:32] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti7001 [19:49:46] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti7001 [19:49:56] hashar: ah, I missed the approval of 1098120 in the backscroll somehow. I'll stay out of your way for the moment. if you wouldn't mind pinging me when you're done, that would be greatly appreciated. [19:50:15] swfrench-wmf: did you get a mediawiki patches to push? [19:50:16] "no deployments scheduled, but h.ashar asked me the same thing 9 minutes ago" would be a great j.ouncebot feature actually [19:50:19] patch [19:50:52] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru shuffle - robh@cumin2002" [19:50:57] rzl: lol [19:51:01] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru shuffle - robh@cumin2002" [19:51:01] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:51:25] RESOLVED: SystemdUnitFailed: netbox_ganeti_magru02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:51:27] rzl: I am not sure who maintains jouncebot , but you can probably file that in 
Phab against #jouncebot [19:51:42] FIRING: [3x] JobUnavailable: Reduced availability for job haproxy in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:52:01] hashar: yeah, I have a mediawiki-config backport that (while trivial and low-risk) might be good to get in _outside_ of the normal backport window [19:52:18] ahhh true [19:52:24] well I will let you know [19:52:32] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti7001.mgmt.magru.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:52:33] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp7003.mgmt.magru.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:52:42] take your time - what you're doing is more urgent :) [19:55:16] (03Merged) 10jenkins-bot: Avoid exception on mTemplateIds/mTemplate array discrepancy [core] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098120 (https://phabricator.wikimedia.org/T380862) (owner: 10Hashar) [19:55:35] ah [19:55:46] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1098120|Avoid exception on mTemplateIds/mTemplate array discrepancy (T380862)]] [19:55:50] T380862: PHP Notice: Trying to access array offset on value of type null - https://phabricator.wikimedia.org/T380862 [19:59:39] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti7001.mgmt.magru.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:00:23] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10359342 (10RobH) [20:00:37] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
cp7003.mgmt.magru.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:01:24] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10359353 (10RobH) [20:02:20] !log hashar@deploy2002 hashar: Backport for [[gerrit:1098120|Avoid exception on mTemplateIds/mTemplate array discrepancy (T380862)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:02:22] 20:01:10 K8s deployment progress: 100% (ok: 12; fail: 0; left: 0) [20:02:22] 20:01:10 Finished sync-testservers-k8s (duration: 04m 21s) [20:02:24] T380862: PHP Notice: Trying to access array offset on value of type null - https://phabricator.wikimedia.org/T380862 [20:02:27] I am still wondering what takes so long :b [20:02:36] !log hashar@deploy2002 hashar: Continuing with sync [20:04:09] (03PS1) 10Aklapper: phabricator weekly changes email: List Deadline tasks without Due Date [puppet] - 10https://gerrit.wikimedia.org/r/1098132 (https://phabricator.wikimedia.org/T380915) [20:04:13] (03PS1) 10Bking: wdqs102[567]: move back to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1098133 (https://phabricator.wikimedia.org/T378030) [20:04:49] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti7002.magru.wmnet [20:04:51] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp7004.magru.wmnet [20:05:04] (03CR) 10Dzahn: [C:03+2] "quick deploy" [puppet] - 10https://gerrit.wikimedia.org/r/1098132 (https://phabricator.wikimedia.org/T380915) (owner: 10Aklapper) [20:07:25] !log aokoth@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Update [20:07:30] !log aokoth@cumin1002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: Security Update [20:08:19] !log aokoth@cumin1002 START - Cookbook 
sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Update [20:09:08] 06SRE, 06collaboration-services: gitlab runners don't have the apt.wikimedia.org key - https://phabricator.wikimedia.org/T380164#10359381 (10BCornwall) Of course - there's always five different ways to do it (four of which are deprecated) in Debian-land so it's good to get all that sorted. [20:09:28] !log robh@cumin2002 START - Cookbook sre.dns.netbox [20:10:45] (03CR) 10Dzahn: [C:03+2] "ship it" [puppet] - 10https://gerrit.wikimedia.org/r/1098132 (https://phabricator.wikimedia.org/T380915) (owner: 10Aklapper) [20:11:09] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098120|Avoid exception on mTemplateIds/mTemplate array discrepancy (T380862)]] (duration: 15m 23s) [20:11:20] swfrench-wmf: I have deployed my patch [20:11:53] T380862: PHP Notice: Trying to access array offset on value of type null - https://phabricator.wikimedia.org/T380862 [20:12:00] hashar: thanks! is that it, or do you need to get anything else in before the backport window starts? [20:13:03] nop I am done [20:13:07] and confirmed the log is gone [20:13:16] so you can go ahead [20:13:20] awesome, thank you! 
[20:13:39] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti7002.magru.wmnet decommissioned, removing all IPs except the asset tag one - robh@cumin2002" [20:13:54] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti7002.magru.wmnet decommissioned, removing all IPs except the asset tag one - robh@cumin2002" [20:13:54] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:13:55] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ganeti7002.magru.wmnet [20:14:00] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10359385 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `ganeti7002.magru.wmnet` - ganeti70... 
[20:14:19] !log robh@cumin2002 START - Cookbook sre.dns.netbox [20:14:49] FYI, I'll get started on my config backport momentarily [20:15:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076848 (https://phabricator.wikimedia.org/T372605) (owner: 10Scott French) [20:15:59] !log aokoth@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Security Update [20:16:41] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:16:42] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cp7004.magru.wmnet [20:16:46] (03Merged) 10jenkins-bot: debug.json: add support for mwdebug-next [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076848 (https://phabricator.wikimedia.org/T372605) (owner: 10Scott French) [20:16:50] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10359388 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp7004.magru.wmnet` - cp7004.magru... 
[20:17:13] !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1076848|debug.json: add support for mwdebug-next (T372605)]] [20:17:17] T372605: Extend x-wikimedia-debug-routing.lua to support PHP 8.1 mw-debug deployment - https://phabricator.wikimedia.org/T372605 [20:21:23] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: security release 20241126 [20:21:29] !log robh@cumin2002 START - Cookbook sre.dns.netbox [20:22:06] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:22:08] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti7002.mgmt.magru.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:23:16] !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1076848|debug.json: add support for mwdebug-next (T372605)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:23:24] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti7002.mgmt.magru.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:23:28] T372605: Extend x-wikimedia-debug-routing.lua to support PHP 8.1 mw-debug deployment - https://phabricator.wikimedia.org/T372605 [20:23:41] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti7002 [20:23:52] !log robh@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host ganeti7002 [20:24:47] !log swfrench@deploy2002 swfrench: Continuing with sync [20:25:00] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti7002 [20:25:03] !log robh@cumin2002 
END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti7002 [20:25:22] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti7002.mgmt.magru.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:25:49] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10359418 (10BCornwall) [20:26:17] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru shuffle - robh@cumin2002" [20:26:23] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru shuffle - robh@cumin2002" [20:26:23] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:26:39] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp7004.mgmt.magru.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:26:49] dzahn@cumin2002 dzahn: The backup on gitlab1004 is complete, ready to proceed with upgrade. 
[20:27:51] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10359434 (10RobH) [20:31:34] !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1076848|debug.json: add support for mwdebug-next (T372605)]] (duration: 14m 21s) [20:31:49] T372605: Extend x-wikimedia-debug-routing.lua to support PHP 8.1 mw-debug deployment - https://phabricator.wikimedia.org/T372605 [20:32:22] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti7002.mgmt.magru.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:32:28] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1025.eqiad.wmnet with OS bullseye [20:32:31] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1026.eqiad.wmnet with OS bullseye [20:32:35] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1027.eqiad.wmnet with OS bullseye [20:32:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 5 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10359461 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wdqs1025.eqiad.wmnet with OS bullseye executed with errors... [20:32:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 5 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10359462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wdqs1026.eqiad.wmnet with OS bullseye executed with errors... 
[20:32:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 5 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10359463 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wdqs1027.eqiad.wmnet with OS bullseye executed with errors... [20:34:03] all done - no further deployments on my end [20:34:08] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp7004.mgmt.magru.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:34:13] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10359469 (10RobH) [20:34:34] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10359472 (10RobH) [20:35:04] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7003.magru.wmnet with OS bullseye [20:37:19] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts dns7002.wikimedia.org [20:37:26] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp7002.magru.wmnet [20:39:00] !log dzahn@cumin2002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: security release 20241126 [20:43:33] !log robh@cumin2002 START - Cookbook sre.dns.netbox [20:44:48] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7004.magru.wmnet with OS bullseye [20:47:03] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns7002.wikimedia.org decommissioned, removing all IPs except the asset tag one - robh@cumin2002" [20:47:20] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: 
"Triggered by cookbooks.sre.dns.netbox: dns7002.wikimedia.org decommissioned, removing all IPs except the asset tag one - robh@cumin2002" [20:47:21] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:47:22] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts dns7002.wikimedia.org [20:47:32] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10359553 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `dns7002.wikimedia.org` - dns7002.w... [20:47:51] !log robh@cumin2002 START - Cookbook sre.dns.netbox [20:49:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1007:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [20:50:09] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:50:10] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cp7002.magru.wmnet [20:50:15] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10359556 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp7002.magru.wmnet` - cp7002.magru... [20:50:50] (03CR) 10Dzahn: "Error: no parameter named 'max_files'." 
[puppet] - 10https://gerrit.wikimedia.org/r/1073531 (https://phabricator.wikimedia.org/T374885) (owner: 10JHathaway) [20:51:42] FIRING: [3x] JobUnavailable: Reduced availability for job haproxy in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:53:22] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [20:54:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1007:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [20:54:50] !log robh@cumin2002 START - Cookbook sre.dns.netbox [20:54:51] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7003.magru.wmnet with reason: host reimage [20:58:41] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7003.magru.wmnet with reason: host reimage [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241126T2100). [21:00:05] Jdlrobson, musikanimal, and NovemLinguae: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:11] o/ [21:00:54] hello! [21:01:03] today the role of Jdlrobson will be played by me [21:01:08] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp7002 [21:01:24] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp7002 [21:01:26] 🤘 [21:01:28] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dns7002 [21:01:41] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru shuffle - robh@cumin2002" [21:01:43] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dns7002 [21:01:50] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru shuffle - robh@cumin2002" [21:01:50] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:02:00] Do we have a deployer today? 
[21:02:32] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp7002.mgmt.magru.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:02:34] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host dns7002.mgmt.magru.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:02:51] !log brett@cumin2002 START - Cookbook sre.dns.wipe-cache cp7003.magru.wmnet cp7004.magru.wmnet on all recursors [21:02:54] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cp7003.magru.wmnet cp7004.magru.wmnet on all recursors [21:03:17] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10359591 (10RobH) [21:03:17] I hope so because no one showed up for the morning backport window, either :( [21:03:37] I can just do it [21:03:40] yay! [21:03:54] Thank you! [21:04:16] (03CR) 10Reedy: [C:03+2] Add BetaFeature for CodeMirror 6 [extensions/CodeMirror] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1097591 (https://phabricator.wikimedia.org/T376735) (owner: 10MusikAnimal) [21:04:27] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp7002.mgmt.magru.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:04:35] (03PS2) 10Novem Linguae: enwiki: add "mergehistory" to "import" user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097402 (https://phabricator.wikimedia.org/T380753) [21:04:39] (03CR) 10Reedy: [C:03+2] enwiki: add "mergehistory" to "import" user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097402 (https://phabricator.wikimedia.org/T380753) (owner: 10Novem Linguae) [21:04:47] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dns7002.mgmt.magru.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy 
FORCED [21:05:23] (03Merged) 10jenkins-bot: enwiki: add "mergehistory" to "import" user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097402 (https://phabricator.wikimedia.org/T380753) (owner: 10Novem Linguae) [21:06:03] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10359593 (10RobH) [21:08:18] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts lvs7001.magru.wmnet [21:08:25] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp7010.magru.wmnet [21:08:44] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp7004.magru.wmnet with OS bullseye [21:10:01] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7004.magru.wmnet with OS bullseye [21:14:28] (03Merged) 10jenkins-bot: Add BetaFeature for CodeMirror 6 [extensions/CodeMirror] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1097591 (https://phabricator.wikimedia.org/T376735) (owner: 10MusikAnimal) [21:15:31] (03PS2) 10Jdlrobson: Nov 26 2024: Vector 2022 Deployments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097484 (https://phabricator.wikimedia.org/T379799) [21:15:54] !log robh@cumin2002 START - Cookbook sre.dns.netbox [21:16:02] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7002.magru.wmnet with OS bullseye [21:17:04] (03CR) 10Ssingh: [C:03+1] "You have the context but looks good based on that!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1092359 (https://phabricator.wikimedia.org/T374640) (owner: 10BCornwall) [21:17:20] !log reedy@deploy2002 Synchronized wmf-config/core-Permissions.php: T380753 (duration: 11m 23s) [21:17:25] T380753: Add "mergehistory" permission to "import" group on English Wikipedia - https://phabricator.wikimedia.org/T380753 [21:19:24] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp7010.magru.wmnet decommissioned, removing all IPs except the asset tag one - robh@cumin2002" [21:20:18] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp7010.magru.wmnet decommissioned, removing all IPs except the asset tag one - robh@cumin2002" [21:20:18] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:20:19] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cp7010.magru.wmnet [21:20:21] !log robh@cumin2002 START - Cookbook sre.dns.netbox [21:20:21] !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1097591|Add BetaFeature for CodeMirror 6 (T376735)]] [21:20:30] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10359620 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `cp7010.magru.wmnet` - cp7010.magru... 
[21:20:48] T376735: Release CodeMirror 6 as a beta feature - https://phabricator.wikimedia.org/T376735 [21:21:31] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns7002.wikimedia.org with OS bookworm [21:21:38] (03CR) 10Ryan Kemper: [C:03+2] wdqs102[567]: move back to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1098133 (https://phabricator.wikimedia.org/T378030) (owner: 10Bking) [21:22:38] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:22:39] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts lvs7001.magru.wmnet [21:22:43] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10359631 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `lvs7001.magru.wmnet` - lvs7001.mag... [21:22:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:23:28] reedy: Everything looks good in en:Special:UserGroupRights. Thank you! 
[21:25:50] !log robh@cumin2002 START - Cookbook sre.dns.netbox [21:26:07] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [21:27:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:28:07] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:28:35] !log robh@cumin2002 START - Cookbook sre.dns.netbox [21:29:16] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7004.magru.wmnet with reason: host reimage [21:29:49] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp7010 [21:30:04] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp7010 [21:30:12] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs7001 [21:30:16] musikanimal: this is what you get for backporting i18n changes ;) [21:30:25] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [21:30:25] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7003.magru.wmnet with OS bullseye [21:30:26] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs7001 [21:30:29] uh oh! 
[21:30:33] I mean it's slow :) [21:30:52] oh okay, phew [21:30:55] slow is fine, lol [21:31:01] PROBLEM - Disk space on deploy2002 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/fb18a7c447b10307f16fe5b5844f8f76ac0ce335c5c8c122efd3154d2f280968/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy2002&var-datasource=codfw+prometheus/ops [21:31:15] thank you for doing this, by the way! we missed the train by only minutes [21:32:08] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru shuffle - robh@cumin2002" [21:32:11] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7004.magru.wmnet with reason: host reimage [21:32:13] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru shuffle - robh@cumin2002" [21:32:14] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:32:51] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp7010.mgmt.magru.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:32:55] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host lvs7001.mgmt.magru.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:33:38] should've done the other config one first :D [21:33:47] oh, there we go [21:33:50] it's on the move again [21:34:29] test servers incoming [21:34:47] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp7010.mgmt.magru.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:35:19] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs7001.mgmt.magru.wmnet with chassis set policy 
FORCE_RESTART and with Dell SCP reboot policy FORCED [21:35:21] !log brett@cumin2002 START - Cookbook sre.dns.wipe-cache cp7002.magru.wmnet dns7002.magru.wmnet on all recursors [21:35:24] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cp7002.magru.wmnet dns7002.magru.wmnet on all recursors [21:35:42] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns7002.wikimedia.org with OS bookworm [21:35:50] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp7002.magru.wmnet with OS bullseye [21:35:57] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10359683 (10RobH) [21:37:12] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10359684 (10BCornwall) [21:38:26] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7002.magru.wmnet with OS bullseye [21:39:49] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10359697 (10RobH) @MoritzMuehlenhoff : ganeti700[12] are ready for reimage but I've just run out of steam for today. If you don't get to... 
[21:43:59] !log reedy@deploy2002 musikanimal, reedy: Backport for [[gerrit:1097591|Add BetaFeature for CodeMirror 6 (T376735)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:44:03] T376735: Release CodeMirror 6 as a beta feature - https://phabricator.wikimedia.org/T376735 [21:44:26] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:44:34] which test server should I be looking at? [21:45:17] I've just realised, this isn't actually testable as is, is it? [21:45:24] As codemirror-beta-feature-enable isn't in $wgBetaFeaturesAllowList [21:45:31] (yet) [21:45:38] oh for real!? [21:45:53] >// DO NOT add entries here without OK from Greg Grossmeier or James Forrester. [21:45:55] FIRING: MaxConntrack: Max conntrack at 91.15% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [21:46:05] James didn't mention any of that in code review, but I guess that explains why other extensions had a feature flag for the beta feature [21:46:07] greg-g: ^ Think we need to change you for Tyler or someone? :P [21:46:21] !log reedy@deploy2002 musikanimal, reedy: Continuing with sync [21:48:29] (03CR) 10Reedy: [C:03+2] Nov 26 2024: Vector 2022 Deployments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097484 (https://phabricator.wikimedia.org/T379799) (owner: 10Jdlrobson) [21:48:34] so I have a lot of people waiting on this, today was supposed to be the big day… gonna see if I can get the feature added to the allowlist. Anyway thanks for doing this sync Reedy! 
maybe I'll see you at the next backport window [21:48:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:48:59] PROBLEM - Check size of conntrack table on krb1001 is CRITICAL: CRITICAL: nf_conntrack is 97 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [21:49:05] musikanimal: If it's written somewhere it's approved... We can deploy the flag in wmf-config after ^ [21:49:10] (03Merged) 10jenkins-bot: Nov 26 2024: Vector 2022 Deployments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097484 (https://phabricator.wikimedia.org/T379799) (owner: 10Jdlrobson) [21:50:04] I got the go-ahead from Editing https://phabricator.wikimedia.org/T376735#10294225 and James +1'd my patch [21:50:17] I spoke with him in person so I know he's aware this is meant to be deployed everywhere [21:50:55] RESOLVED: MaxConntrack: Max conntrack at 97.86% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [21:50:59] RECOVERY - Check size of conntrack table on krb1001 is OK: OK: nf_conntrack is 40 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [21:51:01] RECOVERY - Disk space on deploy2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy2002&var-datasource=codfw+prometheus/ops [21:53:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:54:45] arguably it's fine... 
[21:55:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:56:10] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns7002.wikimedia.org with OS bullseye [21:56:42] RESOLVED: JobUnavailable: Reduced availability for job pybal in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:57:11] toyofuku: Do you care much about testing yours? [21:57:21] Yeah, but I can do it quickly [21:57:34] it's not quite finished the codemirror one yet [21:57:40] though I'm not fussed if the window overruns a little [21:57:44] I'll be around for a bit [21:57:55] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [21:58:19] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7002.magru.wmnet with reason: host reimage [21:58:31] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [21:58:31] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7004.magru.wmnet with OS bullseye [21:58:54] FIRING: PyBalBGPUnstable: PyBal BGP sessions on instance lvs7003 with peer 10.140.0.1 are failing #page - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://grafana.wikimedia.org/d/000000488/pybal-bgp?var-datasource=magru%20prometheus/ops&var-server=lvs7003 - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable [21:58:55] RECOVERY - PyBal backends health check on lvs7003 is OK: PYBAL OK - All pools are 
healthy https://wikitech.wikimedia.org/wiki/PyBal [21:59:10] !incidents [21:59:10] 5478 (ACKED) [10x] ProbeDown sre (probes/service magru) [21:59:10] 5480 (UNACKED) PyBalBGPUnstable lvs sre (lvs7003:9090 pybal 64600 10.140.0.1 magru) [21:59:11] 5477 (RESOLVED) [10x] ProbeDown sre (probes/service magru) [21:59:11] 5475 (RESOLVED) [7x] ProbeDown sre (probes/service magru) [21:59:34] !ack 5480 [21:59:34] 5480 (ACKED) PyBalBGPUnstable lvs sre (lvs7003:9090 pybal 64600 10.140.0.1 magru) [21:59:43] ^ brett silence the above please [21:59:44] thanks [21:59:49] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10359788 (10BCornwall) [21:59:53] 1 day, we can revisit tomorrow [22:00:21] ack [22:00:27] !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1097591|Add BetaFeature for CodeMirror 6 (T376735)]] (duration: 40m 05s) [22:00:41] T376735: Release CodeMirror 6 as a beta feature - https://phabricator.wikimedia.org/T376735 [22:00:42] thanks again sukhe [22:00:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:00:59] your day's probably almost up, anything I should know for the next couple hours? 
[22:01:13] !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1097484|Nov 26 2024: Vector 2022 Deployments (T379799)]] [22:01:18] T379799: Nov 26 2024: Vector 2022 Deployments - https://phabricator.wikimedia.org/T379799 [22:01:29] magru is in pretty rough shape [22:01:39] lvs7003 had a broken link [22:01:51] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7002.magru.wmnet with reason: host reimage [22:02:03] (03PS1) 10MusikAnimal: Add CodeMirror to BetaFeaturesAllowList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098161 (https://phabricator.wikimedia.org/T376735) [22:02:32] if it's not too late, Reedy ^ [22:02:46] no pressure of course, you've been deploying for over an hour now! [22:03:08] silenced all PyBalBGPUnstable alerts in magru for a day [22:03:11] I've been sitting around watching scap do most of the work :D [22:03:19] lol [22:03:39] (03Abandoned) 10Reedy: InitialiseSettings.php: Reduce indenting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097474 (owner: 10Reedy) [22:04:26] FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on wikikube-worker1256:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:08:01] !log reedy@deploy2002 jdlrobson, reedy: Backport for [[gerrit:1097484|Nov 26 2024: Vector 2022 Deployments (T379799)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:08:06] T379799: Nov 26 2024: Vector 2022 Deployments - https://phabricator.wikimedia.org/T379799 [22:08:06] there we go [22:08:35] testing! 
[22:09:26] FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on wikikube-worker1256:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:11:47] looks good, thank you! [22:11:50] sweet [22:11:52] !log reedy@deploy2002 jdlrobson, reedy: Continuing with sync [22:13:21] (03CR) 10Reedy: [C:03+2] Add CodeMirror to BetaFeaturesAllowList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098161 (https://phabricator.wikimedia.org/T376735) (owner: 10MusikAnimal) [22:14:02] (03Merged) 10jenkins-bot: Add CodeMirror to BetaFeaturesAllowList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098161 (https://phabricator.wikimedia.org/T376735) (owner: 10MusikAnimal) [22:14:50] !log cmooney@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 262979 [22:15:40] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 262979 [22:17:06] FIRING: [13x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:17:19] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 192335440 and 46 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:18:19] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 16424 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:18:43] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns7002.wikimedia.org with reason: host reimage [22:19:11] FIRING: [13x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:20:06] (03CR) 10Dzahn: 
[C:03+1] ci: Set Envoy firewall config in nftables-compatible syntax [puppet] - 10https://gerrit.wikimedia.org/r/1092863 (owner: 10Muehlenhoff) [22:20:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:21:05] !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1097484|Nov 26 2024: Vector 2022 Deployments (T379799)]] (duration: 19m 52s) [22:21:09] T379799: Nov 26 2024: Vector 2022 Deployments - https://phabricator.wikimedia.org/T379799 [22:21:35] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns7002.wikimedia.org with reason: host reimage [22:22:18] (03CR) 10Dzahn: "the reviewer-bot added me to this, just not sure exactly why it did that yet. The regexes I have might be a bit broad. respectfully removi" [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [22:22:52] !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1098161|Add CodeMirror to BetaFeaturesAllowList (T376735)]] [22:22:57] T376735: Release CodeMirror 6 as a beta feature - https://phabricator.wikimedia.org/T376735 [22:24:02] !log cmooney@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 4800 [22:24:03] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dns7002.wikimedia.org with OS bullseye [22:24:26] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns7002.wikimedia.org with OS bookworm [22:25:20] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7010.magru.wmnet with OS bullseye [22:25:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:25:50] Reedy: thank you!! [22:26:10] extended that silence to 1 day [22:26:30] (03PS4) 10Dzahn: scap target: ensure scap is installed on host before it is required [puppet] - 10https://gerrit.wikimedia.org/r/1092841 (https://phabricator.wikimedia.org/T378769) (owner: 10Jaime Nuche) [22:26:34] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1092841/4602/" [puppet] - 10https://gerrit.wikimedia.org/r/1092841 (https://phabricator.wikimedia.org/T378769) (owner: 10Jaime Nuche) [22:26:55] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'configure' for AS: 4800 [22:27:11] (03PS5) 10Dzahn: scap target: ensure scap is installed on host before it is required [puppet] - 10https://gerrit.wikimedia.org/r/1092841 (https://phabricator.wikimedia.org/T378769) (owner: 10Jaime Nuche) [22:27:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:27:46] (03CR) 10Dzahn: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092841 (https://phabricator.wikimedia.org/T378769) (owner: 10Jaime Nuche) [22:28:27] (03CR) 10Paladox: "Dunno why I was added as a reviewer. I don't subscribe to this module on the mw page for the bot." 
[puppet] - 10https://gerrit.wikimedia.org/r/1092841 (https://phabricator.wikimedia.org/T378769) (owner: 10Jaime Nuche) [22:28:52] !log reedy@deploy2002 musikanimal, reedy: Backport for [[gerrit:1098161|Add CodeMirror to BetaFeaturesAllowList (T376735)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:28:57] take 2 [22:28:57] T376735: Release CodeMirror 6 as a beta feature - https://phabricator.wikimedia.org/T376735 [22:29:30] is there a specific debug server I should check? [22:30:23] mwdebug* ones should be fine [22:31:31] hmm I'm not seeing the syntax highlighting beta feature at https://test.wikipedia.org/wiki/Special:Preferences#mw-prefsection-betafeatures :( [22:31:46] I know this is often wrong but https://test.wikipedia.org/wiki/Special:Version is 1 commit behind for CodeMirror [22:32:08] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [22:33:03] (03PS4) 10Tim Starling: Introduce preinstall.dblist for wikis that haven't been installed yet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1096839 (https://phabricator.wikimedia.org/T352113) [22:34:32] are we able to verify https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CodeMirror/+/1097591 was deployed? 
[22:35:50] (03CR) 10Tim Starling: Introduce preinstall.dblist for wikis that haven't been installed yet (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1096839 (https://phabricator.wikimedia.org/T352113) (owner: 10Tim Starling) [22:36:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [22:37:30] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [22:37:30] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7002.magru.wmnet with OS bullseye [22:37:53] The code is definitely live (confirmed on mwdebug2002) [22:37:54] aha! I see it on other group0 wikis https://ak.wikipedia.org/wiki/Special:Preferences#mw-prefsection-betafeatures [22:38:25] ohh of course, because CM6 is enabled on testwiki [22:38:31] sorry [22:38:35] heh [22:38:52] it's fine, I was just looking at $wgCodeMirrorV6 [22:39:07] does test2 run off the same config as test1? 
https://test2.wikipedia.org/wiki/Special:Preferences#mw-prefsection-betafeatures [22:39:35] "it depends" [22:40:08] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10359954 (10BCornwall) [22:40:18] for wgCodeMirrorV6, test2wiki doesn't [22:40:36] the beta feature is supposed to show for any wiki where wgCodeMirrorV6 is false [22:40:47] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs7001.magru.wmnet with OS bullseye [22:41:35] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:41:36] trying to find a group0 wiki that I can edit, lol [22:41:45] (that doesn't have CodeMirrorV6 enabled) [22:41:57] https://versions.toolforge.org/ [22:42:06] yeah I've been going through them! [22:42:20] a lot (most?) are closed wikis [22:42:44] I see it on https://test.wikidata.org/wiki/Special:Preferences#mw-prefsection-betafeatures [22:42:51] >Improved Syntax Highlighting [22:43:42] (03CR) 10Dzahn: [C:03+1] "@Paladox Interesting! the same thing just happened to me. reviewer-bot added me to something that I did not expect to have in my regexes. " [puppet] - 10https://gerrit.wikimedia.org/r/1092841 (https://phabricator.wikimedia.org/T378769) (owner: 10Jaime Nuche) [22:44:28] reviewer-bot adds users to reviews on stuff that it normally does not or where it's not expected. interesting. maybe some edit on the special wiki page broke regexes. [22:44:39] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7010.magru.wmnet with reason: host reimage [22:45:13] alright! well I don't know why the beta feature isn't showing on test2, but testing working on testwikidata [22:45:26] I think we're good, Reedy. Thank you so much for your time today! :D [22:45:37] test2 isn't group0 it's group1 [22:45:41] because.. 
reasons :) [22:45:46] ahh! that explains it [22:45:57] !log reedy@deploy2002 musikanimal, reedy: Continuing with sync [22:46:20] I note because reasons could almost certainly be me years ago in this case [22:46:40] :-P [22:47:10] on a sandbox page on testwikidata [22:47:12] `asas {{foo}}` [22:47:19] the {{foo}} is purple [22:47:23] so vaguely WFM [22:48:04] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7010.magru.wmnet with reason: host reimage [22:48:06] (03CR) 10Dzahn: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092841 (https://phabricator.wikimedia.org/T378769) (owner: 10Jaime Nuche) [22:48:30] not sure what WFM is, but purple is the intended color [22:48:33] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns7002.wikimedia.org with reason: host reimage [22:49:23] we're going to add themes eventually https://phabricator.wikimedia.org/T163533 [22:49:26] musikanimal: https://en.wiktionary.org/wiki/WFM [22:49:44] ah, thank you! 
I'm terrible at acronyms [22:50:00] should have known to check the ole Wiktionary first ;-) [22:50:10] Wiktionary best wiki:) [22:51:54] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns7002.wikimedia.org with reason: host reimage [22:54:28] !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098161|Add CodeMirror to BetaFeaturesAllowList (T376735)]] (duration: 31m 35s) [22:54:32] T376735: Release CodeMirror 6 as a beta feature - https://phabricator.wikimedia.org/T376735 [22:57:08] PROBLEM - Recursive DNS on 195.200.68.37 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [23:00:03] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs7001.magru.wmnet with reason: host reimage [23:00:28] PROBLEM - Recursive DNS on 2a02:ec80:700:2:195:200:68:37 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [23:03:08] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs7001.magru.wmnet with reason: host reimage [23:04:52] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, 10GitLab (Infrastructure): Migrate gitlab storage to apus - https://phabricator.wikimedia.org/T378922#10360018 (10Dzahn) I ran across this https://bacula.org/whitepapers/CloudBackup.pdf (e.g. `we included an S3 driver that is compatible wit... [23:07:17] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus - https://phabricator.wikimedia.org/T378922#10360019 (10Dzahn) [23:08:03] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) 
- https://phabricator.wikimedia.org/T378922#10360020 (10Dzahn) [23:12:33] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [23:13:18] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [23:13:18] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7010.magru.wmnet with OS bullseye [23:19:34] (03CR) 10Scott French: "Thanks, Hugh!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [23:21:26] RECOVERY - Recursive DNS on 195.200.68.37 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [23:21:28] RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:21:34] RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:21:36] RECOVERY - Recursive DNS on 2a02:ec80:700:2:195:200:68:37 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [23:21:42] FIRING: JobUnavailable: Reduced availability for job pdnsrec in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:22:57] FIRING: [10x] ProbeDown: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:23:08] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:23:08] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [23:24:11] FIRING: [22x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:26:42] RESOLVED: JobUnavailable: Reduced availability for job pdnsrec in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:27:57] RESOLVED: [10x] ProbeDown: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:28:09] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [23:28:09] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs7001.magru.wmnet with OS bullseye [23:28:55] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [23:29:24] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [23:29:24] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns7002.wikimedia.org with OS bookworm [23:29:54] oh interesting, I didn't get the ncredir page through VO -- it looks like magru only, so I'm ignoring it, is that right brett? [23:30:10] correct, thanks! 
[23:30:15] 👍 [23:30:17] Sorry about these alerts, so annoying [23:33:34] PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:33:36] PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:35:01] expected [23:36:31] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10360135 (10BCornwall) [23:37:36] RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:37:38] RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:46:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [23:51:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures