[00:01:10] <Krinkle>	 zabe: hm.. how come this doesn't have a diffConfig diff? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/649609/
[00:04:38] <zabe>	 Krinkle: my guess is, because we are removing deploymentwiki from all-labs and buildConfigCache.php goes through that dblist
[00:05:17] <Krinkle>	 Hm.. but then we check out the parent commit and run it again, where it should create the tmp json files for that wiki, right?
[00:05:46] <Krinkle>	 I'd expect the diff to be that the file was effectively removed 
[00:06:03] <logmsgbot>	 !log zabe@deploy2002 Finished scap: T198673 (duration: 07m 25s)
[00:06:10] <stashbot>	 T198673: Remove deployment.wikimedia.beta.wmflabs.org wiki (deploymentwiki) - https://phabricator.wikimedia.org/T198673
[00:06:21] <wikibugs>	 (03CR) 10Dzahn: "Hi, so you have switched commons-query.wikimedia.org away from miscweb* but never removed the puppetization there. This lead to confusion " [puppet] - 10https://gerrit.wikimedia.org/r/720078 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson)
[00:06:47] <Krinkle>	 Ah, I see. It will detect a file being added but not removed, because you can't "git add" the absence of a file.
[00:07:32] <Krinkle>	 the way it works is that it stages the "after" state, and then diffs against that.
[00:08:36] <Krinkle>	 so from the diff perspective the "new" file in the before state is untracked
[00:09:01] <Krinkle>	 and ignoring untracked files is important, as we would otherwise also get all the other gitignored stuff in the diff
[00:09:51] <wikibugs>	 (03PS1) 10Superpes15: Revert "Change the trwiki logo with a temporary one (old vector)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892976 (https://phabricator.wikimedia.org/T329047)
[00:09:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "Change the trwiki logo with a temporary one (old vector)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892976 (https://phabricator.wikimedia.org/T329047) (owner: 10Superpes15)
[00:10:09] <Krinkle>	 To simulate it locally: touch x && git add x && rm x && git diff; that shows 'x' being removed.
[00:10:10] <Krinkle>	 but
[00:10:19] <zabe>	 ah, good catch
[00:10:33] <Krinkle>	 touch x && git diff; won't show that 'x' is added
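The two one-liners above can be reproduced in a throwaway repository. This is only a minimal sketch of the plain `git diff` behaviour being described, not the actual diffConfig build script; the temporary directory and the file names `x` and `y` are illustrative assumptions.

```shell
#!/bin/sh
set -eu

# Work in a scratch repository so nothing real is touched (illustrative setup).
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Case 1: stage a file, then delete it from the working tree.
# `git diff` compares the index with the working tree, so the
# staged file shows up as deleted.
touch x
git add x
rm x
removed=$(git diff --name-status)
echo "staged then deleted: $removed"

# Drop x from the index again so it does not affect case 2.
git rm -q --cached x

# Case 2: create a file without staging it.
# The file is untracked, so `git diff` reports nothing for it.
touch y
added=$(git diff --name-status)
echo "untracked file: '$added'"
```

In other words, a file that exists only in the "before" state is untracked relative to a staged "after" state, so it never appears in the diff.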
[00:12:25] <wikibugs>	 (03PS1) 10Zabe: Update interwiki cache for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892624
[00:13:28] <wikibugs>	 (03PS2) 10Zabe: Update interwiki cache for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892624
[00:13:40] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] Update interwiki cache for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892624 (owner: 10Zabe)
[00:14:28] <wikibugs>	 (03Merged) 10jenkins-bot: Update interwiki cache for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892624 (owner: 10Zabe)
[00:16:16] <wikibugs>	 (03PS1) 10Dzahn: remove commons-query virtual host from httpd on miscweb [puppet] - 10https://gerrit.wikimedia.org/r/893086 (https://phabricator.wikimedia.org/T330090)
[00:18:19] <wikibugs>	 (03PS2) 10Dzahn: remove commons-query virtual host from httpd on miscweb [puppet] - 10https://gerrit.wikimedia.org/r/893086 (https://phabricator.wikimedia.org/T330090)
[00:18:34] <wikibugs>	 (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/output/893086/39875/miscweb1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/893086 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn)
[00:22:11] <wikibugs>	 (03PS1) 10Dzahn: httpbb/miscweb: add missing/new virtual hosts to tests [puppet] - 10https://gerrit.wikimedia.org/r/893087 (https://phabricator.wikimedia.org/T330090)
[00:23:04] <wikibugs>	 (03CR) 10Dzahn: "[deploy1002:~] $ httpbb --hosts miscweb2002.codfw.wmnet ./test_miscweb.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/893087 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn)
[00:23:32] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] httpbb/miscweb: add missing/new virtual hosts to tests [puppet] - 10https://gerrit.wikimedia.org/r/893087 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn)
[00:26:23] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/892965 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[00:27:16] <wikibugs>	 (03PS1) 10Krinkle: build: Change diffConfig to use git-stash instead of git-add [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893088
[00:29:39] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] prometheus: add pint support [puppet] - 10https://gerrit.wikimedia.org/r/892986 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[00:29:58] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] prometheus: add pint source to ops [puppet] - 10https://gerrit.wikimedia.org/r/892987 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[00:30:53] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "LGTM" [debs/pint] - 10https://gerrit.wikimedia.org/r/892992 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[00:32:45] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "research.wikimedia.org - https://phabricator.wikimedia.org/T107389" [puppet] - 10https://gerrit.wikimedia.org/r/893087 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn)
[00:34:07] <wikibugs>	 (03CR) 10Dzahn: "It was hard to find this because there was no ticket about the creation of this. I added missing tests in https://gerrit.wikimedia.org/r/c" [puppet] - 10https://gerrit.wikimedia.org/r/724416 (owner: 10Muehlenhoff)
[00:34:48] <wikibugs>	 (03PS1) 10Superpes15: [trwiki] Reverting logo change for Vector 2022 and Vector legacy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893089 (https://phabricator.wikimedia.org/T329047)
[00:59:02] <wikibugs>	 (03PS1) 10Dzahn: switch (www).wikiworkshop.org from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/893090 (https://phabricator.wikimedia.org/T330090)
[00:59:53] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] switch (www).wikiworkshop.org from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/893090 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn)
[01:16:14] <wikibugs>	 (03CR) 10Krinkle: "Test Plan:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893088 (owner: 10Krinkle)
[01:16:56] <wikibugs>	 (03CR) 10Krinkle: "cc-ing Amir and Ahmon for awareness that this is/was a thing :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893088 (owner: 10Krinkle)
[01:33:01] <wikibugs>	 (03PS3) 10Krinkle: Remove legacy wgRC2UDPPrefix overrides for private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820245
[01:41:11] <wikibugs>	 (03CR) 10Krinkle: "Verified as follows:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820245 (owner: 10Krinkle)
[01:45:31] <wikibugs>	 (03PS2) 10Krinkle: noc: Clarify menu labels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891734
[01:45:35] <wikibugs>	 (03CR) 10Krinkle: [C: 03+2] noc: Clarify menu labels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891734 (owner: 10Krinkle)
[01:46:20] <wikibugs>	 (03Merged) 10jenkins-bot: noc: Clarify menu labels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891734 (owner: 10Krinkle)
[02:06:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:20] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Dzahn)
[02:13:13] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "Ah, this is nice, thanks Kosta" [puppet] - 10https://gerrit.wikimedia.org/r/893001 (owner: 10Kosta Harlan)
[02:21:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:52:03] <wikibugs>	 (03PS2) 10Andrew Bogott: OpenStack: rename 'user' role to 'member' [puppet] - 10https://gerrit.wikimedia.org/r/893036 (https://phabricator.wikimedia.org/T330759)
[02:52:05] <wikibugs>	 (03PS1) 10Andrew Bogott: cinder policy.yaml: redefine xena_system_admin_or_project_member rule [puppet] - 10https://gerrit.wikimedia.org/r/893097 (https://phabricator.wikimedia.org/T330759)
[02:57:39] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cinder policy.yaml: redefine xena_system_admin_or_project_member rule [puppet] - 10https://gerrit.wikimedia.org/r/893097 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott)
[04:12:10] <wikibugs>	 (03CR) 10Ebernhardson: "This should still be using miscweb, just not quite as directly. The requests go to a wcqs instance first, then the nginx there forwards ap" [puppet] - 10https://gerrit.wikimedia.org/r/720078 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson)
[04:24:13] <wikibugs>	 (03CR) 10Ebernhardson: [C: 04-1] "These are used, but not in the typical manner. The requests initially land at the wcqs instances directly so that it can put an oauth flow" [puppet] - 10https://gerrit.wikimedia.org/r/893086 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn)
[04:29:49] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Instrument-ClientError, 10patch-welcome: Prevent Firefox and Chrome extensions from being able to trigger alerts - https://phabricator.wikimedia.org/T330680 (10colewhite) There's a few ways we can cut these out.  Maybe first try something simple like: ` "must_not":...
[04:30:09] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Instrument-ClientError, 10Observability-Logging, 10patch-welcome: Prevent Firefox and Chrome extensions from being able to trigger alerts - https://phabricator.wikimedia.org/T330680 (10colewhite)
[04:58:01] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:58:25] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:58:35] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:01:54] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:02:08] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:11:01] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 216.1k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[05:26:01] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 203.6k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[05:30:24] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:37:21] <marostegui>	 !log Stop mysql on codfw sanitarium host db2095 (s2, s7, s6, s4) to clone db2187 T326596
[05:37:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:25] <stashbot>	 T326596: Productionize db218[567] - https://phabricator.wikimedia.org/T326596
[06:03:24] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:03:36] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:04:16] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:04:18] <icinga-wm>	 PROBLEM - BFD status on cr1-drmrs is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:11:01] <wikibugs>	 (03PS1) 10Marostegui: install_server: Do not reimage db2185 [puppet] - 10https://gerrit.wikimedia.org/r/893104 (https://phabricator.wikimedia.org/T326596)
[06:13:49] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db2185 [puppet] - 10https://gerrit.wikimedia.org/r/893104 (https://phabricator.wikimedia.org/T326596) (owner: 10Marostegui)
[06:14:56] <marostegui>	 !log Stop MySQL on db2094 T330828
[06:15:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:02] <stashbot>	 T330828: decommission db2094.codfw.wmnet - https://phabricator.wikimedia.org/T330828
[06:16:54] <wikibugs>	 (03PS1) 10Marostegui: db2093: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/893105 (https://phabricator.wikimedia.org/T330827)
[06:17:37] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db2093: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/893105 (https://phabricator.wikimedia.org/T330827) (owner: 10Marostegui)
[06:26:58] <wikibugs>	 (03PS1) 10Marostegui: control-mariadb-client-11.0-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/893106 (https://phabricator.wikimedia.org/T330643)
[06:27:38] <wikibugs>	 (03PS2) 10Marostegui: control-mariadb-client-11.0-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/893106 (https://phabricator.wikimedia.org/T330643)
[06:28:25] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] control-mariadb-client-11.0-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/893106 (https://phabricator.wikimedia.org/T330643) (owner: 10Marostegui)
[06:28:55] <wikibugs>	 (03Merged) 10jenkins-bot: control-mariadb-client-11.0-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/893106 (https://phabricator.wikimedia.org/T330643) (owner: 10Marostegui)
[06:30:18] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:34:05] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:34:16] <wikibugs>	 10SRE, 10Service-deployment-requests: Kaynak - https://phabricator.wikimedia.org/T330830 (10Metin6201)
[06:37:40] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:38:32] <icinga-wm>	 RECOVERY - BFD status on cr1-drmrs is OK: OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:41:38] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 23 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:42:28] <wikibugs>	 (03PS1) 10ArielGlenn: make dumpsdata1004 the xmlfallback host, with dumpsdata1001 as xml spare [puppet] - 10https://gerrit.wikimedia.org/r/893265 (https://phabricator.wikimedia.org/T330573)
[06:45:08] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:45:28] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:46:10] <icinga-wm>	 PROBLEM - BFD status on cr1-drmrs is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:55:41] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/893265/39876/ looks as expected. We have the latest data rsynced from dumpsdata1003 and 1002, s" [puppet] - 10https://gerrit.wikimedia.org/r/893265 (https://phabricator.wikimedia.org/T330573) (owner: 10ArielGlenn)
[06:57:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230301T0700)
[07:03:01] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 221.7k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[07:03:48] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:05:06] <wikibugs>	 (03Abandoned) 10ArielGlenn: delay start of the March xml dump rn unti the evening [puppet] - 10https://gerrit.wikimedia.org/r/893055 (https://phabricator.wikimedia.org/T330573) (owner: 10ArielGlenn)
[07:12:34] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:12:44] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 23 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:13:00] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:13:25] <wikibugs>	 (03PS2) 10ArielGlenn: Add dumpsdata1006 and dumpsdata1007 as spare dumps hosts and rsync pullers [puppet] - 10https://gerrit.wikimedia.org/r/893031 (https://phabricator.wikimedia.org/T330573)
[07:13:37] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add dumpsdata1006 and dumpsdata1007 as spare dumps hosts and rsync pullers [puppet] - 10https://gerrit.wikimedia.org/r/893031 (https://phabricator.wikimedia.org/T330573) (owner: 10ArielGlenn)
[07:13:48] <icinga-wm>	 RECOVERY - BFD status on cr1-drmrs is OK: OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:16:26] <wikibugs>	 (03PS3) 10ArielGlenn: Add dumpsdata1006 and dumpsdata1007 as spare dumps hosts and rsync pullers [puppet] - 10https://gerrit.wikimedia.org/r/893031 (https://phabricator.wikimedia.org/T330573)
[07:23:06] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200): /v2/suggest/title/{title}/{from}/{to} (Suggest a target title for the given source title and language pairs) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[07:24:38] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[07:26:54] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:27:40] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:27:47] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ArielGlenn) Wonderful, we have claimed them already :-) Thank you!
[07:29:56] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/title/{title}/{from}/{to} (Suggest a target title for the given source title and language pairs) is CRITICAL: Test Suggest a target title for the given source title and language pairs returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[07:31:38] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[07:33:01] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 200.6k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[07:44:23] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] hive: Fix max metaspace size of hiveserver2 to 512m [puppet] - 10https://gerrit.wikimedia.org/r/893029 (https://phabricator.wikimedia.org/T303168) (owner: 10Nicolas Fraison)
[07:47:38] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Productionize db2187 [puppet] - 10https://gerrit.wikimedia.org/r/893390 (https://phabricator.wikimedia.org/T326596)
[07:48:02] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2187 [puppet] - 10https://gerrit.wikimedia.org/r/893390 (https://phabricator.wikimedia.org/T326596) (owner: 10Marostegui)
[07:49:37] <wikibugs>	 (03PS1) 10Marostegui: site.pp: Remove db2187 as insetup [puppet] - 10https://gerrit.wikimedia.org/r/893391 (https://phabricator.wikimedia.org/T326596)
[07:50:06] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ores: change monitoring for the service [puppet] - 10https://gerrit.wikimedia.org/r/893008 (owner: 10Elukey)
[07:50:25] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] site.pp: Remove db2187 as insetup [puppet] - 10https://gerrit.wikimedia.org/r/893391 (https://phabricator.wikimedia.org/T326596) (owner: 10Marostegui)
[07:51:22] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:51:40] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:55:24] <wikibugs>	 (03PS1) 10Elukey: role::etcd::v3::ml_etcd: prepare eqiad cluster for reimage/boostrap [puppet] - 10https://gerrit.wikimedia.org/r/893392 (https://phabricator.wikimedia.org/T330758)
[07:57:01] <wikibugs>	 (03CR) 10Nicolas Fraison: [V: 03+1 C: 03+2] hive: Fix max metaspace size of hiveserver2 to 512m [puppet] - 10https://gerrit.wikimedia.org/r/893029 (https://phabricator.wikimedia.org/T303168) (owner: 10Nicolas Fraison)
[07:57:16] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39877/console" [puppet] - 10https://gerrit.wikimedia.org/r/893392 (https://phabricator.wikimedia.org/T330758) (owner: 10Elukey)
[07:58:58] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] role::etcd::v3::ml_etcd: prepare eqiad cluster for reimage/boostrap [puppet] - 10https://gerrit.wikimedia.org/r/893392 (https://phabricator.wikimedia.org/T330758) (owner: 10Elukey)
[08:00:05] <jouncebot>	 Amir1 and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230301T0800).
[08:00:05] <jouncebot>	 aharoni: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:02:04] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49708 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:02:22] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8646 bytes in 0.326 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:05:46] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2067 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:10:42] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 13 hosts with reason: T330758
[08:10:47] <stashbot>	 T330758: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758
[08:10:52] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 13 hosts with reason: T330758
[08:11:31] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-etcd1003.eqiad.wmnet with OS bullseye
[08:11:45] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-etcd1002.eqiad.wmnet with OS bullseye
[08:11:53] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-etcd1001.eqiad.wmnet with OS bullseye
[08:14:56] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2184.codfw.wmnet with reason: 10.6 recovery
[08:15:09] <logmsgbot>	 !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2184.codfw.wmnet with reason: 10.6 recovery
[08:15:26] <icinga-wm>	 PROBLEM - Disk space on ms-be2067 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdz1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2067&var-datasource=codfw+prometheus/ops
[08:16:19] <wikibugs>	 (03PS1) 10Marostegui: check_private_data_report: Remove db2094 [puppet] - 10https://gerrit.wikimedia.org/r/893395 (https://phabricator.wikimedia.org/T330828)
[08:16:49] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Install pbuilder hook for ICU67 component [puppet] - 10https://gerrit.wikimedia.org/r/893014 (https://phabricator.wikimedia.org/T329491) (owner: 10Muehlenhoff)
[08:16:54] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] check_private_data_report: Remove db2094 [puppet] - 10https://gerrit.wikimedia.org/r/893395 (https://phabricator.wikimedia.org/T330828) (owner: 10Marostegui)
[08:16:57] <wikibugs>	 (03PS1) 10Ayounsi: Add Jameel to ops and users in data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/893396
[08:19:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, the addition to the ops group needs on-patch or on-tasj approval by Joanna, though." [puppet] - 10https://gerrit.wikimedia.org/r/893396 (owner: 10Ayounsi)
[08:21:25] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-etcd1002.eqiad.wmnet with reason: host reimage
[08:21:28] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-etcd1003.eqiad.wmnet with reason: host reimage
[08:21:29] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-etcd1001.eqiad.wmnet with reason: host reimage
[08:24:00] <icinga-wm>	 PROBLEM - Check systemd state on cp5023 is CRITICAL: CRITICAL - degraded: The following units failed: varnishmtail@internal.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:24:07] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-etcd1002.eqiad.wmnet with reason: host reimage
[08:24:48] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:25:20] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[08:25:22] <icinga-wm>	 RECOVERY - Check systemd state on cp5023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:26:07] <jynus>	 !log stopping db2184 for testing mariadb 10.6 recovery workflow T319383
[08:26:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:12] <stashbot>	 T319383: Mydumper incompatibility with MariaDB 10.6 (was: Logical recoveries (myloader) to db2098:s7 are failing with "Lock wait timeout exceeded; try restarting transaction") - https://phabricator.wikimedia.org/T319383
[08:26:22] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 482 days) https://wikitech.wikimedia.org/wiki/Logs
[08:26:33] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-etcd1003.eqiad.wmnet with reason: host reimage
[08:27:48] <aharoni>	 urbanecm, Amir1, sorry, couldn't connect earlier. Is it still possible to do the backport of those namespace patches?
[08:28:52] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-etcd1001.eqiad.wmnet with reason: host reimage
[08:29:48] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:31:29] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove access for echetty [puppet] - 10https://gerrit.wikimedia.org/r/893397
[08:32:43] <wikibugs>	 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T330218 (10ayounsi) If there is an existing open task it will append to it. Here it was a coincidence that it stopped seeing the VCP issue at the same time as started to see the db2099 issue.
[08:33:09] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove access for echetty [puppet] - 10https://gerrit.wikimedia.org/r/893397 (owner: 10Muehlenhoff)
[08:34:19] <logmsgbot>	 !log root@cumin2002 START - Cookbook sre.idm.logout Logging Emil Chetty out of all services on: 1110 hosts
[08:35:07] <logmsgbot>	 !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Emil Chetty out of all services on: 1110 hosts
[08:36:19] <logmsgbot>	 !log root@cumin2002 START - Cookbook sre.idm.logout Logging Emil Chetty out of all services on: 918 hosts
[08:37:43] <logmsgbot>	 !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Emil Chetty out of all services on: 918 hosts
[08:40:18] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ml-etcd1001.eqiad.wmnet with OS bullseye
[08:41:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference_30443: Servers ml-serve1004.eqiad.wmnet, ml-serve1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:41:36] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ml-etcd1002.eqiad.wmnet with OS bullseye
[08:41:36] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference_30443: Servers ml-serve1004.eqiad.wmnet, ml-serve1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:41:46] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ml-etcd1003.eqiad.wmnet with OS bullseye
[08:42:49] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: Upgrade to k8s 1.23
[08:43:01] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 240.1k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[08:45:01] <wikibugs>	 (CR) Elukey: [V: +1 C: +2] role::ml_k8s::{master,worker}: update ml-serve-eqiad to k8s 1.23 [puppet] - https://gerrit.wikimedia.org/r/892995 (https://phabricator.wikimedia.org/T330758) (owner: Elukey)
[08:45:24] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-serve-ctrl1001.eqiad.wmnet with OS bullseye
[08:51:04] <moritzm>	 !log upgrade mw/eqiad to PHP 1:7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u2 T330270
[08:51:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:55] <wikibugs>	 (PS8) Vgutierrez: acme_chief: support several passive hosts [puppet] - https://gerrit.wikimedia.org/r/888652 (https://phabricator.wikimedia.org/T321309)
[08:53:52] <wikibugs>	 (CR) Vgutierrez: acme_chief: support several passive hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/888652 (https://phabricator.wikimedia.org/T321309) (owner: Vgutierrez)
[08:56:29] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve-ctrl1001.eqiad.wmnet with reason: host reimage
[08:57:40] <wikibugs>	 (CR) Hashar: "CI fails cause the code relies on the `semver-cli` command ( https://github.com/davidrjonas/semver-cli ) which is introduced by https://ge" [deployment-charts] - https://gerrit.wikimedia.org/r/893075 (https://phabricator.wikimedia.org/T320554) (owner: JHathaway)
[08:58:53] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve-ctrl1001.eqiad.wmnet with reason: host reimage
[09:00:05] <jouncebot>	 jnuche and hashar: #bothumor I ♥ Unicode. All rise for MediaWiki train - Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230301T0900).
[09:01:06] <hashar>	 o/
[09:03:08] <wikibugs>	 (PS1) Jcrespo: dbbackups: Implement myloader override in all hosts where it is installed [puppet] - https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383)
[09:05:08] <wikibugs>	 (CR) Filippo Giunchedi: [V: +2 C: +2] "Thank you for the reviews!" [debs/pint] - https://gerrit.wikimedia.org/r/892992 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[09:05:38] <wikibugs>	 (CR) CI reject: [V: -1] dbbackups: Implement myloader override in all hosts where it is installed [puppet] - https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: Jcrespo)
[09:06:09] <jnuche>	 hi, will deploy in 5 mins
[09:07:22] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops-collab, CAS-SSO, and 3 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (jbond) >  @jbond could you have a look at this anytime soon? @demon from my side the change is very minimal, just let me know if you wo...
[09:13:24] <wikibugs>	 (PS1) TrainBranchBot: group1 wikis to 1.40.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/893401 (https://phabricator.wikimedia.org/T325588)
[09:13:27] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group1 wikis to 1.40.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/893401 (https://phabricator.wikimedia.org/T325588) (owner: TrainBranchBot)
[09:14:07] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.40.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/893401 (https://phabricator.wikimedia.org/T325588) (owner: TrainBranchBot)
[09:15:04] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ml-serve-ctrl1001.eqiad.wmnet with OS bullseye
[09:15:28] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-serve-ctrl1002.eqiad.wmnet with OS bullseye
[09:16:57] <wikibugs>	 (CR) Jcrespo: "Moritz- question- is there a way to mark a file to not check its license? I don't want to have a license on the production file itself, bu" [puppet] - https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: Jcrespo)
[09:22:01] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39878/console" [puppet] - https://gerrit.wikimedia.org/r/892965 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[09:23:01] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 208.9k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[09:23:06] <logmsgbot>	 !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.25  refs T325588
[09:23:11] <stashbot>	 T325588: 1.40.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T325588
[09:24:56] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1 C: +2] prometheus: refactor blackbox configuration [puppet] - https://gerrit.wikimedia.org/r/892965 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[09:26:08] <icinga-wm>	 RECOVERY - Check systemd state on dumpsdata1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:26:16] <wikibugs>	 (PS1) Muehlenhoff: Add SPDX exception for myloader_defaults_override.cnf [puppet] - https://gerrit.wikimedia.org/r/893405
[09:26:34] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve-ctrl1002.eqiad.wmnet with reason: host reimage
[09:27:03] <wikibugs>	 (CR) Muehlenhoff: dbbackups: Implement myloader override in all hosts where it is installed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: Jcrespo)
[09:30:55] <logmsgbot>	 !log jnuche@deploy2002 Synchronized php: group1 wikis to 1.40.0-wmf.25  refs T325588 (duration: 07m 48s)
[09:31:00] <stashbot>	 T325588: 1.40.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T325588
[09:31:08] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve-ctrl1002.eqiad.wmnet with reason: host reimage
[09:31:34] <logmsgbot>	 !log oblivian@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=appservers-ro
[09:31:54] <wikibugs>	 (CR) Jcrespo: dbbackups: Implement myloader override in all hosts where it is installed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: Jcrespo)
[09:33:08] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/892587 (owner: Dzahn)
[09:33:39] <wikibugs>	 (CR) Jcrespo: dbbackups: Implement myloader override in all hosts where it is installed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: Jcrespo)
[09:34:10] <icinga-wm>	 PROBLEM - Check systemd state on mw1428 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:35:42] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39879/console" [puppet] - https://gerrit.wikimedia.org/r/892987 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[09:37:44] <wikibugs>	 (CR) Muehlenhoff: dbbackups: Implement myloader override in all hosts where it is installed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: Jcrespo)
[09:38:53] <moritzm>	 !log installing tiff security updates
[09:38:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:35] <logmsgbot>	 !log oblivian@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=appservers-ro,name=eqiad
[09:40:48] <wikibugs>	 (PS2) Jcrespo: dbbackups: Implement myloader override in all hosts where it is installed [puppet] - https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383)
[09:41:10] <wikibugs>	 (CR) CI reject: [V: -1] dbbackups: Implement myloader override in all hosts where it is installed [puppet] - https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: Jcrespo)
[09:41:49] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 8309
[09:42:44] <wikibugs>	 (CR) Vgutierrez: [C: +2] acme_chief: support several passive hosts [puppet] - https://gerrit.wikimedia.org/r/888652 (https://phabricator.wikimedia.org/T321309) (owner: Vgutierrez)
[09:42:48] <wikibugs>	 (CR) Jcrespo: dbbackups: Implement myloader override in all hosts where it is installed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: Jcrespo)
[09:43:43] <wikibugs>	 (PS3) Jcrespo: dbbackups: Implement myloader override in all hosts where it is installed [puppet] - https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383)
[09:44:04] <wikibugs>	 (CR) CI reject: [V: -1] dbbackups: Implement myloader override in all hosts where it is installed [puppet] - https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: Jcrespo)
[09:44:50] <wikibugs>	 (PS2) Filippo Giunchedi: prometheus: add pint support [puppet] - https://gerrit.wikimedia.org/r/892986 (https://phabricator.wikimedia.org/T309182)
[09:44:52] <wikibugs>	 (PS2) Filippo Giunchedi: prometheus: add pint source to ops [puppet] - https://gerrit.wikimedia.org/r/892987 (https://phabricator.wikimedia.org/T309182)
[09:46:03] <wikibugs>	 (CR) Jcrespo: "This was not an arbitrary request- it was important for me that this workaround was as simple in production as possible (plus myloader tre" [puppet] - https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: Jcrespo)
[09:46:17] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: add pint support [puppet] - https://gerrit.wikimedia.org/r/892986 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[09:46:26] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:47:21] <wikibugs>	 (PS4) Jcrespo: dbbackups: Implement myloader override in all hosts where it is installed [puppet] - https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383)
[09:47:24] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ml-serve-ctrl1002.eqiad.wmnet with OS bullseye
[09:48:13] <wikibugs>	 (Abandoned) Jcrespo: Add SPDX exception for myloader_defaults_override.cnf [puppet] - https://gerrit.wikimedia.org/r/893405 (owner: Muehlenhoff)
[09:51:36] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 8309
[09:52:07] <wikibugs>	 (PS34) Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926)
[09:52:46] <wikibugs>	 (CR) Jcrespo: "https://puppet-compiler.wmflabs.org/output/893400/39880/dbprov1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: Jcrespo)
[09:54:50] <wikibugs>	 (PS3) Filippo Giunchedi: prometheus: add pint source to ops [puppet] - https://gerrit.wikimedia.org/r/892987 (https://phabricator.wikimedia.org/T309182)
[09:54:52] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: update pint listen port [puppet] - https://gerrit.wikimedia.org/r/893406 (https://phabricator.wikimedia.org/T309182)
[09:54:54] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: add pint source for k8s [puppet] - https://gerrit.wikimedia.org/r/893407 (https://phabricator.wikimedia.org/T309182)
[09:55:26] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:57:17] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1001.eqiad.wmnet with OS bullseye
[09:57:56] <marostegui>	 !log Stop db1117:3325 and db1176 T329478
[09:57:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:00] <stashbot>	 T329478: Move db1176 to m5 - https://phabricator.wikimedia.org/T329478
[09:58:49] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1002.eqiad.wmnet with OS bullseye
[09:59:07] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: update pint listen port [puppet] - https://gerrit.wikimedia.org/r/893406 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[09:59:13] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: add pint source to ops [puppet] - https://gerrit.wikimedia.org/r/892987 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[09:59:13] <marostegui>	 There will be haproxy irc alerts for the above operation on db1117
[09:59:18] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1003.eqiad.wmnet with OS bullseye
[09:59:51] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1004.eqiad.wmnet with OS bullseye
[10:00:32] <wikibugs>	 (CR) Btullis: Add a spark-operator chart and helmfile configuration (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: Btullis)
[10:01:43] <wikibugs>	 (PS1) Marostegui: mariadb: Move db1176 to m5 [puppet] - https://gerrit.wikimedia.org/r/893408 (https://phabricator.wikimedia.org/T329478)
[10:02:11] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudcephosd1010.eqiad.wmnet
[10:02:23] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Move db1176 to m5 [puppet] - https://gerrit.wikimedia.org/r/893408 (https://phabricator.wikimedia.org/T329478) (owner: Marostegui)
[10:02:50] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[10:03:01] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1005.eqiad.wmnet with OS bullseye
[10:03:06] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[10:04:24] <wikibugs>	 (CR) Jcrespo: [C: -1] "Actually, this doesn't work, we still get lock with this. Retrying with a different syntax." [puppet] - https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: Jcrespo)
[10:04:40] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[10:04:56] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[10:06:25] <wikibugs>	 (PS5) Hashar: contint: regroup common firewalling rules [puppet] - https://gerrit.wikimedia.org/r/887738 (https://phabricator.wikimedia.org/T329056)
[10:06:31] <wikibugs>	 (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/887738 (https://phabricator.wikimedia.org/T329056) (owner: Hashar)
[10:06:50] <wikibugs>	 (PS4) Volans: sre.hosts.reimage: clear DHCP cache for row E/F [cookbooks] - https://gerrit.wikimedia.org/r/892487
[10:07:13] <wikibugs>	 (CR) Volans: [C: +2] sre.hosts.reimage: clear DHCP cache for row E/F [cookbooks] - https://gerrit.wikimedia.org/r/892487 (owner: Volans)
[10:08:12] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:08:24] <icinga-wm>	 PROBLEM - BGP status on lsw1-e2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:09:06] <wikibugs>	 (Merged) jenkins-bot: sre.hosts.reimage: clear DHCP cache for row E/F [cookbooks] - https://gerrit.wikimedia.org/r/892487 (owner: Volans)
[10:09:44] <wikibugs>	 (CR) Jcrespo: [C: -2] "Giving up because this doesn't work at all. (also tested alternative syntax [myloader]" [puppet] - https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: Jcrespo)
[10:11:50] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: host reimage
[10:11:52] <wikibugs>	 (Abandoned) Jcrespo: dbbackups: Implement myloader override in all hosts where it is installed [puppet] - https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: Jcrespo)
[10:13:19] <wikibugs>	 (CR) Hashar: "Compiler result https://puppet-compiler.wmflabs.org/output/887738/1642/" [puppet] - https://gerrit.wikimedia.org/r/887738 (https://phabricator.wikimedia.org/T329056) (owner: Hashar)
[10:13:27] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1002.eqiad.wmnet with reason: host reimage
[10:13:40] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:13:49] <wikibugs>	 (CR) Hashar: [C: +1] Revert "contint: remove obsolete firewall rules from labs" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/887363 (https://phabricator.wikimedia.org/T114209) (owner: Hashar)
[10:13:59] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1003.eqiad.wmnet with reason: host reimage
[10:14:17] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: host reimage
[10:14:29] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1004.eqiad.wmnet with reason: host reimage
[10:16:51] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ml-serve1004.eqiad.wmnet with reason: host reimage
[10:17:19] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1002.eqiad.wmnet with reason: host reimage
[10:17:48] <wikibugs>	 (PS1) Jbond: apt: swap active and failover apt servers [puppet] - https://gerrit.wikimedia.org/r/893409
[10:18:04] <wikibugs>	 (CR) Jobo: [C: +2] Add Jameel to ops and users in data.yaml [puppet] - https://gerrit.wikimedia.org/r/893396 (owner: Ayounsi)
[10:18:13] <wikibugs>	 (PS1) Majavah: P:acme_chief::cloud: support multiple passives [puppet] - https://gerrit.wikimedia.org/r/893410
[10:19:14] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39881/console" [puppet] - https://gerrit.wikimedia.org/r/893409 (owner: Jbond)
[10:19:23] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1003.eqiad.wmnet with reason: host reimage
[10:19:50] <wikibugs>	 (CR) Vgutierrez: [C: +1] P:acme_chief::cloud: support multiple passives [puppet] - https://gerrit.wikimedia.org/r/893410 (owner: Majavah)
[10:21:25] <wikibugs>	 (CR) Muehlenhoff: "Or we just move back DNS? This will probably only cause confusion and apt.w.o is pretty unrelated to the wider DC switchover? (like idp.w." [puppet] - https://gerrit.wikimedia.org/r/893409 (owner: Jbond)
[10:22:25] <wikibugs>	 (PS2) Jbond: apt: swap active and failover apt servers [puppet] - https://gerrit.wikimedia.org/r/893409 (https://phabricator.wikimedia.org/T328907)
[10:25:36] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.dns.netbox
[10:26:34] <wikibugs>	 (CR) Vgutierrez: [C: +2] P:acme_chief::cloud: support multiple passives [puppet] - https://gerrit.wikimedia.org/r/893410 (owner: Majavah)
[10:28:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:28:59] <wikibugs>	 (CR) Muehlenhoff: [C: +1] apt: swap active and failover apt servers [puppet] - https://gerrit.wikimedia.org/r/893409 (https://phabricator.wikimedia.org/T328907) (owner: Jbond)
[10:29:22] <icinga-wm>	 PROBLEM - Host ml-serve1004 is DOWN: PING CRITICAL - Packet loss = 100%
[10:29:34] <icinga-wm>	 PROBLEM - Host an-worker1132 is DOWN: PING CRITICAL - Packet loss = 100%
[10:29:38] <wikibugs>	 (CR) Clément Goubert: apt: swap active and failover apt servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/893409 (https://phabricator.wikimedia.org/T328907) (owner: Jbond)
[10:30:08] <icinga-wm>	 RECOVERY - Host ml-serve1004 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms
[10:30:55] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1001.eqiad.wmnet with OS bullseye
[10:32:35] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1005.eqiad.wmnet with reason: host reimage
[10:33:23] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1004.eqiad.wmnet with OS bullseye
[10:33:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:35:10] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1005.eqiad.wmnet with reason: host reimage
[10:35:27] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1002.eqiad.wmnet with OS bullseye
[10:37:13] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1003.eqiad.wmnet with OS bullseye
[10:39:54] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[10:41:18] <wikibugs>	 (PS1) Hashar: contint: Jenkins master > controller [puppet] - https://gerrit.wikimedia.org/r/893412 (https://phabricator.wikimedia.org/T254646)
[10:42:48] <wikibugs>	 Puppet, Infrastructure-Foundations, Packaging: apt: improve apt failover ochastration - https://phabricator.wikimedia.org/T330849 (jbond) p:Triage→Medium
[10:43:27] <wikibugs>	 (CR) Muehlenhoff: [C: +1] apt: swap active and failover apt servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/893409 (https://phabricator.wikimedia.org/T328907) (owner: Jbond)
[10:43:57] <wikibugs>	 (CR) Klausman: [C: +1] admin_ng: upgrade ml-serve-eqiad to k8s 1.23 [deployment-charts] - https://gerrit.wikimedia.org/r/892996 (https://phabricator.wikimedia.org/T330758) (owner: Elukey)
[10:44:04] <wikibugs>	 (CR) JMeybohm: Add a spark-operator chart and helmfile configuration (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: Btullis)
[10:44:44] <wikibugs>	 (CR) Elukey: [C: +2] admin_ng: upgrade ml-serve-eqiad to k8s 1.23 [deployment-charts] - https://gerrit.wikimedia.org/r/892996 (https://phabricator.wikimedia.org/T330758) (owner: Elukey)
[10:47:57] <wikibugs>	 SRE, DBA, Toolhub, Wikimedia-Mailing-lists, cloud-services-team: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (Marostegui)
[10:49:53] <wikibugs>	 (Merged) jenkins-bot: admin_ng: upgrade ml-serve-eqiad to k8s 1.23 [deployment-charts] - https://gerrit.wikimedia.org/r/892996 (https://phabricator.wikimedia.org/T330758) (owner: Elukey)
[10:50:57] <wikibugs>	 (CR) Jbond: [C: +2] apt: swap active and failover apt servers [puppet] - https://gerrit.wikimedia.org/r/893409 (https://phabricator.wikimedia.org/T328907) (owner: Jbond)
[10:54:56] <wikibugs>	 (PS35) Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926)
[10:55:31] <wikibugs>	 (PS2) Ayounsi: Add Jameel to ops and users in data.yaml [puppet] - https://gerrit.wikimedia.org/r/893396
[10:56:59] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephosd1010.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dcaro@cumin1001"
[10:57:32] <moritzm>	 !log upgrade cloudweb to PHP 1:7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u2 T330270
[10:57:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:48] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[10:58:51] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[10:59:05] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[10:59:07] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[10:59:23] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[10:59:29] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[10:59:43] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[10:59:50] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[10:59:55] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230301T1100)
[11:00:33] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[11:01:11] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[11:01:23] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[11:01:31] <wikibugs>	 (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/893412 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[11:01:33] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[11:01:37] <icinga-wm>	 RECOVERY - Host an-worker1132 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[11:01:43] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:01:45] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[11:02:29] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[11:02:32] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[11:02:41] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[11:03:06] <wikibugs>	 (PS36) Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926)
[11:03:09] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[11:03:16] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[11:03:22] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[11:03:27] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[11:03:40] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[11:03:52] <stashbot>	 elukey@deploy2002: Failed to log message to wiki. Somebody should check the error logs.
[11:04:00] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1005.eqiad.wmnet with OS bullseye
[11:04:12] <taavi>	 hm, why did that fail?
[11:04:31] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1006.eqiad.wmnet with OS bullseye
[11:04:52] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1007.eqiad.wmnet with OS bullseye
[11:05:07] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1008.eqiad.wmnet with OS bullseye
[11:05:47] <icinga-wm>	 ACKNOWLEDGEMENT - SSH on an-worker1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Nicolas Fraison Reboot https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:07:14] <logmsgbot>	 !log dcaro@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephosd1010.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dcaro@cumin1001"
[11:07:15] <logmsgbot>	 !log dcaro@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:07:16] <logmsgbot>	 !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudcephosd1010.eqiad.wmnet
[11:07:29] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[11:07:32] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[11:07:45] <wikibugs>	 (CR) Btullis: Add a spark-operator chart and helmfile configuration (6 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: Btullis)
[11:08:11] <icinga-wm>	 PROBLEM - BGP status on lsw1-f3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:08:41] <icinga-wm>	 PROBLEM - BGP status on lsw1-f2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:08:58] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[11:09:01] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[11:12:23] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[11:12:45] <claime>	 Checking SAL logging
[11:14:06] <wikibugs>	 (PS1) Elukey: kserve: add replicas setting for Deployment [deployment-charts] - https://gerrit.wikimedia.org/r/893417 (https://phabricator.wikimedia.org/T324542)
[11:15:08] <claime>	 taavi: Very weird, everything got logged except the PASS 
[11:15:26] <taavi>	 yeah, maybe a temporary fail
[11:16:20] <wikibugs>	 (CR) Klausman: kserve: add replicas setting for Deployment (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/893417 (https://phabricator.wikimedia.org/T324542) (owner: Elukey)
[11:17:25] <wikibugs>	 (CR) Elukey: kserve: add replicas setting for Deployment (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/893417 (https://phabricator.wikimedia.org/T324542) (owner: Elukey)
[11:20:33] <icinga-wm>	 RECOVERY - BGP status on lsw1-e2-eqiad.mgmt is OK: BGP OK - up: 4, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:20:43] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: Drop RBAC rules for deprecated resources [puppet] - https://gerrit.wikimedia.org/r/889836 (https://phabricator.wikimedia.org/T329869) (owner: Majavah)
[11:23:29] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm ping me to merge" [puppet] - https://gerrit.wikimedia.org/r/893412 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[11:27:21] <wikibugs>	 (PS1) Marostegui: check_private_data_report: Add db2187 [puppet] - https://gerrit.wikimedia.org/r/893420 (https://phabricator.wikimedia.org/T326596)
[11:27:45] <wikibugs>	 (CR) Marostegui: [C: +2] check_private_data_report: Add db2187 [puppet] - https://gerrit.wikimedia.org/r/893420 (https://phabricator.wikimedia.org/T326596) (owner: Marostegui)
[11:27:50] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] P:toolforge: use api gateway for jobs cli [puppet] - https://gerrit.wikimedia.org/r/892370 (https://phabricator.wikimedia.org/T329443) (owner: Majavah)
[11:28:20] <wikibugs>	 Puppet, Infrastructure-Foundations, Packaging: apt: improve apt failover ochastration - https://phabricator.wikimedia.org/T330849 (jbond)
[11:28:43] <icinga-wm>	 PROBLEM - BGP status on lsw1-e3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:33:36] <wikibugs>	 (PS1) Jbond: puppet::agent: Add external facts directory [puppet] - https://gerrit.wikimedia.org/r/893421
[11:33:57] <wikibugs>	 (CR) Jbond: [C: +2] puppet::agent: Add external facts directory [puppet] - https://gerrit.wikimedia.org/r/893421 (owner: Jbond)
[11:34:54] <jbond>	 arturo: happy for me to merge your change
[11:35:25] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[11:36:20] <arturo>	 jbond: sorry, please go!
[11:37:28] <jbond>	 done
[11:38:49] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] "-1ing to avoid accidental merge until the dependent restbase change gets merged" [deployment-charts] - https://gerrit.wikimedia.org/r/890357 (owner: PipelineBot)
[11:39:11] <wikibugs>	 (CR) Marostegui: dbbackups: Implement myloader override in all hosts where it is installed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: Jcrespo)
[11:40:10] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] "Noting also that this release bumps mathoid to node16 (see https://gerrit.wikimedia.org/r/c/mediawiki/services/mathoid/+/866666)" [deployment-charts] - https://gerrit.wikimedia.org/r/890357 (owner: PipelineBot)
[11:40:23] <icinga-wm>	 RECOVERY - Check systemd state on mw1428 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:05] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:42:49] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:42:57] <wikibugs>	 (PS3) Hnowlan: helmfile: add device-analytics configuration [deployment-charts] - https://gerrit.wikimedia.org/r/886358 (https://phabricator.wikimedia.org/T320967)
[11:46:16] <wikibugs>	 (CR) CI reject: [V: -1] helmfile: add device-analytics configuration [deployment-charts] - https://gerrit.wikimedia.org/r/886358 (https://phabricator.wikimedia.org/T320967) (owner: Hnowlan)
[11:49:48] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:51:58] <wikibugs>	 (CR) Klausman: [C: +1] kserve: add replicas setting for Deployment (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/893417 (https://phabricator.wikimedia.org/T324542) (owner: Elukey)
[11:54:48] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:58:05] <wikibugs>	 (PS3) Hnowlan: service, k8s: add service configuration for AQS2 service device-analytics [puppet] - https://gerrit.wikimedia.org/r/889960 (https://phabricator.wikimedia.org/T320967)
[11:58:47] <moritzm>	 !log upgrade parse/eqiad to PHP 1:7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u2 T330270
[11:58:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:54] <wikibugs>	 (CR) Btullis: "This is the SparkApplication that I have been using to test this chart. Note the use of the `spark-driver` serviceAccount." [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: Btullis)
[12:00:55] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[12:07:26] <wikibugs>	 (PS1) Vgutierrez: acme_chief: Enforce passive_hosts as a list of FQDN [puppet] - https://gerrit.wikimedia.org/r/893425 (https://phabricator.wikimedia.org/T321309)
[12:10:34] <wikibugs>	 (PS2) Filippo Giunchedi: prometheus: add pint source for k8s [puppet] - https://gerrit.wikimedia.org/r/893407 (https://phabricator.wikimedia.org/T309182)
[12:10:36] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: scrape pint [puppet] - https://gerrit.wikimedia.org/r/893466 (https://phabricator.wikimedia.org/T309182)
[12:11:02] <wikibugs>	 (CR) CI reject: [V: -1] prometheus: scrape pint [puppet] - https://gerrit.wikimedia.org/r/893466 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[12:11:41] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39882/console" [puppet] - https://gerrit.wikimedia.org/r/893425 (https://phabricator.wikimedia.org/T321309) (owner: Vgutierrez)
[12:17:37] <wikibugs>	 (PS37) Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926)
[12:18:12] <wikibugs>	 (PS2) Hnowlan: Add service records for device-analytics. [dns] - https://gerrit.wikimedia.org/r/890398 (https://phabricator.wikimedia.org/T320967)
[12:19:43] <wikibugs>	 (CR) CI reject: [V: -1] Add service records for device-analytics. [dns] - https://gerrit.wikimedia.org/r/890398 (https://phabricator.wikimedia.org/T320967) (owner: Hnowlan)
[12:20:59] <wikibugs>	 (CR) Ladsgroup: [C: +1] build: Change diffConfig to use git-stash instead of git-add [mediawiki-config] - https://gerrit.wikimedia.org/r/893088 (owner: Krinkle)
[12:22:17] <wikibugs>	 (PS3) Hnowlan: Add service records for device-analytics. [dns] - https://gerrit.wikimedia.org/r/890398 (https://phabricator.wikimedia.org/T320967)
[12:23:26] <wikibugs>	 (PS2) Filippo Giunchedi: prometheus: scrape pint [puppet] - https://gerrit.wikimedia.org/r/893466 (https://phabricator.wikimedia.org/T309182)
[12:23:28] <wikibugs>	 (PS3) Filippo Giunchedi: prometheus: add pint source for k8s [puppet] - https://gerrit.wikimedia.org/r/893407 (https://phabricator.wikimedia.org/T309182)
[12:25:36] <wikibugs>	 (PS2) Ladsgroup: Remove config for former Rdbms logging channels [mediawiki-config] - https://gerrit.wikimedia.org/r/893084 (https://phabricator.wikimedia.org/T320873) (owner: Krinkle)
[12:26:57] <wikibugs>	 (CR) Btullis: Add a spark-operator chart and helmfile configuration (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: Btullis)
[12:28:05] <moritzm>	 !log upgrade mwmaint to PHP 1:7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u2 T330270
[12:28:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:06] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39884/console" [puppet] - https://gerrit.wikimedia.org/r/893466 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[12:36:51] <wikibugs>	 (CR) Ottomata: Add a spark-operator chart and helmfile configuration (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: Btullis)
[12:38:21] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM. Please @andrew double check." [puppet] - https://gerrit.wikimedia.org/r/892944 (owner: Majavah)
[12:38:26] <wikibugs>	 (PS1) Jbond: profile::confd: add a confd profile [puppet] - https://gerrit.wikimedia.org/r/893468 (https://phabricator.wikimedia.org/T330849)
[12:39:42] <wikibugs>	 (PS1) Marostegui: db2183: Migrate to 10.6 [puppet] - https://gerrit.wikimedia.org/r/893469 (https://phabricator.wikimedia.org/T330861)
[12:40:03] <wikibugs>	 (PS2) Jbond: profile::confd: add a confd profile [puppet] - https://gerrit.wikimedia.org/r/893468 (https://phabricator.wikimedia.org/T330849)
[12:40:14] <marostegui>	 !log Upgrade db2183 to 10.6 T330861
[12:40:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:20] <stashbot>	 T330861: Migrate backup1-* masters to MariaDB 10.6 - https://phabricator.wikimedia.org/T330861
[12:40:34] <wikibugs>	 (CR) Marostegui: [C: +2] db2183: Migrate to 10.6 [puppet] - https://gerrit.wikimedia.org/r/893469 (https://phabricator.wikimedia.org/T330861) (owner: Marostegui)
[12:42:44] <wikibugs>	 (PS3) Jbond: profile::confd: add a confd profile [puppet] - https://gerrit.wikimedia.org/r/893468 (https://phabricator.wikimedia.org/T330849)
[12:46:07] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39887/console" [puppet] - https://gerrit.wikimedia.org/r/893468 (https://phabricator.wikimedia.org/T330849) (owner: Jbond)
[12:51:04] <wikibugs>	 (PS1) Jbond: confd::file: drop relative prefix [puppet] - https://gerrit.wikimedia.org/r/893471 (https://phabricator.wikimedia.org/T330849)
[12:54:00] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39888/console" [puppet] - https://gerrit.wikimedia.org/r/893471 (https://phabricator.wikimedia.org/T330849) (owner: Jbond)
[12:54:40] <wikibugs>	 (CR) Jbond: [V: +1] "pcc shows a diff to core_resources but its only white-space" [puppet] - https://gerrit.wikimedia.org/r/893471 (https://phabricator.wikimedia.org/T330849) (owner: Jbond)
[12:54:43] <wikibugs>	 (PS1) Jaime Nuche: scap bootstrap: use new installation mechanism [puppet] - https://gerrit.wikimedia.org/r/893473 (https://phabricator.wikimedia.org/T329622)
[13:05:18] <wikibugs>	 (PS1) Majavah: P:toolforge::k8s::haproxy: drop standalone jobs ingress [puppet] - https://gerrit.wikimedia.org/r/893474 (https://phabricator.wikimedia.org/T329443)
[13:08:32] <wikibugs>	 (PS2) Krinkle: build: Change diffConfig to use git-stash instead of git-add [mediawiki-config] - https://gerrit.wikimedia.org/r/893088
[13:08:35] <wikibugs>	 (CR) Krinkle: [C: +2] build: Change diffConfig to use git-stash instead of git-add [mediawiki-config] - https://gerrit.wikimedia.org/r/893088 (owner: Krinkle)
[13:09:18] <wikibugs>	 (Merged) jenkins-bot: build: Change diffConfig to use git-stash instead of git-add [mediawiki-config] - https://gerrit.wikimedia.org/r/893088 (owner: Krinkle)
[13:09:35] <wikibugs>	 (PS3) Krinkle: Remove config for former Rdbms logging channels [mediawiki-config] - https://gerrit.wikimedia.org/r/893084 (https://phabricator.wikimedia.org/T320873)
[13:09:39] <wikibugs>	 (CR) Krinkle: [C: +2] Remove config for former Rdbms logging channels [mediawiki-config] - https://gerrit.wikimedia.org/r/893084 (https://phabricator.wikimedia.org/T320873) (owner: Krinkle)
[13:09:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-codfw
[13:10:05] <claime>	 !log Adding scheduled maintenance for switchover to statuspage - T327920
[13:10:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:09] <stashbot>	 T327920: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920
[13:10:19] <wikibugs>	 (Merged) jenkins-bot: Remove config for former Rdbms logging channels [mediawiki-config] - https://gerrit.wikimedia.org/r/893084 (https://phabricator.wikimedia.org/T320873) (owner: Krinkle)
[13:11:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-codfw
[13:11:38] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] P:toolforge::k8s::haproxy: drop standalone jobs ingress [puppet] - https://gerrit.wikimedia.org/r/893474 (https://phabricator.wikimedia.org/T329443) (owner: Majavah)
[13:11:49] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[13:11:50] <TheresNoTime>	 jouncebot: nowandnext
[13:11:50] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 48 minute(s)
[13:11:51] <jouncebot>	 In 0 hour(s) and 48 minute(s): Datacenter Switchover - Mediawiki (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230301T1400)
[13:12:10] <TheresNoTime>	 48 minutes to go, how exciting 
[13:12:24] <Krinkle>	 TheresNoTime: I'm pushing some minor config clean up patches a.t.m.
[13:12:32] <Krinkle>	 I can stop though
[13:13:21] <TheresNoTime>	 I was just curious how long there was until the switch, not my call on if you need to stop :D
[13:13:22] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[13:13:34] <Krinkle>	 k :)
[13:14:11] <wikibugs>	 (PS2) Krinkle: filebackend: Replace stringified class names with ::class [mediawiki-config] - https://gerrit.wikimedia.org/r/891962 (owner: Reedy)
[13:14:15] <wikibugs>	 (CR) Krinkle: [C: +2] filebackend: Replace stringified class names with ::class [mediawiki-config] - https://gerrit.wikimedia.org/r/891962 (owner: Reedy)
[13:14:28] * Krinkle testing on mwdebug2001
[13:14:57] <wikibugs>	 (Merged) jenkins-bot: filebackend: Replace stringified class names with ::class [mediawiki-config] - https://gerrit.wikimedia.org/r/891962 (owner: Reedy)
[13:15:06] <wikibugs>	 (PS3) Krinkle: filebackend: Opinionated reformatting [mediawiki-config] - https://gerrit.wikimedia.org/r/891964 (owner: Reedy)
[13:15:10] <wikibugs>	 (CR) Krinkle: [C: +2] filebackend: Opinionated reformatting [mediawiki-config] - https://gerrit.wikimedia.org/r/891964 (owner: Reedy)
[13:15:50] <wikibugs>	 (Merged) jenkins-bot: filebackend: Opinionated reformatting [mediawiki-config] - https://gerrit.wikimedia.org/r/891964 (owner: Reedy)
[13:16:37] <claime>	 Krinkle: I will be locking scap deployments at 1330UTC
[13:17:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-eqiad
[13:17:09] <wikibugs>	 Puppet, Infrastructure-Foundations, Packaging, Patch-For-Review: apt: improve apt failover ochastration - https://phabricator.wikimedia.org/T330849 (Volans) We should find a standard setup for those use cases, I can see Netbox having exactly the same issue/requirement (some puppet-driver resourc...
[13:17:50] <claime>	 I would ask that everybody refrain from running cookbooks or other starting at 1330UTC too
[13:18:01] <wikibugs>	 (CR) Majavah: "This could also help drop some local hacks from the deployment-prep puppetmaster." [puppet] - https://gerrit.wikimedia.org/r/893468 (https://phabricator.wikimedia.org/T330849) (owner: Jbond)
[13:18:06] <claime>	 (since I can't lock that down, I'm counting on y'all)
[13:18:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-eqiad
[13:18:32] <Krinkle>	 ack
[13:19:31] <logmsgbot>	 !log krinkle@deploy2002 Synchronized wmf-config/: Ie063fbf91d5b41e0 - Remove config for former Rdbms logging (duration: 07m 39s)
[13:19:46] <wikibugs>	 (CR) Majavah: kubeadm: update wmcs-k8s-get-cert for certificates/v1 [puppet] - https://gerrit.wikimedia.org/r/890502 (https://phabricator.wikimedia.org/T292238) (owner: Majavah)
[13:19:56] <volans>	 same goes for helm chart deploys, homer runs
[13:20:23] <claime>	 yes
[13:20:24] <moritzm>	 ack
[13:21:04] <volans>	 I wonder if also netbox changes, maybe a shoutout to dc-ops might be worth it
[13:21:15] <claime>	 yep. doing
[13:21:27] <wikibugs>	 (CR) Krinkle: [C: +2] "Tested by re-rendering https://en.wikipedia.org/wiki/ImageMagick and by purging thumbs of https://commons.wikimedia.org/wiki/File:ImageMag" [mediawiki-config] - https://gerrit.wikimedia.org/r/891964 (owner: Reedy)
[13:23:00] * Krinkle is done
[13:23:04] <claime>	 Thanks <3
[13:23:53] <Krinkle>	 err. it's taking a few more minutes to finish the sync actually, my bad.
[13:24:08] <Krinkle>	 I'm done testing the second change, should be done in ~5min
[13:24:15] <Krinkle>	 syncs take longer than they used to
[13:24:34] <claime>	 It's ok, I should have communicated that I wanted a larger berth for deployments here and not just in -sre
[13:25:01] <volans>	 it's ok Krinkle, this way we can use you as scapegoat if the need arises :-P
[13:28:03] <TheresNoTime>	 now *that's* planning ahead :>
[13:30:34] <claime>	 Holding until last deployment is done
[13:30:54] <logmsgbot>	 !log krinkle@deploy2002 Synchronized wmf-config/: I3beefbf4ee3d66 filebackend cleanup (duration: 07m 13s)
[13:31:02] <Krinkle>	 right on the clock
[13:31:04] <claime>	 !log Locking scap deployments for datacenter switchover - T327920
[13:31:05] * Krinkle is actually done
[13:31:06] <_joe_>	 :)
[13:31:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:09] <stashbot>	 T327920: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920
[13:31:31] <_joe_>	 action item: add scap locking to the switchdc cookbook
[13:32:19] <claime>	 _joe_: I added it to my checklist, which we'll use as base for improvements
[13:33:17] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops, Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (Marostegui) These hosts are correctly added to the partman recipe regex.
[13:34:10] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[13:37:35] <claime>	 I think we're all set, now we wait
[13:37:50] <hashar>	 jnuche: how did the train go this morning? ;)
[13:37:59] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.dns.netbox
[13:38:07] <_joe_>	 claime: you can start with step 0 whenever you want btw
[13:38:13] <claime>	 dcaro: wth
[13:39:15] <jnuche>	 hashar: went well, logged an already existing issue not related to  1.40.0-wmf.25
[13:39:18] <claime>	 Starting step 0, everybody good?
[13:39:18] <jnuche>	 other than that logs are quiet
[13:39:37] <_joe_>	 claime: +1
[13:39:46] <_joe_>	 you can skip the warmup ofc
[13:40:10] <claime>	 !log Starting mediawiki datacenter switchover step 0 - T327920
[13:40:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:15] <stashbot>	 T327920: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920
[13:40:16] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: moved cloudcephosd1015 to rack F4 - dcaro@cumin1001"
[13:40:28] <marostegui>	 dcaro: Please stop any changes, we are starting with the DC switch
[13:40:36] <volans>	 Executing cookbook sre.switchdc.mediawiki with args: ['eqiad', 'codfw'] claime +1 for ARGS :D
[13:40:55] <_joe_>	 +1 here too
[13:40:56] * claime deep breaths
[13:41:11] <_joe_>	 🤠 it
[13:41:13] <claime>	 Waiting on puppet.sync-netbox
[13:41:22] <logmsgbot>	 !log dcaro@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: moved cloudcephosd1015 to rack F4 - dcaro@cumin1001"
[13:41:22] <logmsgbot>	 !log dcaro@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:41:26] <claime>	 Let's go
[13:41:28] <dcaro>	 \facepalm, ack, will stop doing anything
[13:41:33] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet
[13:41:35] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0)
[13:41:36] <_joe_>	 dcaro: thanks :)
[13:41:41] <hashar>	 jnuche: excellent :-]
[13:41:45] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks
[13:41:54] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks (exit_code=0)
[13:41:57] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl
[13:42:02] <claime>	 Skipping warmup
[13:42:46] <claime>	 5 minutes mandatory wait for TTL change
[13:42:59] <_joe_>	 yes
[13:43:18] <claime>	 Will do a GO/NOGO check before disabling maintenance
[13:43:25] <_joe_>	 then the steps that need to happen in sequence start, so yeah
[13:43:27] <claime>	 And a final GO/NOGO before entering RO phase
[13:43:35] <marostegui>	 +1
[13:43:35] <akosiaris>	 OK
[13:43:37] <_joe_>	 ack
[13:43:54] <claime>	 In any case, I will not enter RO before 1400
[13:43:59] <volans>	 _joe_: do you recall why we don't just nuke the recursors' cache for those records instead of the sleep?
[13:44:04] <marostegui>	 claime: cool
[13:44:08] <volans>	 claime: +1
[13:44:10] <claime>	 volans: I vote tech debt
[13:44:19] <_joe_>	 volans: gives a nice breathing room before step 1
[13:44:21] <_joe_>	 :D
[13:44:24] <claime>	 But also yeah
[13:44:25] <volans>	 lol
[13:44:26] <claime>	 Breather
[13:44:43] <_joe_>	 we did discuss removing it, decided against it
[13:45:07] <volans>	 ck
[13:45:09] <volans>	 *ack
[13:45:11] <akosiaris>	 Getting my pot of tea ready
[13:45:19] <marostegui>	 akosiaris: No gyros?
[13:45:55] <akosiaris>	 Shrimp and Taramasalata actually today
[13:46:10] <akosiaris>	 I had gyros a couple of days ago though
[13:46:14] <marostegui>	 :(
[13:46:20] <wikibugs>	 (CR) Jaime Nuche: "Added patch to Puppet deployment window tomorrow Thursday, after train and DC switchover are complete: https://wikitech.wikimedia.org/wiki" [puppet] - https://gerrit.wikimedia.org/r/893473 (https://phabricator.wikimedia.org/T329622) (owner: Jaime Nuche)
[13:46:21] <claime>	 I had gyros yesterday
[13:46:25] <claime>	 Well, kebab
[13:46:27] <claime>	 same difference
[13:46:30] <claime>	 :p
[13:46:34] * _joe_ playing "Ain't no mountain high enough"
[13:46:59] <_joe_>	 (Marvin Gaye and Tammi Terrell, fyi)
[13:47:03] <taavi>	 recommended listening during the switchover: https://en.wikipedia.org/wiki/Listen_to_Wikipedia
[13:47:12] <_joe_>	 and that yes
[13:47:33] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0)
[13:47:38] <claime>	 TTLs set
[13:47:45] <claime>	 GO/NOGO maintenance stop
[13:48:16] <marostegui>	 go
[13:48:39] <_joe_>	 go
[13:48:52] <claime>	 Heads up Emperor jbond 
[13:49:09] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance
[13:49:21] <logmsgbot>	 !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=99)
[13:49:34] <claime>	 great
[13:49:44] <claime>	 ----- OUTPUT of 'systemctl list-u...t 255 || exit 0'' -----
[13:49:46] <_joe_>	 claime: do not despair
[13:49:46] <claime>	 static 
[13:49:51] <volans>	 node=mwmaint1002.eqiad.wmnet, rc=124, command='systemctl list-units 'mediawiki_job_*' --no-legend | awk '{print $1}' | xargs -n 1 sh -c 'systemctl is-enabled $0 && exit 255 || exit 0''
[13:50:14] <_joe_>	 ah we have some failed units
[13:50:15] <claime>	 f-
[13:50:17] <_joe_>	 lol
[13:51:08] <_joe_>	 ok so
[13:51:13] <_joe_>	 all timers seem to be down now
[13:51:31] <claime>	 fails reset
[13:51:37] <claime>	 re-running step
[13:51:39] <volans>	 we can re-run it (or all for that matter) as they should be idempotent
[13:51:40] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance
[13:51:40] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve1007.eqiad.wmnet with OS bullseye
[13:52:06] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0)
[13:52:10] <volans>	 yay
[13:52:11] <claime>	 There.
[13:52:12] <_joe_>	 cool
[13:52:16] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1007.eqiad.wmnet with OS bullseye
[13:52:19] <claime>	 Breathing until 1400
[13:52:22] <claime>	 elukey: please stop
[13:52:22] <_joe_>	 elukey: ahem
[13:53:00] <_joe_>	 ok, I'd say we are ok to go personally
[13:53:15] <_joe_>	 should we wait for 15:00 ?
[13:53:17] <claime>	 I think so too, but holding until the actual maintenance time.
[13:53:19] <claime>	 Yes.
[13:53:25] <_joe_>	 booo :D
[13:53:30] <claime>	 I have a planned maintenance scheduled to go up in statuspage
[13:53:35] <claime>	 I'd rather respect it
[13:53:37] <claime>	 :p
[13:53:38] <marostegui>	 yeah let's wait for 14:00 utc
[13:53:40] <_joe_>	 yes yes I'm joking
[13:53:45] <claime>	 ik ik
[13:53:52] <akosiaris>	 stick to the plan :P
[13:53:57] <_joe_>	 I was playing on the cowboy theme
[13:54:04] <claime>	 yeehaw
[13:54:05] <_joe_>	 but, I was serious on the GO
[13:54:09] <akosiaris>	 not the best time ? 
[13:54:12] <_joe_>	 I think we're set
[13:54:19] <marostegui>	 el-p
[13:54:22] <elukey>	 claime: ah snap sorry, I retried since it was failed and realized
[13:54:25] <claime>	 akosiaris: it's ok, I'm deep breathing
[13:54:28] <claime>	 :P
[13:56:07] <logmsgbot>	 !log elukey@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1007.eqiad.wmnet with OS bullseye
[13:56:13] <elukey>	 done :)
[13:56:18] <claime>	 ack
[13:56:58] <akosiaris>	 is anyone recording listen to wikipedia ?
[13:57:13] <_joe_>	 no I'm just listening
[13:57:21] <_joe_>	 it's the best feedback about ro-mode
[13:57:26] <wikibugs>	 SRE, serviceops, CommRel-Specialists-Support (Jan-Mar-2023), Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (Trizek-WMF)
[13:57:28] <_joe_>	 without needing to actually try an edit
[13:57:54] <claime>	 T-3 minutes, final GO/NOGO check before read-only
[13:58:03] <marostegui>	 I am ready
[13:58:03] <_joe_>	 and btw, the trick is to also select one wiki per section, so I added wikidata, itwiki, frwiki, dewiki, eswiki, ruwiki
[13:58:39] <marostegui>	 _joe_: I have enwiki
[13:58:45] <marostegui>	 And commons
[13:59:34] <claime>	 Everybody set ?
[13:59:37] <marostegui>	 yep
[13:59:39] <volans>	 ready
[13:59:57] <_joe_>	 so the read-only set will take 15-30 seconds to propagate, but we can proceed with setting the dbs readonly 
[14:00:03] <claime>	 here we go
[14:00:04] <jouncebot>	 claime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Datacenter Switchover - Mediawiki deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230301T1400).
[14:00:10] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.02-set-readonly
[14:00:10] <logmsgbot>	 !log cgoubert@cumin1001 MediaWiki read-only period starts at: 2023-03-01 14:00:10.075167
[14:00:22] <_joe_>	 ah nevermind, it checks itself
[14:00:29] <volans>	 silence
[14:00:32] <_joe_>	 silence here too
[14:00:39] <marostegui>	 same 
[14:00:39] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0)
[14:00:40] <stashbot>	 cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[14:00:40] <_joe_>	 well almost silence
[14:00:41] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly
[14:00:42] <stashbot>	 cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[14:00:58] <marostegui>	 ^ expected
[14:01:02] <marostegui>	 wikitech is on s6
[14:01:06] <_joe_>	 yes, sigh
[14:01:15] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0)
[14:01:16] <_joe_>	 we forgot this
[14:01:16] <stashbot>	 cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[14:01:16] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki
[14:01:17] <stashbot>	 cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[14:01:29] <_joe_>	 switching mediawiki
[14:01:42] <wikibugs>	 Puppet, Infrastructure-Foundations, Packaging, Patch-For-Review: apt: improve apt failover orchestration - https://phabricator.wikimedia.org/T330849 (Aklapper)
[14:01:56] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0)
[14:01:57] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite
[14:01:57] <stashbot>	 cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[14:01:59] <stashbot>	 cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[14:02:00] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0)
[14:02:01] <stashbot>	 cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[14:02:02] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite
[14:02:03] <stashbot>	 cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[14:02:09] <akosiaris>	 sound
[14:02:09] <logmsgbot>	 !log cgoubert@cumin1001 MediaWiki read-only period ends at: 2023-03-01 14:02:09.272468
[14:02:09] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0)
[14:02:10] <volans>	 sounds
[14:02:12] <marostegui>	 enwiki codfw master receiving writes
[14:02:14] <claime>	 Out
[14:02:20] * claime breathes
[14:02:23] <_joe_>	 wooo
[14:02:24] <marostegui>	 same with commons
[14:02:27] <jynus>	 edit went through on eswiki (s7) too
[14:02:36] <claime>	 Starting post-RO steps
[14:02:37] <_joe_>	 wikidata too
[14:02:43] <akosiaris>	 👏 👏 👏
[14:02:44] <_joe_>	 s3 and s5 and s4 too
[14:02:44] <volans>	 119s of RO time
[14:02:47] <Amir1>	 niiiice
[14:02:51] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners
[14:02:53] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners (exit_code=0)
[14:03:00] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance
[14:03:02] <jynus>	 not finished
[14:03:23] <marostegui>	 anyone monitoring fatals?
[14:03:26] <_joe_>	 jynus: what?
[14:03:39] <jynus>	 _joe_: I mean we are not done, so let's not celebrate early
[14:03:52] <akosiaris>	 don't break the mood, we know 
[14:03:58] <jynus>	 :-)
[14:04:05] <claime>	 We're out of the hairy part though
[14:04:05] <_joe_>	 fatals are going down
[14:04:10] <_joe_>	 yes we are
[14:04:15] <_joe_>	 and the latency is ok too
[14:04:17] <marostegui>	 _joe_: thanks
[14:04:17] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:04:28] <_joe_>	 uhm
[14:04:31] <_joe_>	 jobrunners
[14:04:33] <claime>	 POSTS going up on appservers
[14:04:39] <_joe_>	 let me check the jobrunners for a sec
[14:04:43] <Amir1>	 the dashboard to check just in case https://logstash.wikimedia.org/app/dashboards#/view/mediawiki-errors
[14:04:43] <claime>	 _joe_: envoy restarts I bet
[14:04:52] <taavi>	 wikitech edits work fine, and its weird job running setup works too
[14:04:56] <claime>	 Documentation says to expect 500s
[14:05:32] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0)
[14:05:36] <jynus>	 It was just a spike of "Wikimedia\Rdbms\DBReadOnlyError: Database is read-only: You can't edit now. This is because of maintenance. Copy and save your text and try again in a few minutes", now gone
[14:05:39] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.09-restore-ttl
[14:05:39] <_joe_>	 jobs have moved to codfw
[14:05:48] <volans>	 [for later] output of 08-start-maintenance could be improved :-P
[14:05:49] <_joe_>	 jynus: yeah just a bit late
[14:05:57] <Amir1>	 insertion works but I'm not seeing processing yet https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1
[14:06:01] <marostegui>	 SAL working fine
[14:06:09] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.09-restore-ttl (exit_code=0)
[14:06:10] <claime>	 Once TTLs are restored, I'll merge the DNS change
[14:06:20] <wikibugs>	 (PS2) Clément Goubert: db: Switch dns master alias to codfw [dns] - https://gerrit.wikimedia.org/r/891552 (https://phabricator.wikimedia.org/T327920)
[14:06:31] <wikibugs>	 (CR) Clément Goubert: db: Switch dns master alias to codfw [dns] - https://gerrit.wikimedia.org/r/891552 (https://phabricator.wikimedia.org/T327920) (owner: Clément Goubert)
[14:06:37] <claime>	 waiting on jenkings
[14:06:39] <claime>	 -g
[14:06:57] <_joe_>	 Amir1: the graphs are broken for some reason
[14:06:58] <marostegui>	 Someone needs to check why SAL works fine but there's no response to !log irc commands
[14:07:03] <Amir1>	 sigh
[14:07:13] <_joe_>	 it's tcpircbot I guess marostegui 
[14:07:16] <Amir1>	 I guess hard-coded to eqiad, maybe
[14:07:16] <akosiaris>	 tcpircbot probably needs a restart
[14:07:19] <taavi>	 the job dashboard works if you switch to the codfw dashboard
[14:07:20] <jynus>	 +1 on that guess
[14:07:21] <marostegui>	 _joe_: yeah
[14:07:27] <akosiaris>	 I 'll handle that
[14:07:30] <taavi>	 no, that's stashbot expected behaviour to reduce spam here
[14:07:31] <marostegui>	 thanks akosiaris 
[14:07:31] <taavi>	 !log test
[14:07:32] <_joe_>	 thanks akosiaris 
[14:07:33] <akosiaris>	 that == tcpircbot
[14:07:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:45] <_joe_>	 ah I see
[14:07:47] <taavi>	 it shows the success message for humans, but for bots it's errors only
[14:07:50] <akosiaris>	 ah, here we are. thanks
[14:07:57] <marostegui>	 ah cool, problem solved thanks taavi 
[14:08:02] <wikibugs>	 (CR) Clément Goubert: [C: +2] db: Switch dns master alias to codfw [dns] - https://gerrit.wikimedia.org/r/891552 (https://phabricator.wikimedia.org/T327920) (owner: Clément Goubert)
[14:08:18] <claime>	 !log Phase 9.5 Update DNS records for new database masters - T327920
[14:08:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:24] <stashbot>	 T327920: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920
[14:08:30] <marostegui>	 claime: I will change pcX later, not important
[14:09:02] <claime>	 marostegui: ack
[14:09:07] <_joe_>	 Amir1: can you check logstash for unexpected mw errors?
[14:09:13] <Amir1>	 maybe we should have a script to create the dns change
[14:09:15] <_joe_>	 I'm taking a look at the cluster's health
[14:09:16] <Amir1>	 sure
[14:09:17] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:09:19] <jynus>	 I will check redlinks, category updates or transcodes for job execution
[14:09:56] <claime>	 !log Phase 9.5 DNS records for new database masters updated - T327920
[14:10:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:04] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters
[14:10:08] <_joe_>	 https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&from=now-30m&to=now&viewPanel=9 lol amazing
[14:10:09] <jynus>	 transcodes seems to be happening well
[14:10:13] <Amir1>	 _joe_: nothing 
[14:10:15] <akosiaris>	 I screen captured btw listen to wikipedia, it will be a nice addition to the email :-) 
[14:10:26] <claime>	 akosiaris: <3
[14:10:34] <claime>	 I'll be able to relisten to it :D
[14:10:36] <marostegui>	 codfw masters they all seem stable
[14:10:37] <_joe_>	 wow no latency increase at all
[14:10:39] <claime>	 Nice memory
[14:10:47] <volans>	 [for later] the batch_size for the puppet run on DB hosts can be increased from the current 5
[14:10:50] <_joe_>	 thanks, multidc mediawiki
[14:10:54] <jynus>	 yeah, that is way way better
[14:10:59] <claime>	 _joe_: without warmup too
[14:11:03] <taavi>	 are the "MySQL server has gone away" errors for GrowthExperiments known/expected?
[14:11:15] <_joe_>	 I guess not
[14:11:20] <_joe_>	 taavi: link?
[14:11:22] <TheresNoTime>	 they've been going for longer than the switchover fwiw
[14:11:25] <marostegui>	 taavi: they've been there for a while 
[14:11:28] <_joe_>	 ah I see
[14:11:31] <_joe_>	 ok
[14:11:38] <_joe_>	 so working as expected(TM)
[14:11:40] <taavi>	 https://logstash.wikimedia.org/goto/195ec9c292098639e5fe4884d38fcf53
[14:11:42] <jynus>	 recategorizations also working well and fast, I see no obvious job issue atm
[14:11:50] <taavi>	 ah
[14:12:03] <TheresNoTime>	 "correctly broken" :)
[14:12:05] <marostegui>	 taavi: We still need to get someone to look at them, but they aren't related to the switch
[14:12:22] <claime>	 appserver/api_appserver/parsoid graphs looking healthy
[14:12:44] <Amir1>	 and the thread deadlocks are expected too
[14:12:51] <Amir1>	 like not expected but known
[14:13:04] <claime>	 puppet run on db still going btw
[14:13:07] <_joe_>	 marostegui: uhm they appear to be more frequent after the switch though
[14:13:12] <marostegui>	 I am going to reduce db2122's weight a bit, as it is having a spike of load
[14:13:20] <akosiaris>	 IIRC, we used to need some rebalancing of databases after each switchover, is that still the case?
[14:13:27] <marostegui>	 akosiaris: no, because we have multidc
[14:13:30] <akosiaris>	 and marostegui was faster than my question :-)
[14:13:50] <akosiaris>	 marostegui: your rebalancing begs to differ btw :P 
[14:13:54] <_joe_>	 akosiaris: it's possible that some rw-traffic changes slightly things
[14:14:09] <_joe_>	 but we're not in a situation on the cliff
[14:14:11] <jynus>	 plus job queue backlog
[14:14:12] <akosiaris>	 but it should be definitely way way better now (in theory)
[14:14:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Reduce db2122 weight', diff saved to https://phabricator.wikimedia.org/P44913 and previous config saved to /var/cache/conftool/dbconfig/20230301-141414-marostegui.json
[14:14:25] <_joe_>	 akosiaris: we don't have ES on fire, for instance
[14:14:30] <akosiaris>	 yes
[14:14:36] <cdanis>	 I eagerly await work on T265386
[14:14:36] <stashbot>	 T265386: Make LoadMonitor server states more up-to-date and respond to outages more quickly - https://phabricator.wikimedia.org/T265386
[14:14:53] <Amir1>	 don't worry we will have all of this once we plan to repool eqiad after being depooled for a month
[14:15:03] <claime>	 _joe_: Also, no action needed for ES
[14:15:07] <claime>	 Like, none at all.
[14:15:11] <_joe_>	 claime: exactly
[14:15:20] <_joe_>	 in the past they'd be onfire for 10-15 minutes
[14:15:26] <akosiaris>	 Amir1: we are pooling it read only in a week
[14:15:28] <_joe_>	 with nice consequences for the appservers
[14:15:38] <claime>	 downtime removal in progress
[14:15:38] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters (exit_code=0)
[14:15:45] <akosiaris>	 1 week fully on codfw, 7 weeks multidc with eqiad being the secondary, that's the plan
[14:15:46] <claime>	 And we're done with the cookbook now
[14:15:59] <Amir1>	 akosiaris: oh clever
[14:16:04] <claime>	 !log Removing scap lock - T327920
[14:16:07] <jynus>	 now yes, great job
[14:16:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:10] <stashbot>	 T327920: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920
[14:16:19] <TheresNoTime>	 great work y'all :)
[14:16:25] <volans>	 are we already good to resolve the status page "incident"?
[14:16:27] <claime>	 Merging https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/892428
[14:16:40] <wikibugs>	 (CR) Clément Goubert: [C: +2] debug.json: List primary DC servers first [mediawiki-config] - https://gerrit.wikimedia.org/r/892428 (https://phabricator.wikimedia.org/T327920) (owner: Clément Goubert)
[14:16:50] <wikibugs>	 (PS1) Marostegui: wmnet: Update pcX DNS [dns] - https://gerrit.wikimedia.org/r/893479 (https://phabricator.wikimedia.org/T327920)
[14:16:53] <_joe_>	 claime: ah that's a nice touch
[14:17:07] <_joe_>	 now you have to scap it :D
[14:17:09] <claime>	 thank legoktm for adding it to the procedure :D
[14:17:24] <wikibugs>	 (Merged) jenkins-bot: debug.json: List primary DC servers first [mediawiki-config] - https://gerrit.wikimedia.org/r/892428 (https://phabricator.wikimedia.org/T327920) (owner: Clément Goubert)
[14:17:28] <_joe_>	 claime: ok with resolving the incident?
[14:17:34] <claime>	 Yes.
[14:17:49] <wikibugs>	 SRE, serviceops, CommRel-Specialists-Support (Jan-Mar-2023), Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (Trizek-WMF) It happened. The next step, next week: debrief the process.
[14:17:51] <Amir1>	 I'll check your patch Manuel
[14:17:54] <claime>	 backporting change
[14:17:57] <marostegui>	 Amir1: thanks, no rush
[14:18:24] <logmsgbot>	 !log cgoubert@deploy2002 Started scap: Backport for [[gerrit:892428|debug.json: List primary DC servers first (T327920)]]
[14:19:04] <claime>	 <joke> k8s build is taking a long time *blows through nose* </joke>
[14:19:08] <wikibugs>	 (CR) Ladsgroup: [C: +1] "new values are correct." [dns] - https://gerrit.wikimedia.org/r/893479 (https://phabricator.wikimedia.org/T327920) (owner: Marostegui)
[14:19:10] <jynus>	 question, is eqiad depooled?
[14:19:17] <claime>	 jynus: at what layer?
[14:19:26] <jynus>	 because I see almost no tls connections in mysql
[14:19:36] <claime>	 It's traffic depooled since yesterday
[14:19:39] <volans>	 it's depooled at traffic and mw and multiple service layers
[14:19:40] <akosiaris>	 that's expected, eqiad is doing almost nothing right now
[14:19:47] <claime>	 It's chillin'
[14:19:52] <cdanis>	 claime: it's celebrating
[14:19:53] <claime>	 enjoying its rest
[14:20:06] <jynus>	 I see, thanks, this is the graph that I saw not going back to normal
[14:20:17] <akosiaris>	 jynus: you got 1 week to wreak havoc in whatever you want in eqiad. in 1 week we repool it as readonly
[14:20:22] <cdanis>	 🍾 🎉
[14:20:30] <logmsgbot>	 !log cgoubert@deploy2002 cgoubert: Backport for [[gerrit:892428|debug.json: List primary DC servers first (T327920)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[14:20:31] <jynus>	 https://grafana.wikimedia.org/goto/9xs6AJxVk?orgId=1
[14:20:41] <Amir1>	 I always felt like this time is like when a very busy airport is shut down for maintenance, now it's time to do all sorts of crazy
[14:20:43] <claime>	 Checking email flow
[14:21:01] <wikibugs>	 (CR) Marostegui: [C: +2] wmnet: Update pcX DNS [dns] - https://gerrit.wikimedia.org/r/893479 (https://phabricator.wikimedia.org/T327920) (owner: Marostegui)
[14:21:10] <Amir1>	 I have a couple of schema changes I want to run there
[14:21:13] <Amir1>	 mwhahahahahahah
[14:21:59] <_joe_>	 Amir1: let's wait for tomorrow maybe
[14:22:01] <claime>	 Email flowing
[14:22:29] <claime>	 eventstreams flowing
[14:22:52] <claime>	 scap is finishing helmfile apply, and then we're done
[14:23:02] <taavi>	 can I re-start a mw maintenance script that I was running? or do you want me to wait a bit?
[14:23:14] <claime>	 Give it a sec I'm running a backport
[14:24:08] <claime>	 (I know it shouldn't conflict but I'm more comfortable that way <3)
[14:24:19] <marostegui>	 claime: pcX dns changes merged and deployed
[14:24:27] <claime>	 marostegui: awesome, thank you <3
[14:25:14] <claime>	 It's restarting php-fpm, 65% done
[14:25:51] <moritzm>	 are cookbook runs good to go or need more time for checks etc?
[14:26:18] <logmsgbot>	 !log cgoubert@deploy2002 Finished scap: Backport for [[gerrit:892428|debug.json: List primary DC servers first (T327920)]] (duration: 07m 54s)
[14:26:23] <claime>	 I'm done.
[14:26:24] <stashbot>	 T327920: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920
[14:26:35] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[14:27:03] <claime>	 !log End mediawiki datacenter switchover - T327920
[14:27:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:30] <godog>	 \o/ \o/ \o/ \o/ congrats and nicely done everyone
[14:27:36] <taavi>	 !log re-start persistRevisionThreadItems.php on itwiki from P44912 after DC switchover T315510
[14:27:37] <jbond>	 great work
[14:27:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:41] <stashbot>	 T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510
[14:28:10] <wikibugs>	 (PS1) Hashar: Revert "ci: Permit ES traffic from jenkins masters to relforge" [puppet] - https://gerrit.wikimedia.org/r/893457
[14:28:23] <wikibugs>	 (CR) CI reject: [V: -1] Revert "ci: Permit ES traffic from jenkins masters to relforge" [puppet] - https://gerrit.wikimedia.org/r/893457 (owner: Hashar)
[14:28:33] <claime>	 moritzm: taavi you can go ahead
[14:28:42] <moritzm>	 ack, thx
[14:29:39] <Amir1>	 zabe: you probably need to restart your scripts too, from mwmaint200x
[14:29:50] <elukey>	 great work folks!
[14:29:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.aqs.roll-restart-reboot rolling restart_daemons on A:aqs-canary
[14:30:28] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve1006.eqiad.wmnet with OS bullseye
[14:30:37] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve1008.eqiad.wmnet with OS bullseye
[14:30:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.aqs.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:aqs-canary
[14:32:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.aqs.roll-restart-reboot rolling restart_daemons on A:aqs-codfw
[14:33:21] <wikibugs>	 (PS2) Hashar: elastic relforge: rm rules for Jenkins [puppet] - https://gerrit.wikimedia.org/r/893457 (https://phabricator.wikimedia.org/T78705)
[14:33:28] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve1006.eqiad.wmnet with OS bullseye
[14:34:52] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=thanos-fe2002.codfw.wmnet,service=thanos-web
[14:34:54] <wikibugs>	 (CR) Hashar: "The iptables rules from Jenkins to Relforge Elastic search are no more used. It was a one off experiment back in 2014/2015 :)  The rules a" [puppet] - https://gerrit.wikimedia.org/r/893457 (https://phabricator.wikimedia.org/T78705) (owner: Hashar)
[14:37:20] <Amir1>	 marostegui: okay if I do some schema changes on eqiad masters?
[14:40:17] <wikibugs>	 SRE: Allow Wikimedia Maps usage on <domain> - https://phabricator.wikimedia.org/T330881 (Serviziperinternet)
[14:40:34] <marostegui>	 Amir1: We said no db maintenance till monday
[14:40:56] <marostegui>	 Amir1: also, eqiad -> codfw replication is still enabled (that's why)
[14:42:01] <Amir1>	 ah I forgot, sorry. It wasn't planned to replicate but I see, nothing urgent
[14:45:11] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1005
[14:45:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.aqs.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:aqs-codfw
[14:45:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.aqs.roll-restart-reboot rolling restart_daemons on A:aqs-eqiad
[14:47:01] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:47:46] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1 C: +2] prometheus: scrape pint [puppet] - https://gerrit.wikimedia.org/r/893466 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[14:48:21] <wikibugs>	 (PS1) David Caro: harbor: move to epp template for the config file [puppet] - https://gerrit.wikimedia.org/r/893480
[14:48:23] <wikibugs>	 (PS1) David Caro: harbor: Add robot accounts info [puppet] - https://gerrit.wikimedia.org/r/893481
[14:48:24] <claime>	 Thank you all for helping make this a really smooth switchover <3
[14:49:42] <wikibugs>	 (PS1) Hnowlan: thumbor: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/893482
[14:49:44] <TheresNoTime>	 jouncebot: nowandnext
[14:49:44] <jouncebot>	 For the next 0 hour(s) and 10 minute(s): Datacenter Switchover - Mediawiki (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230301T1400)
[14:49:44] <jouncebot>	 In 3 hour(s) and 10 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230301T1800)
[14:49:59] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:50:47] <wikibugs>	 (CR) CI reject: [V: -1] harbor: Add robot accounts info [puppet] - https://gerrit.wikimedia.org/r/893481 (owner: David Caro)
[14:52:21] <logmsgbot>	 !log dcaro@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1005
[14:53:09] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:54:28] <volans>	 FYI if you re-run the wmf-update-known-hosts-production you get the DB master known hosts updated ;)
[14:54:51] <wikibugs>	 (CR) Hnowlan: [C: +2] thumbor: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/893482 (owner: Hnowlan)
[14:56:03] <wikibugs>	 (PS3) David Caro: wmcs.ceph: move cloudcephosd1005/1010 to f4 [puppet] - https://gerrit.wikimedia.org/r/888663 (https://phabricator.wikimedia.org/T329504)
[14:56:39] <wikibugs>	 (CR) David Caro: [C: +2] wmcs.ceph: move cloudcephosd1005/1010 to f4 [puppet] - https://gerrit.wikimedia.org/r/888663 (https://phabricator.wikimedia.org/T329504) (owner: David Caro)
[14:57:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.aqs.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:aqs-eqiad
[14:58:25] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:59:29] <wikibugs>	 (Merged) jenkins-bot: thumbor: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/893482 (owner: Hnowlan)
[14:59:40] <wikibugs>	 (PS1) Hashar: contint: manage dsh target from Puppet DB [puppet] - https://gerrit.wikimedia.org/r/893483
[14:59:42] <wikibugs>	 (PS1) Hashar: contint: manage jenkins-ci dsh group from Puppet DB [puppet] - https://gerrit.wikimedia.org/r/893484 (https://phabricator.wikimedia.org/T328920)
[14:59:44] <wikibugs>	 (PS1) Hashar: releases: manage jenkins-rel dsh group from Puppet DB [puppet] - https://gerrit.wikimedia.org/r/893485 (https://phabricator.wikimedia.org/T323909)
[15:00:37] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:00:55] <wikibugs>	 (PS1) Muehlenhoff: Add a cookbook to roll-restart Restbase [cookbooks] - https://gerrit.wikimedia.org/r/893486
[15:01:03] <icinga-wm>	 PROBLEM - IPMI Sensor Status on ml-cache1001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:02:11] <wikibugs>	 (CR) Hashar: "I am not sure who can best review this change to how the dsh targets are generated. Giuseppe has introduced the pattern in https://gerrit." [puppet] - https://gerrit.wikimedia.org/r/893483 (owner: Hashar)
[15:02:23] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply
[15:02:35] <wikibugs>	 (CR) CI reject: [V: -1] Add a cookbook to roll-restart Restbase [cookbooks] - https://gerrit.wikimedia.org/r/893486 (owner: Muehlenhoff)
[15:04:08] <wikibugs>	 (PS2) Muehlenhoff: Add a cookbook to roll-restart Restbase [cookbooks] - https://gerrit.wikimedia.org/r/893486
[15:04:51] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1005']
[15:06:10] <hashar>	 !log Restarting Apache on Gerrit host
[15:06:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:00] <wikibugs>	 (PS3) Muehlenhoff: Add a cookbook to roll-restart Restbase [cookbooks] - https://gerrit.wikimedia.org/r/893486
[15:08:21] <wikibugs>	 (PS2) David Caro: harbor: Add robot accounts info [puppet] - https://gerrit.wikimedia.org/r/893481
[15:08:43] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (ayounsi)
[15:09:05] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve1006.eqiad.wmnet with OS bullseye
[15:09:18] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve1006.eqiad.wmnet with OS bullseye
[15:09:21] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:11:47] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1005']
[15:12:23] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[15:12:30] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[15:13:45] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[15:13:52] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (ayounsi)
[15:14:02] <wikibugs>	 (CR) Muehlenhoff: "If there's preference for a dedicated/different category other than "misc-clusters", happy to amend." [cookbooks] - https://gerrit.wikimedia.org/r/893486 (owner: Muehlenhoff)
[15:17:21] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:18:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.elasticsearch.restart-nginx rolling restart_daemons on A:elastic-canary
[15:20:14] <wikibugs>	 (CR) JMeybohm: [C: +1] Add a spark-operator chart and helmfile configuration [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: Btullis)
[15:20:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.elasticsearch.restart-nginx (exit_code=0) rolling restart_daemons on A:elastic-canary
[15:21:59] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:22:18] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1005']
[15:23:29] <wikibugs>	 (PS1) Muehlenhoff: sre.elasticsearch.restart-nginx: Fix typo which breaks aliases [cookbooks] - https://gerrit.wikimedia.org/r/893490
[15:26:11] <wikibugs>	 SRE, SRE-tools, Spicerack: Retire sre.aqs.roll-restart cookbook - https://phabricator.wikimedia.org/T330889 (MoritzMuehlenhoff)
[15:26:21] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Spicerack: Retire sre.aqs.roll-restart cookbook - https://phabricator.wikimedia.org/T330889 (MoritzMuehlenhoff) p:Triage→Medium
[15:26:49] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:27:50] <wikibugs>	 (CR) Muehlenhoff: [C: +2] sre.elasticsearch.restart-nginx: Fix typo which breaks aliases [cookbooks] - https://gerrit.wikimedia.org/r/893490 (owner: Muehlenhoff)
[15:28:11] <logmsgbot>	 !log root@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1005']
[15:30:23] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:32:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.elasticsearch.restart-nginx rolling restart_daemons on A:elastic-codfw
[15:35:36] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1005']
[15:35:59] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:38:12] <wikibugs>	 10SRE, 10Citoid: citoid having stability issues - https://phabricator.wikimedia.org/T330768 (10JMeybohm) IIRC it's pretty common for citoid to get OOM killed from time to time and that that is kind of expected.
[15:39:05] <wikibugs>	 (03PS1) 10Jbond: P:confd: Add support for discovery facts [puppet] - 10https://gerrit.wikimedia.org/r/893496 (https://phabricator.wikimedia.org/T330849)
[15:39:21] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: installation failed due to read-only database
[15:39:34] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: installation failed due to read-only database
[15:39:35] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:41:09] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:41:21] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:43:34] <wikibugs>	 (03CR) 10Bking: [C: 03+1] elastic relforge: rm rules for Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/893457 (https://phabricator.wikimedia.org/T78705) (owner: 10Hashar)
[15:44:45] <logmsgbot>	 !log root@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1005']
[15:44:48] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:46:10] <wikibugs>	 10SRE, 10Maps: Allow Wikimedia Maps usage on wikiness.pro - https://phabricator.wikimedia.org/T330881 (10JMeybohm)
[15:46:47] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:49:10] <wikibugs>	 10SRE, 10Maps: Allow Wikimedia Maps usage on wikiness.pro - https://phabricator.wikimedia.org/T330881 (10JMeybohm) Welcome @Serviziperinternet! As of https://wikitech.wikimedia.org/wiki/Maps/External_usage "//maps.wikimedia.org tiles may only be used by Wikimedia wikis, and sites hosted by Wikimedia Affiliates...
[15:49:48] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:54:48] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:56:24] <wikibugs>	 (03CR) 10Hashar: "recheck after deployment of https://gerrit.wikimedia.org/r/c/integration/config/+/893416" [deployment-charts] - 10https://gerrit.wikimedia.org/r/886358 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan)
[15:57:18] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1005']
[15:57:45] <logmsgbot>	 !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1005']
[16:00:17] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1005.eqiad.wmnet with OS bullseye
[16:01:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.elasticsearch.restart-nginx (exit_code=0) rolling restart_daemons on A:elastic-codfw
[16:02:10] <bblack>	 !log cr[23]-esams: manually adding brett's ssh-rsa to match https://gerrit.wikimedia.org/r/c/operations/homer/public/+/892551
[16:02:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:10] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: sync
[16:05:45] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jhathaway) @jbond SRV support does look interesting, it appears they did some work to make it more production ready, https://tickets.puppetlabs.com/browse/PUP-7550. There are a couple of open...
[16:10:40] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jhathaway) @MoritzMuehlenhoff & @jbond thanks for putting together this plan. I think the plan sounds really sensible. I am particularly curious as to how robust the backward compatibility is...
[16:11:29] <wikibugs>	 (03CR) 10JHathaway: "Would love if you could take another look at this when you have a moment." [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway)
[16:12:02] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[16:15:16] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: sync
[16:15:42] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[16:16:03] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:16:10] <taavi>	 jouncebot: nowandnext
[16:16:10] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 43 minute(s)
[16:16:10] <jouncebot>	 In 1 hour(s) and 43 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230301T1800)
[16:16:30] <wikibugs>	 (03PS2) 10Majavah: Set OATHAuthMultipleDevicesMigrationStage to MIGRATION_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891833 (https://phabricator.wikimedia.org/T242031)
[16:16:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891833 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah)
[16:17:16] <wikibugs>	 (03PS2) 10Stang: Update logo/wordmark/tagline for Serbian project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892955 (https://phabricator.wikimedia.org/T324545)
[16:17:30] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[16:17:32] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[16:17:36] <wikibugs>	 (03Merged) 10jenkins-bot: Set OATHAuthMultipleDevicesMigrationStage to MIGRATION_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891833 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah)
[16:17:43] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[16:17:58] <logmsgbot>	 !log taavi@deploy2002 Started scap: Backport for [[gerrit:891833|Set OATHAuthMultipleDevicesMigrationStage to MIGRATION_OLD (T242031)]]
[16:18:03] <stashbot>	 T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031
[16:19:39] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply
[16:19:43] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[16:20:00] <logmsgbot>	 !log taavi@deploy2002 taavi: Backport for [[gerrit:891833|Set OATHAuthMultipleDevicesMigrationStage to MIGRATION_OLD (T242031)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[16:20:19] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply
[16:20:20] <TheresNoTime>	 seeing T242031 getting work is exciting!
[16:20:30] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[16:20:47] <icinga-wm>	 PROBLEM - Check systemd state on apt1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-aptrepo-apt2001.wikimedia.org.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:21:09] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[16:21:11] <wikibugs>	 (03PS3) 10Stang: Update logo/wordmark/tagline for Serbian project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892955 (https://phabricator.wikimedia.org/T324545)
[16:21:16] <taavi>	 TheresNoTime: if you want to see that moving forward, reviews on https://gerrit.wikimedia.org/r/c/873892 would be very much appreciated
[16:21:25] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:24:53] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] scap bootstrap: use new installation mechanism [puppet] - 10https://gerrit.wikimedia.org/r/893473 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche)
[16:25:13] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:26:21] <logmsgbot>	 !log taavi@deploy2002 Finished scap: Backport for [[gerrit:891833|Set OATHAuthMultipleDevicesMigrationStage to MIGRATION_OLD (T242031)]] (duration: 08m 23s)
[16:26:30] <stashbot>	 T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031
[16:26:37] <wikibugs>	 10SRE, 10Maps: Allow Wikimedia Maps usage on wikiness.pro - https://phabricator.wikimedia.org/T330881 (10Aklapper) 05Open→03Stalled
[16:28:26] <XioNoX>	 !log rollback port 80 block in esams - T330683
[16:28:27] <brett>	 !log Remove dns3001 DNS request routing via juniper - T321309
[16:28:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:35] <stashbot>	 T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309
[16:30:08] <wikibugs>	 10SRE, 10Maps: Allow Wikimedia Maps usage on wikiness.pro - https://phabricator.wikimedia.org/T330881 (10Aklapper) Please do not delete templates but fill them out:  **Link to site**: ... **Purpose/details about your project**: ... **Wikimedia Affiliate supporting project**: ...
[16:30:11] <wikibugs>	 (03PS1) 10Jbond: wmflib::discovery::pooled_site: funtion to discover poled sites [puppet] - 10https://gerrit.wikimedia.org/r/893502
[16:36:01] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:36:07] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Address problems found by 'pint' [alerts] - 10https://gerrit.wikimedia.org/r/893504 (https://phabricator.wikimedia.org/T309182)
[16:36:09] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Add 'pint' integration [alerts] - 10https://gerrit.wikimedia.org/r/893505 (https://phabricator.wikimedia.org/T309182)
[16:37:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add 'pint' integration [alerts] - 10https://gerrit.wikimedia.org/r/893505 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[16:37:10] <wikibugs>	 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, 10cloud-services-team: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10bd808) >>! In T330847#8656124, @Marostegui wrote: > would Thursday 9th at 16:00 UTC work for you all? That date and time work for me.
[16:39:37] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:42:56] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] kserve: add replicas setting for Deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/893417 (https://phabricator.wikimedia.org/T324542) (owner: 10Elukey)
[16:42:58] <wikibugs>	 (03PS2) 10Filippo Giunchedi: Add 'pint' integration [alerts] - 10https://gerrit.wikimedia.org/r/893505 (https://phabricator.wikimedia.org/T309182)
[16:45:45] <wikibugs>	 10SRE, 10Traffic: HTTP URIs do not resolve from NL and DE? - https://phabricator.wikimedia.org/T330906 (10Vgutierrez)
[16:46:39] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:47:19] <wikibugs>	 10SRE, 10Traffic: HTTP URIs do not resolve from NL and DE? - https://phabricator.wikimedia.org/T330906 (10hashar)
[16:52:25] <wikibugs>	 (03PS2) 10Jbond: wmflib::discovery::pooled_site: funtion to discover poled sites [puppet] - 10https://gerrit.wikimedia.org/r/893502
[16:52:27] <wikibugs>	 (03PS1) 10Jbond: aptrepo: fix linting issues and docs [puppet] - 10https://gerrit.wikimedia.org/r/893510
[16:56:35] <logmsgbot>	 !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1005.eqiad.wmnet with OS bullseye
[16:57:02] <wikibugs>	 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, 10cloud-services-team: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10dcaro) cc. @Raymond_Ndibe in case you want to try maintaindbusers at that time (uses labsdbaccounts)
[16:57:32] <wikibugs>	 (03PS2) 10Jbond: aptrepo: fix linting issues and docs [puppet] - 10https://gerrit.wikimedia.org/r/893510
[16:58:39] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I have tried to limit the cases where we use confd data to influence puppet runs, because that will propagate with even 30 minutes of dela" [puppet] - 10https://gerrit.wikimedia.org/r/893502 (owner: 10Jbond)
[16:59:04] <_joe_>	 jbond: I might have misunderstood the intentions of your patch, but I hope my reservations are clear enough
[16:59:26] <_joe_>	 basically we have a couple places where puppet runs depend on conftool state and I dread it
[16:59:43] <_joe_>	 I fear that such functions would enable doing it more
[17:00:08] <jbond>	 _joe_: just about to jump into a meeting, might ping you to chat about it tomorrow, but yes that is exactly what i was trying to do :)
[17:01:30] <_joe_>	 absolutely let's talk tomorrow :)
[17:01:36] <_joe_>	 (I'm also in a meeting)
[17:01:47] <jbond>	 cool thanks, I'll ping you tomorrow
[17:05:07] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1005']
[17:06:43] <logmsgbot>	 !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1005']
[17:17:11] <icinga-wm>	 PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:19:10] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/893486 (owner: 10Muehlenhoff)
[17:19:49] <wikibugs>	 (03PS1) 10Elukey: admin_ng: set kserve values for ml-serve-{eqiad,codfw} clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/893513 (https://phabricator.wikimedia.org/T324542)
[17:19:53] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: labstore1004: allow incoming HTTP connections from cloudcontrol servers [puppet] - 10https://gerrit.wikimedia.org/r/893514 (https://phabricator.wikimedia.org/T330916)
[17:20:14] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] labstore1004: allow incoming HTTP connections from cloudcontrol servers [puppet] - 10https://gerrit.wikimedia.org/r/893514 (https://phabricator.wikimedia.org/T330916) (owner: 10Arturo Borrero Gonzalez)
[17:21:12] <wikibugs>	 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream: Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10lbowmaker)
[17:21:19] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: labstore1004: allow incoming HTTP connections from cloudcontrol servers [puppet] - 10https://gerrit.wikimedia.org/r/893514 (https://phabricator.wikimedia.org/T330916)
[17:21:53] <wikibugs>	 (03PS3) 10Jbond: aptrepo: fix linting issues and docs [puppet] - 10https://gerrit.wikimedia.org/r/893510
[17:24:22] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ml-serve1006.eqiad.wmnet with OS bullseye
[17:24:47] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1005.eqiad.wmnet with OS bullseye
[17:25:05] <wikibugs>	 (03PS4) 10Jbond: aptrepo: fix linting issues and docs [puppet] - 10https://gerrit.wikimedia.org/r/893510
[17:25:49] <jbond>	 cdanis: nothing to report (aokoth doesn't appear to be here at the moment)
[17:25:53] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/893514/39893/labstore1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/893514 (https://phabricator.wikimedia.org/T330916) (owner: 10Arturo Borrero Gonzalez)
[17:25:56] <cdanis>	 jbond: <3
[17:26:02] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39894/console" [puppet] - 10https://gerrit.wikimedia.org/r/893510 (owner: 10Jbond)
[17:26:12] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.k8s.upgrade-cluster (exit_code=0) Upgrade K8s version: Upgrade to k8s 1.23
[17:27:26] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: labstore1004: allow incoming HTTP connections from cloudcontrol servers [puppet] - 10https://gerrit.wikimedia.org/r/893514 (https://phabricator.wikimedia.org/T330916)
[17:27:55] <wikibugs>	 (03PS2) 10Elukey: admin_ng: set kserve values for ml-serve-{eqiad,codfw} clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/893513 (https://phabricator.wikimedia.org/T324542)
[17:35:35] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:36:07] <logmsgbot>	 !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@9568478]: Deploy Airflow upgrade branch for analytics_test
[17:36:13] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@9568478]: Deploy Airflow upgrade branch for analytics_test (duration: 00m 05s)
[17:38:10] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] aptrepo: fix linting issues and docs [puppet] - 10https://gerrit.wikimedia.org/r/893510 (owner: 10Jbond)
[17:38:36] <wikibugs>	 (03CR) 10David Caro: "This will stop being applied to labstore1004, as it will stop having maintaindbusers in it no?" [puppet] - 10https://gerrit.wikimedia.org/r/893514 (https://phabricator.wikimedia.org/T330916) (owner: 10Arturo Borrero Gonzalez)
[17:38:59] <wikibugs>	 (03PS3) 10Jbond: wmflib::discovery::pooled_site: funtion to discover poled sites [puppet] - 10https://gerrit.wikimedia.org/r/893502
[17:40:20] <wikibugs>	 10SRE, 10Traffic: HTTP URIs do not resolve from NL and DE? - https://phabricator.wikimedia.org/T330906 (10Ennomeijers) Thanks for the replies! Advising to use HTTPS over HTTP makes sense.   But not supporting redirection from HTTP to HTTPS will in my opinion introduce a fundamental problem for using Wikidata a...
[17:40:53] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:40:55] <wikibugs>	 (03PS4) 10SBassett: Revert "admin: Add kelhurd to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/890848
[17:41:13] <wikibugs>	 (03CR) 10Raymond Ndibe: "Hello arturo, thanks for helping out with this! it wasn't exactly obvious where this change was to be added. I have one small question, th" [puppet] - 10https://gerrit.wikimedia.org/r/893514 (https://phabricator.wikimedia.org/T330916) (owner: 10Arturo Borrero Gonzalez)
[17:41:22] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1005.eqiad.wmnet with reason: host reimage
[17:41:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "admin: Add kelhurd to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/890848 (owner: 10SBassett)
[17:43:12] <wikibugs>	 (03CR) 10David Caro: labstore1004: allow incoming HTTP connections from cloudcontrol servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893514 (https://phabricator.wikimedia.org/T330916) (owner: 10Arturo Borrero Gonzalez)
[17:44:26] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1005.eqiad.wmnet with reason: host reimage
[17:45:37] <wikibugs>	 (03CR) 10Raymond Ndibe: labstore1004: allow incoming HTTP connections from cloudcontrol servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893514 (https://phabricator.wikimedia.org/T330916) (owner: 10Arturo Borrero Gonzalez)
[17:46:40] <wikibugs>	 (03CR) 10David Caro: labstore1004: allow incoming HTTP connections from cloudcontrol servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893514 (https://phabricator.wikimedia.org/T330916) (owner: 10Arturo Borrero Gonzalez)
[17:46:55] <wikibugs>	 (03Abandoned) 10Andrew Bogott: OpenStack: rename 'user' role to 'member' [puppet] - 10https://gerrit.wikimedia.org/r/893036 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott)
[17:47:00] <wikibugs>	 (03PS5) 10SBassett: Revert "admin: Add kelhurd to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/890848
[17:48:29] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:48:32] <wikibugs>	 10SRE, 10Traffic: HTTP URIs do not resolve from NL and DE? - https://phabricator.wikimedia.org/T330906 (10Nikki) I've noticed in the past few days that when I enter "wikidata.org" on my phone (using Vivaldi), it's sometimes really slow to load, but will load straightaway if I edit the URL to add https://. I do...
[17:52:24] <wikibugs>	 (03PS1) 10ArielGlenn: make sure all of dumpsdata1001-7 permit rsync from/to each other [puppet] - 10https://gerrit.wikimedia.org/r/893519 (https://phabricator.wikimedia.org/T330573)
[17:57:57] <wikibugs>	 (03PS3) 10Bking: elastic relforge: rm rules for Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/893457 (https://phabricator.wikimedia.org/T78705) (owner: 10Hashar)
[17:58:14] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/893457 (https://phabricator.wikimedia.org/T78705) (owner: 10Hashar)
[18:00:06] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230301T1800)
[18:00:47] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:01:29] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1003.eqiad.wmnet with OS buster
[18:01:38] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmooney@cumin1001"
[18:02:35] <wikibugs>	 (03PS1) 10Jbond: P:aptrepo: use new wmflib::discovery::pooled_sites [puppet] - 10https://gerrit.wikimedia.org/r/893522 (https://phabricator.wikimedia.org/T330849)
[18:07:27] <wikibugs>	 (03PS2) 10Jbond: P:aptrepo: use new wmflib::discovery::pooled_sites [puppet] - 10https://gerrit.wikimedia.org/r/893522 (https://phabricator.wikimedia.org/T330849)
[18:08:41] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:09:31] <icinga-wm>	 PROBLEM - Host dns3001 is DOWN: PING CRITICAL - Packet loss = 100%
[18:10:27] <wikibugs>	 (03PS3) 10Jbond: P:aptrepo: use new wmflib::discovery::pooled_sites [puppet] - 10https://gerrit.wikimedia.org/r/893522 (https://phabricator.wikimedia.org/T330849)
[18:11:11] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:11:47] <icinga-wm>	 PROBLEM - BFD status on cr3-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:12:07] <icinga-wm>	 PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:12:14] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmooney@cumin1001"
[18:12:19] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1005.eqiad.wmnet with OS bullseye
[18:12:23] <icinga-wm>	 RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:12:27] <icinga-wm>	 RECOVERY - Host dns3001 is UP: PING OK - Packet loss = 0%, RTA = 81.03 ms
[18:14:26] <wikibugs>	 (03CR) 10Bking: [C: 03+2] elastic relforge: rm rules for Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/893457 (https://phabricator.wikimedia.org/T78705) (owner: 10Hashar)
[18:15:31] <icinga-wm>	 PROBLEM - Host dns3001 is DOWN: PING CRITICAL - Packet loss = 100%
[18:18:47] <wikibugs>	 (03PS4) 10Jbond: wmflib::discovery::pooled_site: funtion to discover poled sites [puppet] - 10https://gerrit.wikimedia.org/r/893502
[18:19:06] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "adding the -1 back until discussed" [puppet] - 10https://gerrit.wikimedia.org/r/893502 (owner: 10Jbond)
[18:19:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:20:19] <icinga-wm>	 RECOVERY - Host dns3001 is UP: PING OK - Packet loss = 0%, RTA = 81.06 ms
[18:20:37] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on dns3001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[18:21:21] <icinga-wm>	 PROBLEM - Check systemd state on dns3001 is CRITICAL: CRITICAL - degraded: The following units failed: anycast-healthchecker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:21:23] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on dns3001 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[18:22:22] <wikibugs>	 (03PS4) 10Jbond: P:aptrepo: use new wmflib::discovery::pooled_sites [puppet] - 10https://gerrit.wikimedia.org/r/893522 (https://phabricator.wikimedia.org/T330849)
[18:22:25] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on dns3001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[18:22:45] <icinga-wm>	 RECOVERY - BFD status on cr3-esams is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:22:59] <icinga-wm>	 RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:23:09] <icinga-wm>	 RECOVERY - Check systemd state on dns3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:23:11] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on dns3001 is OK: OK: UP (pid=2568) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[18:23:16] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39899/console" [puppet] - 10https://gerrit.wikimedia.org/r/893522 (https://phabricator.wikimedia.org/T330849) (owner: 10Jbond)
[18:23:53] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:24:45] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:25:30] <wikibugs>	 (03PS5) 10Jbond: P:aptrepo: use new wmflib::discovery::pooled_sites [puppet] - 10https://gerrit.wikimedia.org/r/893522 (https://phabricator.wikimedia.org/T330849)
[18:27:42] <wikibugs>	 (03PS6) 10Jbond: P:aptrepo: use new wmflib::discovery::pooled_sites [puppet] - 10https://gerrit.wikimedia.org/r/893522 (https://phabricator.wikimedia.org/T330849)
[18:28:41] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39901/console" [puppet] - 10https://gerrit.wikimedia.org/r/893522 (https://phabricator.wikimedia.org/T330849) (owner: 10Jbond)
[18:32:40] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: remove SEVERITY_LABEL from syslog messages [puppet] - 10https://gerrit.wikimedia.org/r/890363 (https://phabricator.wikimedia.org/T330267) (owner: 10Cwhite)
[18:33:42] <wikibugs>	 (03PS3) 10Cwhite: profile: move phatality resources from module to profile [puppet] - 10https://gerrit.wikimedia.org/r/891392
[18:37:06] <wikibugs>	 (03PS7) 10Jbond: P:aptrepo: use new wmflib::discovery::pooled_sites [puppet] - 10https://gerrit.wikimedia.org/r/893522 (https://phabricator.wikimedia.org/T330849)
[18:44:12] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] profile: move phatality resources from module to profile [puppet] - 10https://gerrit.wikimedia.org/r/891392 (owner: 10Cwhite)
[18:44:49] <wikibugs>	 (03CR) 10Jbond: P:aptrepo: use new wmflib::discovery::pooled_sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893522 (https://phabricator.wikimedia.org/T330849) (owner: 10Jbond)
[18:45:33] <wikibugs>	 (03CR) 10ArielGlenn: "pcc output looks reasonable. https://puppet-compiler.wmflabs.org/output/893519/39903/" [puppet] - 10https://gerrit.wikimedia.org/r/893519 (https://phabricator.wikimedia.org/T330573) (owner: 10ArielGlenn)
[18:48:43] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for RMaung - https://phabricator.wikimedia.org/T330335 (10SRamkisson) Approved
[18:48:58] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns3001.wikimedia.org with OS bullseye
[18:49:09] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns3001.wikimedia.org with OS bullseye
[18:50:23] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:53:45] <icinga-wm>	 PROBLEM - BFD status on cr3-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:53:55] <icinga-wm>	 PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:54:53] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:55:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:55:51] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] define role owner for gerrit role [puppet] - 10https://gerrit.wikimedia.org/r/892587 (owner: 10Dzahn)
[18:56:49] <icinga-wm>	 PROBLEM - Host 2620:0:862:1:91:198:174:61 is DOWN: PING CRITICAL - Packet loss = 100%
[19:00:05] <jouncebot>	 jnuche and hashar: Dear deployers, time to do the Train log triage with CPT deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230301T1900).
[19:00:10] <wikibugs>	 (03CR) 10Dzahn: "ooh, I see! Well, I am glad I made an exception and did not try to switch this one without asking first. But let me get back to this soon." [puppet] - 10https://gerrit.wikimedia.org/r/893086 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn)
[19:00:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:05:06] <wikibugs>	 (03CR) 10Dzahn: "Thank you! I did not expect this and glad I asked. This makes sense to me now." [puppet] - 10https://gerrit.wikimedia.org/r/720078 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson)
[19:05:45] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:08:57] <wikibugs>	 (03PS2) 10Dzahn: devtools: change gerrit hostname to use wmcloud, not wmflabs [puppet] - 10https://gerrit.wikimedia.org/r/888808 (https://phabricator.wikimedia.org/T329444)
[19:09:28] <wikibugs>	 (03CR) 10Dzahn: "This is now waiting for T330312. Once the instance is running again we can merge this and check everything is ok." [puppet] - 10https://gerrit.wikimedia.org/r/888808 (https://phabricator.wikimedia.org/T329444) (owner: 10Dzahn)
[19:09:39] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns3001.wikimedia.org with reason: host reimage
[19:12:46] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns3001.wikimedia.org with reason: host reimage
[19:25:38] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:26:00] <icinga-wm>	 RECOVERY - BFD status on cr3-esams is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:26:12] <icinga-wm>	 RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:27:24] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:30:16] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:30:38] <icinga-wm>	 PROBLEM - BFD status on cr3-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:30:46] <icinga-wm>	 PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:31:54] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:32:16] <icinga-wm>	 RECOVERY - BFD status on cr3-esams is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:32:26] <icinga-wm>	 RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:36:55] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns3001.wikimedia.org with OS bullseye
[19:37:06] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns3001.wikimedia.org with OS bullseye completed: - dns3001 (**PASS**)   - Downtimed on Icinga/Al...
[19:39:46] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Install NVMe SSDs into  moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (10Jclark-ctr) @LSobanski  did you have the final Server figured out?
[19:47:48] <brett>	 !log re-adding dns3001 to next-hop routing via juniper - T321309
[19:47:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:55] <stashbot>	 T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309
[19:50:55] <wikibugs>	 (03PS1) 10BCornwall: Revert "ntp/esams: set to dns3002" [dns] - 10https://gerrit.wikimedia.org/r/893550
[19:51:00] <wikibugs>	 (03PS2) 10BCornwall: Revert "ntp/esams: set to dns3002" [dns] - 10https://gerrit.wikimedia.org/r/893550
[19:51:36] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] Revert "ntp/esams: set to dns3002" [dns] - 10https://gerrit.wikimedia.org/r/893550 (owner: 10BCornwall)
[19:52:48] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] Revert "ntp/esams: set to dns3002" [dns] - 10https://gerrit.wikimedia.org/r/893550 (owner: 10BCornwall)
[19:54:02] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[20:03:33] <wikibugs>	 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T330930 (10phaultfinder)
[20:33:35] <wikibugs>	 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T330930 (10phaultfinder)
[20:40:16] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "thanks! looks good to me. https://puppet-compiler.wmflabs.org/output/887738/39904/contint1002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/887738 (https://phabricator.wikimedia.org/T329056) (owner: 10Hashar)
[20:41:58] <wikibugs>	 (03PS6) 10Dzahn: contint: regroup common firewalling rules [puppet] - 10https://gerrit.wikimedia.org/r/887738 (https://phabricator.wikimedia.org/T329056) (owner: 10Hashar)
[20:43:12] <zabe>	 !log move rev_comment_id migration screens from mwmaint1002 to mwmaint2002 # T275246
[20:43:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:18] <stashbot>	 T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246
[20:44:18] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/output/887738/39905/contint1002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/887738 (https://phabricator.wikimedia.org/T329056) (owner: 10Hashar)
[20:51:26] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "confirmed this was a noop on contint1002 and contint2002" [puppet] - 10https://gerrit.wikimedia.org/r/887738 (https://phabricator.wikimedia.org/T329056) (owner: 10Hashar)
[20:51:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail, 10Security, 10WMF-NDA: Re-establish DMARC reporting analysis - https://phabricator.wikimedia.org/T330944 (10jhathaway)
[20:52:34] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "seems like there is still discussion about this on a mailing list" [puppet] - 10https://gerrit.wikimedia.org/r/699493 (https://phabricator.wikimedia.org/T228759) (owner: 10Aklapper)
[20:53:19] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail, 10Security, 10WMF-NDA: Re-establish DMARC reporting analysis - https://phabricator.wikimedia.org/T330944 (10jhathaway) Quote:{F36887400}
[20:54:32] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail, 10Security, 10WMF-NDA: Re-establish DMARC reporting analysis - https://phabricator.wikimedia.org/T330944 (10jhathaway)
[20:54:36] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail, 10Security, 10WMF-NDA: Re-establish DMARC reporting analysis - https://phabricator.wikimedia.org/T330944 (10jhathaway) Quote:{F36887400}
[20:54:42] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail, 10Security, 10WMF-NDA: Re-establish DMARC reporting analysis - https://phabricator.wikimedia.org/T330944 (10taavi)
[20:55:24] <wikibugs>	 (03CR) 10Dzahn: "I am aware there might be more discussion waiting for how and where this should be hosted.. but on the other hand.. making this specific D" [dns] - 10https://gerrit.wikimedia.org/r/815376 (https://phabricator.wikimedia.org/T313355) (owner: 10CDanis)
[20:55:50] <icinga-wm>	 PROBLEM - Host dns2002 is DOWN: PING CRITICAL - Packet loss = 100%
[20:57:58] <wikibugs>	 (03CR) 10Dzahn: "let's add John for his opinion on this" [puppet] - 10https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: 10Legoktm)
[20:58:24] <icinga-wm>	 RECOVERY - Host dns2002 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms
[20:58:38] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:59:14] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:59:14] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:59:38] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230301T2100).
[21:00:05] <jouncebot>	 Superpes: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:14] * TheresNoTime can deploy
[21:00:34] <Superpes>	 Uh I completely forgot about it lol 
[21:00:38] <Superpes>	 Thanks TheresNoTime :P
[21:01:00] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:01:00] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:01:14] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893089 (https://phabricator.wikimedia.org/T329047) (owner: 10Superpes15)
[21:01:24] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 106, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:02:03] <wikibugs>	 (03Merged) 10jenkins-bot: [trwiki] Reverting logo change for Vector 2022 and Vector legacy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893089 (https://phabricator.wikimedia.org/T329047) (owner: 10Superpes15)
[21:02:08] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 181, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:02:21] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns2002.wikimedia.org with OS bullseye
[21:02:26] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:893089|[trwiki] Reverting logo change for Vector 2022 and Vector legacy (T329047)]]
[21:02:30] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns2002.wikimedia.org with OS bullseye
[21:02:32] <stashbot>	 T329047: Temporary logo change for trwiki - https://phabricator.wikimedia.org/T329047
[21:04:13] <logmsgbot>	 !log samtar@deploy2002 superpes and samtar: Backport for [[gerrit:893089|[trwiki] Reverting logo change for Vector 2022 and Vector legacy (T329047)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[21:04:25] <TheresNoTime>	 Superpes: can you test? :)
[21:04:29] <Superpes>	 Looking :)
[21:05:34] <icinga-wm>	 PROBLEM - Host 2620:0:860:4:208:80:153:111 is DOWN: CRITICAL - Destination Unreachable (2620:0:860:4:208:80:153:111)
[21:05:37] <Superpes>	 TheresNoTime It works on both Vector 2022 and Vector legacy :D
[21:05:44] <TheresNoTime>	 syncing :)
[21:06:18] <Superpes>	 Thanks :)
[21:06:42] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:07:28] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:08:06] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:08:06] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:08:24] <RhinosF1>	 I assume all the alerts are brett with dns2002
[21:08:41] <brett>	 oh balls, forgot to mention, yes
[21:09:26] <icinga-wm>	 PROBLEM - Recursive DNS on 208.80.153.111 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[21:09:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:09:49] <RhinosF1>	 brett: blame DNS. DNS can be blamed for everything
[21:10:15] <RhinosF1>	 It shouldn’t be so noisy
[21:10:42] <TheresNoTime>	 https://www.irccloud.com/pastebin/jnZofe81/
[21:11:00] <RhinosF1>	 Heh
[21:11:06] <RhinosF1>	 I might have that framed TheresNoTime
[21:11:34] <TheresNoTime>	 :D
[21:11:56] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:893089|[trwiki] Reverting logo change for Vector 2022 and Vector legacy (T329047)]] (duration: 09m 30s)
[21:12:02] <stashbot>	 T329047: Temporary logo change for trwiki - https://phabricator.wikimedia.org/T329047
[21:12:11] <TheresNoTime>	 Superpes: live, can you confirm?
[21:12:48] <Superpes>	 Yep confirm! Many thanks TheresNoTime :D
[21:13:12] <TheresNoTime>	 o7
[21:14:32] <icinga-wm>	 RECOVERY - Host 2620:0:860:4:208:80:153:111 is UP: PING OK - Packet loss = 0%, RTA = 33.39 ms
[21:14:42] * TheresNoTime will be around for another 15 minutes or so if there are any other patches needing deployment
[21:16:19] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns2002.wikimedia.org with reason: host reimage
[21:18:08] <icinga-wm>	 PROBLEM - Recursive DNS on 2620:0:860:4:208:80:153:111 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[21:18:48] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns2002.wikimedia.org with reason: host reimage
[21:19:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job pdnsrec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:23:53] <TheresNoTime>	 !log closing UTC late backport window
[21:23:53] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/889248/39906/" [puppet] - 10https://gerrit.wikimedia.org/r/889248 (https://phabricator.wikimedia.org/T324149) (owner: 10Brennen Bearnes)
[21:23:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:27:36] <logmsgbot>	 !log mforns@deploy2002 Started deploy [analytics/refinery@d4d723a]: Regular analytics weekly train [analytics/refinery@d4d723a]
[21:28:58] <icinga-wm>	 RECOVERY - Recursive DNS on 2620:0:860:4:208:80:153:111 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[21:29:20] <icinga-wm>	 RECOVERY - Recursive DNS on 208.80.153.111 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[21:30:14] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 106, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:31:02] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 181, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:33:30] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:33:30] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:35:24] <wikibugs>	 (03PS1) 10Ladsgroup: Revert "Revert "mwscript: Switch to use run.php"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893552 (https://phabricator.wikimedia.org/T326800)
[21:35:40] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:36:46] <wikibugs>	 10SRE, 10Maps: Allow Wikimedia Maps usage on wikiness.pro - https://phabricator.wikimedia.org/T330881 (10Dzahn) Per the description on the front page, which I translated with Google Translate, the goal of this project is to "highlight the business sector,** large companies**", and "**very high level CEOs** bel...
[21:37:30] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 106, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:37:35] <wikibugs>	 (03PS1) 10Eevans: data-persistence: alert on elevated sessions store error rate (5xx) [alerts] - 10https://gerrit.wikimedia.org/r/893538 (https://phabricator.wikimedia.org/T327960)
[21:37:39] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns2002.wikimedia.org with OS bullseye
[21:37:50] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns2002.wikimedia.org with OS bullseye completed: - dns2002 (**PASS**)   - Downtimed on Icinga/Al...
[21:38:31] <logmsgbot>	 !log mforns@deploy2002 Finished deploy [analytics/refinery@d4d723a]: Regular analytics weekly train [analytics/refinery@d4d723a] (duration: 10m 55s)
[21:39:27] <logmsgbot>	 !log mforns@deploy2002 Started deploy [analytics/refinery@d4d723a] (thin): Regular analytics weekly train THIN [analytics/refinery@d4d723a]
[21:39:34] <logmsgbot>	 !log mforns@deploy2002 Finished deploy [analytics/refinery@d4d723a] (thin): Regular analytics weekly train THIN [analytics/refinery@d4d723a] (duration: 00m 07s)
[21:39:51] <logmsgbot>	 !log mforns@deploy2002 Started deploy [analytics/refinery@d4d723a] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d4d723a]
[21:40:49] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[21:41:14] <logmsgbot>	 !log mforns@deploy2002 Finished deploy [analytics/refinery@d4d723a] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d4d723a] (duration: 01m 22s)
[21:42:51] <wikibugs>	 (03PS1) 10BCornwall: ntp/codfw: set to dns2002 [dns] - 10https://gerrit.wikimedia.org/r/893539
[21:43:57] <wikibugs>	 (03PS2) 10BCornwall: ntp/codfw: set to dns2002 [dns] - 10https://gerrit.wikimedia.org/r/893539
[22:06:21] <wikibugs>	 (03PS2) 10Cwhite: toil: restart opensearch-dashboards every wednesday [puppet] - 10https://gerrit.wikimedia.org/r/891394 (https://phabricator.wikimedia.org/T327161)
[22:06:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] toil: restart opensearch-dashboards every wednesday [puppet] - 10https://gerrit.wikimedia.org/r/891394 (https://phabricator.wikimedia.org/T327161) (owner: 10Cwhite)
[22:09:05] <wikibugs>	 (03PS3) 10Cwhite: toil: restart opensearch-dashboards every wednesday [puppet] - 10https://gerrit.wikimedia.org/r/891394 (https://phabricator.wikimedia.org/T327161)
[22:09:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] toil: restart opensearch-dashboards every wednesday [puppet] - 10https://gerrit.wikimedia.org/r/891394 (https://phabricator.wikimedia.org/T327161) (owner: 10Cwhite)
[22:11:37] <wikibugs>	 (03PS4) 10Cwhite: toil: restart opensearch-dashboards every wednesday [puppet] - 10https://gerrit.wikimedia.org/r/891394 (https://phabricator.wikimedia.org/T327161)
[22:14:11] <wikibugs>	 (03CR) 10Cwhite: "PCC: https://puppet-compiler.wmflabs.org/output/891394/39907/" [puppet] - 10https://gerrit.wikimedia.org/r/891394 (https://phabricator.wikimedia.org/T327161) (owner: 10Cwhite)
[22:16:32] <icinga-wm>	 RECOVERY - Check systemd state on apt1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:25:36] <icinga-wm>	 PROBLEM - Check systemd state on apt1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-aptrepo-apt2001.wikimedia.org.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:26:16] <brett>	 Doing some firmware upgrades and then reimaging on dns1002
[22:37:30] <wikibugs>	 (03PS1) 10Nray: Revert "Add static "Cleopatra" page to facilitate synthetic testing of 885362" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893542 (https://phabricator.wikimedia.org/T326147)
[22:40:28] <icinga-wm>	 PROBLEM - Host dns1002 is DOWN: PING CRITICAL - Packet loss = 100%
[22:42:09] <logmsgbot>	 !log mforns@deploy2002 Started deploy [airflow-dags/analytics@51e92b1]: (no justification provided)
[22:42:14] <icinga-wm>	 RECOVERY - Host dns1002 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[22:42:31] <logmsgbot>	 !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@51e92b1]: (no justification provided) (duration: 00m 21s)
[22:42:39] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns1002.wikimedia.org with OS bullseye
[22:42:50] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns1002.wikimedia.org with OS bullseye
[22:43:04] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:43:08] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:43:16] <wikibugs>	 (03CR) 10Dzahn: "There is the systemd service "rsync-aptrepo-apt2001.wikimedia.org" on apt1001. And it fails because it tries to push from 1001 to 2001 but" [puppet] - 10https://gerrit.wikimedia.org/r/893409 (https://phabricator.wikimedia.org/T328907) (owner: 10Jbond)
[22:45:28] <logmsgbot>	 !log mforns@deploy2002 Started deploy [airflow-dags/analytics@1fb5c4a]: (no justification provided)
[22:45:36] <icinga-wm>	 PROBLEM - Host 2620:0:861:4:208:80:155:108 is DOWN: CRITICAL - Destination Unreachable (2620:0:861:4:208:80:155:108)
[22:45:52] <logmsgbot>	 !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@1fb5c4a]: (no justification provided) (duration: 00m 23s)
[22:49:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:50:48] <icinga-wm>	 PROBLEM - Recursive DNS on 208.80.155.108 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[22:51:00] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 (10Dzahn) After the switch of the apt servers we are getting alerting about bad systemd status on apt1001.   ` <+icinga-wm> PROBLEM - Check systemd state...
[22:51:58] <mutante>	 !log apt1001 - systemctl reset-failed T328907
[22:52:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:52:04] <stashbot>	 T328907:  Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907
[22:52:34] <icinga-wm>	 RECOVERY - Check systemd state on apt1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:54:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job pdnsrec in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:56:59] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns1002.wikimedia.org with reason: host reimage
[22:57:28] <icinga-wm>	 RECOVERY - Host 2620:0:861:4:208:80:155:108 is UP: PING OK - Packet loss = 0%, RTA = 0.16 ms
[22:58:04] <wikibugs>	 10SRE, 10Maps: Allow Wikimedia Maps usage on wikiness.pro - https://phabricator.wikimedia.org/T330881 (10Aklapper) 05Stalled→03Declined
[23:01:15] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns1002.wikimedia.org with reason: host reimage
[23:02:24] <icinga-wm>	 PROBLEM - Recursive DNS on 2620:0:861:4:208:80:155:108 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[23:10:46] <icinga-wm>	 RECOVERY - Recursive DNS on 208.80.155.108 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[23:11:26] <icinga-wm>	 RECOVERY - Recursive DNS on 2620:0:861:4:208:80:155:108 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[23:15:50] <wikibugs>	 (03PS1) 10Andrew Bogott: OpenStack: collapse 'user' OpenStack role into 'reader' role [puppet] - 10https://gerrit.wikimedia.org/r/893545 (https://phabricator.wikimedia.org/T330759)
[23:21:10] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:21:14] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:23:09] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns1002.wikimedia.org with OS bullseye
[23:23:20] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns1002.wikimedia.org with OS bullseye completed: - dns1002 (**PASS**)   - Downtimed on Icinga/Al...
[23:26:06] <wikibugs>	 (03PS1) 10BCornwall: ntp/eqiad: set to dns1002 [dns] - 10https://gerrit.wikimedia.org/r/893566
[23:27:22] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)