[00:01:10] zabe: hm.. how come this doesn't have a diffConfig diff? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/649609/ [00:04:38] Krinkle: my guess is, because we are removing deploymentwiki from all-labs and buildConfigCache.php goes through that dblist [00:05:17] Hm.. but then we check out the parent commit and run it again, where it should create the tmp json files for that wiki, right? [00:05:46] I'd expect the diff to be that the file was effectively removed [00:06:03] !log zabe@deploy2002 Finished scap: T198673 (duration: 07m 25s) [00:06:10] T198673: Remove deployment.wikimedia.beta.wmflabs.org wiki (deploymentwiki) - https://phabricator.wikimedia.org/T198673 [00:06:21] (03CR) 10Dzahn: "Hi, so you have switched commons-query.wikimedia.org away from miscweb* but never removed the puppetization there. This led to confusion " [puppet] - 10https://gerrit.wikimedia.org/r/720078 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [00:06:47] Ah, I see. It will detect a file being added but not removed, because you can't "git add" the absence of a file. [00:07:32] the way it works is that it stages the "after" state, and then diffs against that. [00:08:36] so from the diff perspective the "new" file in the before state is untracked [00:09:01] and ignoring untracked files is important as we otherwise would also get all other gitignored stuff in the diff [00:09:51] (03PS1) 10Superpes15: Revert "Change the trwiki logo with a temporary one (old vector)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892976 (https://phabricator.wikimedia.org/T329047) [00:09:59] (03CR) 10CI reject: [V: 04-1] Revert "Change the trwiki logo with a temporary one (old vector)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892976 (https://phabricator.wikimedia.org/T329047) (owner: 10Superpes15) [00:10:09] To simulate it locally: touch x && git add x && rm x && git diff; that shows 'x' being removed. 
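For anyone following along, the staged-versus-untracked behaviour discussed here can be reproduced in a scratch repository. This is only a minimal sketch of the two one-liners from the conversation, not the diffConfig script itself; the temporary directory, the second file name `y`, and the variable names are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: plain `git diff` compares the working tree against the index,
# so it reports a staged file that went missing but says nothing about
# a file that was never staged.

repo=$(mktemp -d)   # throwaway repository (hypothetical location)
cd "$repo"
git init -q .

# Case 1: stage a file, then delete it from the working tree.
touch x
git add x
rm x
staged_then_removed=$(git diff --name-status)

# Case 2: create a file without staging it (it stays untracked).
touch y
untracked_only=$(git diff --name-status -- y)

echo "case 1: ${staged_then_removed:-<nothing reported>}"
echo "case 2: ${untracked_only:-<nothing reported>}"
```

Case 1 should list `x` with status `D`, while case 2 should report nothing, which matches the explanation that an untracked file in the "before" state never appears in the diff.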
[00:10:10] but [00:10:19] ah, good catch [00:10:33] touch x && git diff; won't show that 'x' is added [00:12:25] (03PS1) 10Zabe: Update interwiki cache for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892624 [00:13:28] (03PS2) 10Zabe: Update interwiki cache for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892624 [00:13:40] (03CR) 10Zabe: [C: 03+2] Update interwiki cache for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892624 (owner: 10Zabe) [00:14:28] (03Merged) 10jenkins-bot: Update interwiki cache for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892624 (owner: 10Zabe) [00:16:16] (03PS1) 10Dzahn: remove commons-query virtual host from httpd on miscweb [puppet] - 10https://gerrit.wikimedia.org/r/893086 (https://phabricator.wikimedia.org/T330090) [00:18:19] (03PS2) 10Dzahn: remove commons-query virtual host from httpd on miscweb [puppet] - 10https://gerrit.wikimedia.org/r/893086 (https://phabricator.wikimedia.org/T330090) [00:18:34] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/output/893086/39875/miscweb1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/893086 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn) [00:22:11] (03PS1) 10Dzahn: httpbb/miscweb: add missing/new virtual hosts to tests [puppet] - 10https://gerrit.wikimedia.org/r/893087 (https://phabricator.wikimedia.org/T330090) [00:23:04] (03CR) 10Dzahn: "[deploy1002:~] $ httpbb --hosts miscweb2002.codfw.wmnet ./test_miscweb.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/893087 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn) [00:23:32] (03CR) 10Dzahn: [C: 03+2] httpbb/miscweb: add missing/new virtual hosts to tests [puppet] - 10https://gerrit.wikimedia.org/r/893087 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn) [00:26:23] (03CR) 10Cwhite: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/892965 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [00:27:16] (03PS1) 10Krinkle: build: Change diffConfig to use git-stash instead of git-add [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893088 [00:29:39] (03CR) 10Cwhite: [C: 03+1] prometheus: add pint support [puppet] - 10https://gerrit.wikimedia.org/r/892986 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [00:29:58] (03CR) 10Cwhite: [C: 03+1] prometheus: add pint source to ops [puppet] - 10https://gerrit.wikimedia.org/r/892987 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [00:30:53] (03CR) 10Cwhite: [C: 03+1] "LGTM" [debs/pint] - 10https://gerrit.wikimedia.org/r/892992 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [00:32:45] (03CR) 10Dzahn: [C: 03+2] "research.wikimedia.org - https://phabricator.wikimedia.org/T107389" [puppet] - 10https://gerrit.wikimedia.org/r/893087 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn) [00:34:07] (03CR) 10Dzahn: "It was hard to find this because there was no ticket about the creation of this. 
I added missing tests in https://gerrit.wikimedia.org/r/c" [puppet] - 10https://gerrit.wikimedia.org/r/724416 (owner: 10Muehlenhoff) [00:34:48] (03PS1) 10Superpes15: [trwiki] Reverting logo change for Vector 2022 and Vector legacy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893089 (https://phabricator.wikimedia.org/T329047) [00:59:02] (03PS1) 10Dzahn: switch (www).wikiworkshop.org from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/893090 (https://phabricator.wikimedia.org/T330090) [00:59:53] (03CR) 10Dzahn: [C: 03+2] switch (www).wikiworkshop.org from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/893090 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn) [01:16:14] (03CR) 10Krinkle: "Test Plan:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893088 (owner: 10Krinkle) [01:16:56] (03CR) 10Krinkle: "cc-ing Amir and Ahmon for awareness that this is/was a thing :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893088 (owner: 10Krinkle) [01:33:01] (03PS3) 10Krinkle: Remove legacy wgRC2UDPPrefix overrides for private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820245 [01:41:11] (03CR) 10Krinkle: "Verified as follows:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820245 (owner: 10Krinkle) [01:45:31] (03PS2) 10Krinkle: noc: Clarify menu labels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891734 [01:45:35] (03CR) 10Krinkle: [C: 03+2] noc: Clarify menu labels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891734 (owner: 10Krinkle) [01:46:20] (03Merged) 10jenkins-bot: noc: Clarify menu labels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891734 (owner: 10Krinkle) [02:06:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:20] 10SRE, 
10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Dzahn) [02:13:13] (03CR) 10Dzahn: [C: 03+2] "Ah, this is nice, thanks Kosta" [puppet] - 10https://gerrit.wikimedia.org/r/893001 (owner: 10Kosta Harlan) [02:21:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:52:03] (03PS2) 10Andrew Bogott: OpenStack: rename 'user' role to 'member' [puppet] - 10https://gerrit.wikimedia.org/r/893036 (https://phabricator.wikimedia.org/T330759) [02:52:05] (03PS1) 10Andrew Bogott: cinder policy.yaml: redefine xena_system_admin_or_project_member rule [puppet] - 10https://gerrit.wikimedia.org/r/893097 (https://phabricator.wikimedia.org/T330759) [02:57:39] (03CR) 10Andrew Bogott: [C: 03+2] cinder policy.yaml: redefine xena_system_admin_or_project_member rule [puppet] - 10https://gerrit.wikimedia.org/r/893097 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [04:12:10] (03CR) 10Ebernhardson: "This should still be using miscweb, just not quite as directly. The requests go to a wcqs instance first, then the nginx there forwards ap" [puppet] - 10https://gerrit.wikimedia.org/r/720078 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [04:24:13] (03CR) 10Ebernhardson: [C: 04-1] "These are used, but not in the typical manner. 
The requests initially land at the wcqs instances directly so that it can put an oauth flow" [puppet] - 10https://gerrit.wikimedia.org/r/893086 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn) [04:29:49] 10Puppet, 10Infrastructure-Foundations, 10Instrument-ClientError, 10patch-welcome: Prevent Firefox and Chrome extensions from being able to trigger alerts - https://phabricator.wikimedia.org/T330680 (10colewhite) There's a few ways we can cut these out. Maybe first try something simple like: ` "must_not":... [04:30:09] 10Puppet, 10Infrastructure-Foundations, 10Instrument-ClientError, 10Observability-Logging, 10patch-welcome: Prevent Firefox and Chrome extensions from being able to trigger alerts - https://phabricator.wikimedia.org/T330680 (10colewhite) [04:58:01] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:58:25] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:58:35] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:01:54] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:02:08] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:11:01] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 216.1k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig 
[05:26:01] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 203.6k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [05:30:24] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:37:21] !log Stop mysql on codfw sanitarium host db2095 (s2, s7, s6, s4) to clone db2187 T326596 [05:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:37:25] T326596: Productionize db218[567] - https://phabricator.wikimedia.org/T326596 [06:03:24] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:03:36] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:04:16] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:04:18] PROBLEM - BFD status on cr1-drmrs is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:11:01] (03PS1) 10Marostegui: install_server: Do not reimage db2185 [puppet] - 10https://gerrit.wikimedia.org/r/893104 (https://phabricator.wikimedia.org/T326596) [06:13:49] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db2185 [puppet] - 10https://gerrit.wikimedia.org/r/893104 (https://phabricator.wikimedia.org/T326596) (owner: 10Marostegui) [06:14:56] !log Stop MySQL on db2094 T330828 [06:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:15:02] T330828: 
decommission db2094.codfw.wmnet - https://phabricator.wikimedia.org/T330828 [06:16:54] (03PS1) 10Marostegui: db2093: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/893105 (https://phabricator.wikimedia.org/T330827) [06:17:37] (03CR) 10Marostegui: [C: 03+2] db2093: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/893105 (https://phabricator.wikimedia.org/T330827) (owner: 10Marostegui) [06:26:58] (03PS1) 10Marostegui: control-mariadb-client-11.0-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/893106 (https://phabricator.wikimedia.org/T330643) [06:27:38] (03PS2) 10Marostegui: control-mariadb-client-11.0-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/893106 (https://phabricator.wikimedia.org/T330643) [06:28:25] (03CR) 10Marostegui: [C: 03+2] control-mariadb-client-11.0-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/893106 (https://phabricator.wikimedia.org/T330643) (owner: 10Marostegui) [06:28:55] (03Merged) 10jenkins-bot: control-mariadb-client-11.0-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/893106 (https://phabricator.wikimedia.org/T330643) (owner: 10Marostegui) [06:30:18] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:34:05] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:34:16] 10SRE, 10Service-deployment-requests: Kaynak - https://phabricator.wikimedia.org/T330830 (10Metin6201) [06:37:40] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:38:32] RECOVERY - BFD status on cr1-drmrs 
is OK: OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:41:38] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 23 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:42:28] (03PS1) 10ArielGlenn: make dumpsdata1004 the xmlfallback host, with dumpsdata1001 as xml spare [puppet] - 10https://gerrit.wikimedia.org/r/893265 (https://phabricator.wikimedia.org/T330573) [06:45:08] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:45:28] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:46:10] PROBLEM - BFD status on cr1-drmrs is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:55:41] (03CR) 10ArielGlenn: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/893265/39876/ looks as expected. 
We have the latest data rsynced from dumpsdata1003 and 1002, s" [puppet] - 10https://gerrit.wikimedia.org/r/893265 (https://phabricator.wikimedia.org/T330573) (owner: 10ArielGlenn) [06:57:46] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230301T0700) [07:03:01] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 221.7k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [07:03:48] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:05:06] (03Abandoned) 10ArielGlenn: delay start of the March xml dump rn unti the evening [puppet] - 10https://gerrit.wikimedia.org/r/893055 (https://phabricator.wikimedia.org/T330573) (owner: 10ArielGlenn) [07:12:34] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:12:44] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 23 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:13:00] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:13:25] (03PS2) 10ArielGlenn: Add dumpsdata1006 and dumpsdata1007 as spare dumps hosts and rsync pullers [puppet] - 10https://gerrit.wikimedia.org/r/893031 
(https://phabricator.wikimedia.org/T330573) [07:13:37] (03CR) 10CI reject: [V: 04-1] Add dumpsdata1006 and dumpsdata1007 as spare dumps hosts and rsync pullers [puppet] - 10https://gerrit.wikimedia.org/r/893031 (https://phabricator.wikimedia.org/T330573) (owner: 10ArielGlenn) [07:13:48] RECOVERY - BFD status on cr1-drmrs is OK: OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:16:26] (03PS3) 10ArielGlenn: Add dumpsdata1006 and dumpsdata1007 as spare dumps hosts and rsync pullers [puppet] - 10https://gerrit.wikimedia.org/r/893031 (https://phabricator.wikimedia.org/T330573) [07:23:06] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. 
returned the unexpected status 500 (expecting: 200): /v2/suggest/title/{title}/{from}/{to} (Suggest a target title for the given source title and language pairs) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [07:24:38] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [07:26:54] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:27:40] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:27:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ArielGlenn) Wonderful, we have claimed them already :-) Thank you! [07:29:56] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received: /v2/suggest/title/{title}/{from}/{to} (Suggest a target title for the given source title and language pairs) is CRITICAL: Test Suggest a target title for the given source title and language pairs returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [07:31:38] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [07:33:01] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 200.6k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [07:44:23] (03CR) 10Elukey: [C: 03+1] hive: Fix max metaspace size of hiveserver2 to 512m [puppet] - 10https://gerrit.wikimedia.org/r/893029 (https://phabricator.wikimedia.org/T303168) (owner: 10Nicolas Fraison) [07:47:38] (03PS1) 10Marostegui: mariadb: Productionize db2187 [puppet] - 10https://gerrit.wikimedia.org/r/893390 (https://phabricator.wikimedia.org/T326596) [07:48:02] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2187 [puppet] - 10https://gerrit.wikimedia.org/r/893390 (https://phabricator.wikimedia.org/T326596) (owner: 10Marostegui) [07:49:37] (03PS1) 10Marostegui: site.pp: Remove db2187 as insetup [puppet] - 10https://gerrit.wikimedia.org/r/893391 (https://phabricator.wikimedia.org/T326596) [07:50:06] (03CR) 10Elukey: [C: 03+2] ores: change monitoring for the service [puppet] - 10https://gerrit.wikimedia.org/r/893008 (owner: 10Elukey) [07:50:25] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove db2187 as insetup [puppet] - 10https://gerrit.wikimedia.org/r/893391 (https://phabricator.wikimedia.org/T326596) (owner: 10Marostegui) [07:51:22] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:51:40] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:55:24] (03PS1) 10Elukey: role::etcd::v3::ml_etcd: prepare eqiad cluster for reimage/boostrap [puppet] - 10https://gerrit.wikimedia.org/r/893392 (https://phabricator.wikimedia.org/T330758) [07:57:01] (03CR) 10Nicolas Fraison: [V: 03+1 C: 03+2] hive: Fix max metaspace size of hiveserver2 to 512m [puppet] - 10https://gerrit.wikimedia.org/r/893029 (https://phabricator.wikimedia.org/T303168) (owner: 10Nicolas Fraison) [07:57:16] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39877/console" [puppet] - 10https://gerrit.wikimedia.org/r/893392 (https://phabricator.wikimedia.org/T330758) (owner: 10Elukey) [07:58:58] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::etcd::v3::ml_etcd: prepare eqiad cluster for reimage/boostrap [puppet] - 10https://gerrit.wikimedia.org/r/893392 (https://phabricator.wikimedia.org/T330758) (owner: 10Elukey) [08:00:05] Amir1 and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230301T0800). [08:00:05] aharoni: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[08:02:04] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49708 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:02:22] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8646 bytes in 0.326 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:05:46] PROBLEM - Check systemd state on ms-be2067 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:10:42] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 13 hosts with reason: T330758 [08:10:47] T330758: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 [08:10:52] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 13 hosts with reason: T330758 [08:11:31] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-etcd1003.eqiad.wmnet with OS bullseye [08:11:45] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-etcd1002.eqiad.wmnet with OS bullseye [08:11:53] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-etcd1001.eqiad.wmnet with OS bullseye [08:14:56] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2184.codfw.wmnet with reason: 10.6 recovery [08:15:09] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2184.codfw.wmnet with reason: 10.6 recovery [08:15:26] PROBLEM - Disk space on ms-be2067 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdz1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2067&var-datasource=codfw+prometheus/ops [08:16:19] (03PS1) 10Marostegui: check_private_data_report: Remove db2094 [puppet] - 
10https://gerrit.wikimedia.org/r/893395 (https://phabricator.wikimedia.org/T330828) [08:16:49] (03CR) 10Muehlenhoff: [C: 03+2] Install pbuilder hook for ICU67 component [puppet] - 10https://gerrit.wikimedia.org/r/893014 (https://phabricator.wikimedia.org/T329491) (owner: 10Muehlenhoff) [08:16:54] (03CR) 10Marostegui: [C: 03+2] check_private_data_report: Remove db2094 [puppet] - 10https://gerrit.wikimedia.org/r/893395 (https://phabricator.wikimedia.org/T330828) (owner: 10Marostegui) [08:16:57] (03PS1) 10Ayounsi: Add Jameel to ops and users in data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/893396 [08:19:44] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, the addition to the ops group needs on-patch or on-tasj approval by Joanna, though." [puppet] - 10https://gerrit.wikimedia.org/r/893396 (owner: 10Ayounsi) [08:21:25] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-etcd1002.eqiad.wmnet with reason: host reimage [08:21:28] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-etcd1003.eqiad.wmnet with reason: host reimage [08:21:29] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-etcd1001.eqiad.wmnet with reason: host reimage [08:24:00] PROBLEM - Check systemd state on cp5023 is CRITICAL: CRITICAL - degraded: The following units failed: varnishmtail@internal.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:07] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-etcd1002.eqiad.wmnet with reason: host reimage [08:24:48] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:25:20] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is 
CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [08:25:22] RECOVERY - Check systemd state on cp5023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:07] !log stopping db2184 for testing mariadb 10.6 recovery workflow T319383 [08:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:12] T319383: Mydumper incompatibility with MariaDB 10.6 (was: Logical recoveries (myloader) to db2098:s7 are failing with "Lock wait timeout exceeded; try restarting transaction") - https://phabricator.wikimedia.org/T319383 [08:26:22] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 482 days) https://wikitech.wikimedia.org/wiki/Logs [08:26:33] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-etcd1003.eqiad.wmnet with reason: host reimage [08:27:48] urbanecm, Amir1, sorry, couldn't connect earlier. Is it still possible to do the backport of those namespace patches? [08:28:52] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-etcd1001.eqiad.wmnet with reason: host reimage [08:29:48] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:31:29] (03PS1) 10Muehlenhoff: Remove access for echetty [puppet] - 10https://gerrit.wikimedia.org/r/893397 [08:32:43] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T330218 (10ayounsi) If there is an existing open task it will append to it. 
Here it was a coincidence that it stopped seeing the VCP issue at the same time as started to see the db2099 issue. [08:33:09] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for echetty [puppet] - 10https://gerrit.wikimedia.org/r/893397 (owner: 10Muehlenhoff) [08:34:19] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Emil Chetty out of all services on: 1110 hosts [08:35:07] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Emil Chetty out of all services on: 1110 hosts [08:36:19] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Emil Chetty out of all services on: 918 hosts [08:37:43] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Emil Chetty out of all services on: 918 hosts [08:40:18] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ml-etcd1001.eqiad.wmnet with OS bullseye [08:41:14] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference_30443: Servers ml-serve1004.eqiad.wmnet, ml-serve1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:41:36] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ml-etcd1002.eqiad.wmnet with OS bullseye [08:41:36] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference_30443: Servers ml-serve1004.eqiad.wmnet, ml-serve1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:41:46] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ml-etcd1003.eqiad.wmnet with OS bullseye [08:42:49] !log root@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: Upgrade to k8s 1.23 [08:43:01] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 240.1k messages - TODO - 
https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [08:45:01] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::ml_k8s::{master,worker}: update ml-serve-eqiad to k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/892995 (https://phabricator.wikimedia.org/T330758) (owner: 10Elukey) [08:45:24] !log root@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-serve-ctrl1001.eqiad.wmnet with OS bullseye [08:51:04] !log upgrade mw/eqiad to PHP 1:7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u2 T330270 [08:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:55] (03PS8) 10Vgutierrez: acme_chief: support several passive hosts [puppet] - 10https://gerrit.wikimedia.org/r/888652 (https://phabricator.wikimedia.org/T321309) [08:53:52] (03CR) 10Vgutierrez: acme_chief: support several passive hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888652 (https://phabricator.wikimedia.org/T321309) (owner: 10Vgutierrez) [08:56:29] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve-ctrl1001.eqiad.wmnet with reason: host reimage [08:57:40] (03CR) 10Hashar: "CI fails cause the code relies on the `semver-cli` command ( https://github.com/davidrjonas/semver-cli ) which is introduced by https://ge" [deployment-charts] - 10https://gerrit.wikimedia.org/r/893075 (https://phabricator.wikimedia.org/T320554) (owner: 10JHathaway) [08:58:53] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve-ctrl1001.eqiad.wmnet with reason: host reimage [09:00:05] jnuche and hashar: #bothumor I � Unicode. All rise for MediaWiki train - Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230301T0900). 
[09:01:06] o/ [09:03:08] (03PS1) 10Jcrespo: dbbackups: Implement myloader override in all hosts where it is installed [puppet] - 10https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) [09:05:08] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] "Thank you for the reviews!" [debs/pint] - 10https://gerrit.wikimedia.org/r/892992 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [09:05:38] (03CR) 10CI reject: [V: 04-1] dbbackups: Implement myloader override in all hosts where it is installed [puppet] - 10https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: 10Jcrespo) [09:06:09] hi, will deploy in 5 mins [09:07:22] 10SRE, 10Infrastructure-Foundations, 10serviceops-collab, 10CAS-SSO, and 3 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10jbond) > @jbond could you have a look at this anytime soon? @demon from my side the change is very minimal, just let me know if you wo... [09:13:24] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893401 (https://phabricator.wikimedia.org/T325588) [09:13:27] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893401 (https://phabricator.wikimedia.org/T325588) (owner: 10TrainBranchBot) [09:14:07] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893401 (https://phabricator.wikimedia.org/T325588) (owner: 10TrainBranchBot) [09:15:04] !log root@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ml-serve-ctrl1001.eqiad.wmnet with OS bullseye [09:15:28] !log root@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-serve-ctrl1002.eqiad.wmnet with OS bullseye [09:16:57] (03CR) 10Jcrespo: "Moritz- question- is there a way to mark a file to not check its license? 
I don't want to have a license on the production file itself, bu" [puppet] - 10https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: 10Jcrespo) [09:22:01] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39878/console" [puppet] - 10https://gerrit.wikimedia.org/r/892965 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [09:23:01] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 208.9k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [09:23:06] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.25 refs T325588 [09:23:11] T325588: 1.40.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T325588 [09:24:56] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: refactor blackbox configuration [puppet] - 10https://gerrit.wikimedia.org/r/892965 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [09:26:08] RECOVERY - Check systemd state on dumpsdata1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:26:16] (03PS1) 10Muehlenhoff: Add SPDX exception for myloader_defaults_override.cnf [puppet] - 10https://gerrit.wikimedia.org/r/893405 [09:26:34] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve-ctrl1002.eqiad.wmnet with reason: host reimage [09:27:03] (03CR) 10Muehlenhoff: dbbackups: Implement myloader override in all hosts where it is installed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: 10Jcrespo) [09:30:55] !log 
jnuche@deploy2002 Synchronized php: group1 wikis to 1.40.0-wmf.25 refs T325588 (duration: 07m 48s) [09:31:00] T325588: 1.40.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T325588 [09:31:08] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve-ctrl1002.eqiad.wmnet with reason: host reimage [09:31:34] !log oblivian@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=appservers-ro [09:31:54] (03CR) 10Jcrespo: dbbackups: Implement myloader override in all hosts where it is installed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: 10Jcrespo) [09:33:08] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/892587 (owner: 10Dzahn) [09:33:39] (03CR) 10Jcrespo: dbbackups: Implement myloader override in all hosts where it is installed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: 10Jcrespo) [09:34:10] PROBLEM - Check systemd state on mw1428 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:35:42] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39879/console" [puppet] - 10https://gerrit.wikimedia.org/r/892987 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [09:37:44] (03CR) 10Muehlenhoff: dbbackups: Implement myloader override in all hosts where it is installed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: 10Jcrespo) [09:38:53] !log installing tiff security updates [09:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:35] !log oblivian@cumin1001 conftool action : set/pooled=false; selector: 
dnsdisc=appservers-ro,name=eqiad [09:40:48] (03PS2) 10Jcrespo: dbbackups: Implement myloader override in all hosts where it is installed [puppet] - 10https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) [09:41:10] (03CR) 10CI reject: [V: 04-1] dbbackups: Implement myloader override in all hosts where it is installed [puppet] - 10https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: 10Jcrespo) [09:41:49] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 8309 [09:42:44] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: support several passive hosts [puppet] - 10https://gerrit.wikimedia.org/r/888652 (https://phabricator.wikimedia.org/T321309) (owner: 10Vgutierrez) [09:42:48] (03CR) 10Jcrespo: dbbackups: Implement myloader override in all hosts where it is installed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: 10Jcrespo) [09:43:43] (03PS3) 10Jcrespo: dbbackups: Implement myloader override in all hosts where it is installed [puppet] - 10https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) [09:44:04] (03CR) 10CI reject: [V: 04-1] dbbackups: Implement myloader override in all hosts where it is installed [puppet] - 10https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: 10Jcrespo) [09:44:50] (03PS2) 10Filippo Giunchedi: prometheus: add pint support [puppet] - 10https://gerrit.wikimedia.org/r/892986 (https://phabricator.wikimedia.org/T309182) [09:44:52] (03PS2) 10Filippo Giunchedi: prometheus: add pint source to ops [puppet] - 10https://gerrit.wikimedia.org/r/892987 (https://phabricator.wikimedia.org/T309182) [09:46:03] (03CR) 10Jcrespo: "This was not an arbitrary request- it was important for me that this workaround was as simple in production as possible (plus myloader tre" [puppet] - 10https://gerrit.wikimedia.org/r/893400 
(https://phabricator.wikimedia.org/T319383) (owner: 10Jcrespo) [09:46:17] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add pint support [puppet] - 10https://gerrit.wikimedia.org/r/892986 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [09:46:26] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:47:21] (03PS4) 10Jcrespo: dbbackups: Implement myloader override in all hosts where it is installed [puppet] - 10https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) [09:47:24] !log root@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ml-serve-ctrl1002.eqiad.wmnet with OS bullseye [09:48:13] (03Abandoned) 10Jcrespo: Add SPDX exception for myloader_defaults_override.cnf [puppet] - 10https://gerrit.wikimedia.org/r/893405 (owner: 10Muehlenhoff) [09:51:36] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 8309 [09:52:07] (03PS34) 10Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [09:52:46] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/output/893400/39880/dbprov1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: 10Jcrespo) [09:54:50] (03PS3) 10Filippo Giunchedi: prometheus: add pint source to ops [puppet] - 10https://gerrit.wikimedia.org/r/892987 (https://phabricator.wikimedia.org/T309182) [09:54:52] (03PS1) 10Filippo Giunchedi: prometheus: update pint listen port [puppet] - 10https://gerrit.wikimedia.org/r/893406 (https://phabricator.wikimedia.org/T309182) [09:54:54] (03PS1) 10Filippo Giunchedi: prometheus: add pint source for k8s [puppet] - 
10https://gerrit.wikimedia.org/r/893407 (https://phabricator.wikimedia.org/T309182) [09:55:26] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:57:17] !log root@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1001.eqiad.wmnet with OS bullseye [09:57:56] !log Stop db1117:3325 and db1176 T329478 [09:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:00] T329478: Move db1176 to m5 - https://phabricator.wikimedia.org/T329478 [09:58:49] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1002.eqiad.wmnet with OS bullseye [09:59:07] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: update pint listen port [puppet] - 10https://gerrit.wikimedia.org/r/893406 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [09:59:13] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add pint source to ops [puppet] - 10https://gerrit.wikimedia.org/r/892987 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [09:59:13] There will be haproxy irc alerts for the above operation on db1117 [09:59:18] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1003.eqiad.wmnet with OS bullseye [09:59:51] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1004.eqiad.wmnet with OS bullseye [10:00:32] (03CR) 10Btullis: Add a spark-operator chart and helmfile configuration (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [10:01:43] (03PS1) 10Marostegui: mariadb: Move db1176 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/893408 (https://phabricator.wikimedia.org/T329478) [10:02:11] !log dcaro@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudcephosd1010.eqiad.wmnet [10:02:23] (03CR) 
10Marostegui: [C: 03+2] mariadb: Move db1176 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/893408 (https://phabricator.wikimedia.org/T329478) (owner: 10Marostegui) [10:02:50] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [10:03:01] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1005.eqiad.wmnet with OS bullseye [10:03:06] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [10:04:24] (03CR) 10Jcrespo: [C: 04-1] "Actually, this doesn't work, we still get lock with this. Retrying with a different syntax." [puppet] - 10https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: 10Jcrespo) [10:04:40] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:04:56] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:06:25] (03PS5) 10Hashar: contint: regroup common firewalling rules [puppet] - 10https://gerrit.wikimedia.org/r/887738 (https://phabricator.wikimedia.org/T329056) [10:06:31] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/887738 (https://phabricator.wikimedia.org/T329056) (owner: 10Hashar) [10:06:50] (03PS4) 10Volans: sre.hosts.reimage: clear DHCP cache for row E/F [cookbooks] - 10https://gerrit.wikimedia.org/r/892487 [10:07:13] (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: clear DHCP cache for row E/F [cookbooks] - 10https://gerrit.wikimedia.org/r/892487 (owner: 10Volans) [10:08:12] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:08:24] PROBLEM - BGP status 
on lsw1-e2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:09:06] (03Merged) 10jenkins-bot: sre.hosts.reimage: clear DHCP cache for row E/F [cookbooks] - 10https://gerrit.wikimedia.org/r/892487 (owner: 10Volans) [10:09:44] (03CR) 10Jcrespo: [C: 04-2] "Giving up because this doesn't work at all. (also tested alternative syntax [myloader]" [puppet] - 10https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: 10Jcrespo) [10:11:50] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: host reimage [10:11:52] (03Abandoned) 10Jcrespo: dbbackups: Implement myloader override in all hosts where it is installed [puppet] - 10https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: 10Jcrespo) [10:13:19] (03CR) 10Hashar: "Compiler result https://puppet-compiler.wmflabs.org/output/887738/1642/" [puppet] - 10https://gerrit.wikimedia.org/r/887738 (https://phabricator.wikimedia.org/T329056) (owner: 10Hashar) [10:13:27] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1002.eqiad.wmnet with reason: host reimage [10:13:40] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:13:49] (03CR) 10Hashar: [C: 03+1] Revert "contint: remove obsolete firewall rules from labs" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887363 (https://phabricator.wikimedia.org/T114209) (owner: 10Hashar) [10:13:59] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1003.eqiad.wmnet with reason: host reimage [10:14:17] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
ml-serve1001.eqiad.wmnet with reason: host reimage [10:14:29] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1004.eqiad.wmnet with reason: host reimage [10:16:51] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ml-serve1004.eqiad.wmnet with reason: host reimage [10:17:19] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1002.eqiad.wmnet with reason: host reimage [10:17:48] (03PS1) 10Jbond: apt: swap active and failover apt servers [puppet] - 10https://gerrit.wikimedia.org/r/893409 [10:18:04] (03CR) 10Jobo: [C: 03+2] Add Jameel to ops and users in data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/893396 (owner: 10Ayounsi) [10:18:13] (03PS1) 10Majavah: P:acme_chief::cloud: support multiple passives [puppet] - 10https://gerrit.wikimedia.org/r/893410 [10:19:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39881/console" [puppet] - 10https://gerrit.wikimedia.org/r/893409 (owner: 10Jbond) [10:19:23] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1003.eqiad.wmnet with reason: host reimage [10:19:50] (03CR) 10Vgutierrez: [C: 03+1] P:acme_chief::cloud: support multiple passives [puppet] - 10https://gerrit.wikimedia.org/r/893410 (owner: 10Majavah) [10:21:25] (03CR) 10Muehlenhoff: "Or we just move back DNS? This will probably only cause confusion and apt.w.o is pretty unrelated to the wider DC switchover? (like idp.w." 
[puppet] - 10https://gerrit.wikimedia.org/r/893409 (owner: 10Jbond) [10:22:25] (03PS2) 10Jbond: apt: swap active and failover apt servers [puppet] - 10https://gerrit.wikimedia.org/r/893409 (https://phabricator.wikimedia.org/T328907) [10:25:36] !log dcaro@cumin1001 START - Cookbook sre.dns.netbox [10:26:34] (03CR) 10Vgutierrez: [C: 03+2] P:acme_chief::cloud: support multiple passives [puppet] - 10https://gerrit.wikimedia.org/r/893410 (owner: 10Majavah) [10:28:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:28:59] (03CR) 10Muehlenhoff: [C: 03+1] apt: swap active and failover apt servers [puppet] - 10https://gerrit.wikimedia.org/r/893409 (https://phabricator.wikimedia.org/T328907) (owner: 10Jbond) [10:29:22] PROBLEM - Host ml-serve1004 is DOWN: PING CRITICAL - Packet loss = 100% [10:29:34] PROBLEM - Host an-worker1132 is DOWN: PING CRITICAL - Packet loss = 100% [10:29:38] (03CR) 10Clément Goubert: apt: swap active and failover apt servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893409 (https://phabricator.wikimedia.org/T328907) (owner: 10Jbond) [10:30:08] RECOVERY - Host ml-serve1004 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [10:30:55] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1001.eqiad.wmnet with OS bullseye [10:32:35] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1005.eqiad.wmnet with reason: host reimage [10:33:23] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1004.eqiad.wmnet with OS bullseye [10:33:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:35:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1005.eqiad.wmnet with reason: host reimage [10:35:27] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1002.eqiad.wmnet with OS bullseye [10:37:13] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1003.eqiad.wmnet with OS bullseye [10:39:54] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:41:18] (03PS1) 10Hashar: contint: Jenkins master > controller [puppet] - 10https://gerrit.wikimedia.org/r/893412 (https://phabricator.wikimedia.org/T254646) [10:42:48] 10Puppet, 10Infrastructure-Foundations, 10Packaging: apt: improve apt failover ochastration - https://phabricator.wikimedia.org/T330849 (10jbond) p:05Triage→03Medium [10:43:27] (03CR) 10Muehlenhoff: [C: 03+1] apt: swap active and failover apt servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893409 (https://phabricator.wikimedia.org/T328907) (owner: 10Jbond) [10:43:57] (03CR) 10Klausman: [C: 03+1] admin_ng: upgrade ml-serve-eqiad to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/892996 (https://phabricator.wikimedia.org/T330758) (owner: 10Elukey) [10:44:04] (03CR) 10JMeybohm: Add a spark-operator chart and helmfile configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [10:44:44] (03CR) 10Elukey: [C: 03+2] admin_ng: upgrade ml-serve-eqiad to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/892996 (https://phabricator.wikimedia.org/T330758) (owner: 10Elukey) [10:47:57] 10SRE, 10DBA, 10Toolhub, 
10Wikimedia-Mailing-lists, 10cloud-services-team: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10Marostegui) [10:49:53] (03Merged) 10jenkins-bot: admin_ng: upgrade ml-serve-eqiad to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/892996 (https://phabricator.wikimedia.org/T330758) (owner: 10Elukey) [10:50:57] (03CR) 10Jbond: [C: 03+2] apt: swap active and failover apt servers [puppet] - 10https://gerrit.wikimedia.org/r/893409 (https://phabricator.wikimedia.org/T328907) (owner: 10Jbond) [10:54:56] (03PS35) 10Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [10:55:31] (03PS2) 10Ayounsi: Add Jameel to ops and users in data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/893396 [10:56:59] !log dcaro@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephosd1010.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dcaro@cumin1001" [10:57:32] !log upgrade cloudweb to PHP 1:7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u2 T330270 [10:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:48] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [10:58:51] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [10:59:05] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [10:59:07] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [10:59:23] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [10:59:29] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [10:59:43] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[10:59:50] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [10:59:55] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230301T1100) [11:00:33] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [11:01:11] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [11:01:23] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [11:01:31] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/893412 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [11:01:33] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [11:01:37] RECOVERY - Host an-worker1132 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [11:01:43] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:45] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [11:02:29] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [11:02:32] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [11:02:41] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [11:03:06] (03PS36) 10Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [11:03:09] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [11:03:16] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [11:03:22] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[11:03:27] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [11:03:40] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [11:03:52] elukey@deploy2002: Failed to log message to wiki. Somebody should check the error logs. [11:04:00] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1005.eqiad.wmnet with OS bullseye [11:04:12] hm, why did that fail? [11:04:31] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1006.eqiad.wmnet with OS bullseye [11:04:52] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1007.eqiad.wmnet with OS bullseye [11:05:07] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1008.eqiad.wmnet with OS bullseye [11:05:47] ACKNOWLEDGEMENT - SSH on an-worker1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Nicolas Fraison Reboot https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:07:14] !log dcaro@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephosd1010.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dcaro@cumin1001" [11:07:15] !log dcaro@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:07:16] !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudcephosd1010.eqiad.wmnet [11:07:29] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [11:07:32] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[11:07:45] (03CR) 10Btullis: Add a spark-operator chart and helmfile configuration (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [11:08:11] PROBLEM - BGP status on lsw1-f3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:08:41] PROBLEM - BGP status on lsw1-f2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:08:58] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [11:09:01] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [11:12:23] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:12:45] Checking SAL logging [11:14:06] (03PS1) 10Elukey: kserve: add replicas setting for Deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/893417 (https://phabricator.wikimedia.org/T324542) [11:15:08] taavi: Very weird, everything got logged except the PASS [11:15:26] yeah, maybe a temporary fail [11:16:20] (03CR) 10Klausman: kserve: add replicas setting for Deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/893417 (https://phabricator.wikimedia.org/T324542) (owner: 10Elukey) [11:17:25] (03CR) 10Elukey: kserve: add replicas setting for Deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/893417 (https://phabricator.wikimedia.org/T324542) (owner: 10Elukey) [11:20:33] RECOVERY - BGP status on lsw1-e2-eqiad.mgmt is OK: BGP OK - up: 4, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:20:43] 
(03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: Drop RBAC rules for deprecated resources [puppet] - 10https://gerrit.wikimedia.org/r/889836 (https://phabricator.wikimedia.org/T329869) (owner: 10Majavah) [11:23:29] (03CR) 10Jbond: [C: 03+1] "lgtm ping me to merge" [puppet] - 10https://gerrit.wikimedia.org/r/893412 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [11:27:21] (03PS1) 10Marostegui: check_private_data_report: Add db2187 [puppet] - 10https://gerrit.wikimedia.org/r/893420 (https://phabricator.wikimedia.org/T326596) [11:27:45] (03CR) 10Marostegui: [C: 03+2] check_private_data_report: Add db2187 [puppet] - 10https://gerrit.wikimedia.org/r/893420 (https://phabricator.wikimedia.org/T326596) (owner: 10Marostegui) [11:27:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:toolforge: use api gateway for jobs cli [puppet] - 10https://gerrit.wikimedia.org/r/892370 (https://phabricator.wikimedia.org/T329443) (owner: 10Majavah) [11:28:20] 10Puppet, 10Infrastructure-Foundations, 10Packaging: apt: improve apt failover ochastration - https://phabricator.wikimedia.org/T330849 (10jbond) [11:28:43] PROBLEM - BGP status on lsw1-e3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:33:36] (03PS1) 10Jbond: puppet::agent: Add external facts directory [puppet] - 10https://gerrit.wikimedia.org/r/893421 [11:33:57] (03CR) 10Jbond: [C: 03+2] puppet::agent: Add external facts directory [puppet] - 10https://gerrit.wikimedia.org/r/893421 (owner: 10Jbond) [11:34:54] arturo: happy for me to merge your change? [11:35:25] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [11:36:20] jbond: sorry, please go! 
[11:37:28] don [11:37:30] e [11:38:49] (03CR) 10Alexandros Kosiaris: [C: 04-1] "-1ing to avoid accidental merge until the dependent restbase change gets merged" [deployment-charts] - 10https://gerrit.wikimedia.org/r/890357 (owner: 10PipelineBot) [11:39:11] (03CR) 10Marostegui: dbbackups: Implement myloader override in all hosts where it is installed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893400 (https://phabricator.wikimedia.org/T319383) (owner: 10Jcrespo) [11:40:10] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Noting also that this release bumps mathoid to node16 (see https://gerrit.wikimedia.org/r/c/mediawiki/services/mathoid/+/866666)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/890357 (owner: 10PipelineBot) [11:40:23] RECOVERY - Check systemd state on mw1428 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:05] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:42:49] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:42:57] (03PS3) 10Hnowlan: helmfile: add device-analytics configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/886358 (https://phabricator.wikimedia.org/T320967) [11:46:16] (03CR) 10CI reject: [V: 04-1] helmfile: add device-analytics configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/886358 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [11:49:48] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:51:58] (03CR) 
10Klausman: [C: 03+1] kserve: add replicas setting for Deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/893417 (https://phabricator.wikimedia.org/T324542) (owner: 10Elukey) [11:54:48] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:58:05] (03PS3) 10Hnowlan: service, k8s: add service configuration for AQS2 service device-analytics [puppet] - 10https://gerrit.wikimedia.org/r/889960 (https://phabricator.wikimedia.org/T320967) [11:58:47] !log upgrade parse/eqiad to PHP 1:7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u2 T330270 [11:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:54] (03CR) 10Btullis: "This is the SparkApplication that I have been using to test this chart. Note the use of the `spark-driver` serviceAccount." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [12:00:55] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [12:07:26] (03PS1) 10Vgutierrez: acme_chief: Enforce passive_hosts as a list of FQDN [puppet] - 10https://gerrit.wikimedia.org/r/893425 (https://phabricator.wikimedia.org/T321309) [12:10:34] (03PS2) 10Filippo Giunchedi: prometheus: add pint source for k8s [puppet] - 10https://gerrit.wikimedia.org/r/893407 (https://phabricator.wikimedia.org/T309182) [12:10:36] (03PS1) 10Filippo Giunchedi: prometheus: scrape pint [puppet] - 10https://gerrit.wikimedia.org/r/893466 (https://phabricator.wikimedia.org/T309182) [12:11:02] (03CR) 10CI reject: [V: 04-1] prometheus: scrape pint [puppet] - 10https://gerrit.wikimedia.org/r/893466 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [12:11:41] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39882/console" [puppet] - 10https://gerrit.wikimedia.org/r/893425 (https://phabricator.wikimedia.org/T321309) (owner: 10Vgutierrez) [12:17:37] (03PS37) 10Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [12:18:12] (03PS2) 10Hnowlan: Add service records for device-analytics. [dns] - 10https://gerrit.wikimedia.org/r/890398 (https://phabricator.wikimedia.org/T320967) [12:19:43] (03CR) 10CI reject: [V: 04-1] Add service records for device-analytics. 
[dns] - 10https://gerrit.wikimedia.org/r/890398 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [12:20:59] (03CR) 10Ladsgroup: [C: 03+1] build: Change diffConfig to use git-stash instead of git-add [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893088 (owner: 10Krinkle) [12:22:17] (03PS3) 10Hnowlan: Add service records for device-analytics. [dns] - 10https://gerrit.wikimedia.org/r/890398 (https://phabricator.wikimedia.org/T320967) [12:23:26] (03PS2) 10Filippo Giunchedi: prometheus: scrape pint [puppet] - 10https://gerrit.wikimedia.org/r/893466 (https://phabricator.wikimedia.org/T309182) [12:23:28] (03PS3) 10Filippo Giunchedi: prometheus: add pint source for k8s [puppet] - 10https://gerrit.wikimedia.org/r/893407 (https://phabricator.wikimedia.org/T309182) [12:25:36] (03PS2) 10Ladsgroup: Remove config for former Rdbms logging channels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893084 (https://phabricator.wikimedia.org/T320873) (owner: 10Krinkle) [12:26:57] (03CR) 10Btullis: Add a spark-operator chart and helmfile configuration (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [12:28:05] !log upgrade mwmaint to PHP 1:7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u2 T330270 [12:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:06] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39884/console" [puppet] - 10https://gerrit.wikimedia.org/r/893466 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [12:36:51] (03CR) 10Ottomata: Add a spark-operator chart and helmfile configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [12:38:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. Please @andrew double check." 
[puppet] - 10https://gerrit.wikimedia.org/r/892944 (owner: 10Majavah) [12:38:26] (03PS1) 10Jbond: profile::confd: add a confd profile [puppet] - 10https://gerrit.wikimedia.org/r/893468 (https://phabricator.wikimedia.org/T330849) [12:39:42] (03PS1) 10Marostegui: db2183: Migrate to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/893469 (https://phabricator.wikimedia.org/T330861) [12:40:03] (03PS2) 10Jbond: profile::confd: add a confd profile [puppet] - 10https://gerrit.wikimedia.org/r/893468 (https://phabricator.wikimedia.org/T330849) [12:40:14] !log Upgrade db2183 to 10.6 T330861 [12:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:20] T330861: Migrate backup1-* masters to MariaDB 10.6 - https://phabricator.wikimedia.org/T330861 [12:40:34] (03CR) 10Marostegui: [C: 03+2] db2183: Migrate to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/893469 (https://phabricator.wikimedia.org/T330861) (owner: 10Marostegui) [12:42:44] (03PS3) 10Jbond: profile::confd: add a confd profile [puppet] - 10https://gerrit.wikimedia.org/r/893468 (https://phabricator.wikimedia.org/T330849) [12:46:07] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39887/console" [puppet] - 10https://gerrit.wikimedia.org/r/893468 (https://phabricator.wikimedia.org/T330849) (owner: 10Jbond) [12:51:04] (03PS1) 10Jbond: confd::file: drop relative prefix [puppet] - 10https://gerrit.wikimedia.org/r/893471 (https://phabricator.wikimedia.org/T330849) [12:54:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39888/console" [puppet] - 10https://gerrit.wikimedia.org/r/893471 (https://phabricator.wikimedia.org/T330849) (owner: 10Jbond) [12:54:40] (03CR) 10Jbond: [V: 03+1] "pcc shows a diff to core_resources but its only white-space" [puppet] - 10https://gerrit.wikimedia.org/r/893471 
(https://phabricator.wikimedia.org/T330849) (owner: 10Jbond) [12:54:43] (03PS1) 10Jaime Nuche: scap bootstrap: use new installation mechanism [puppet] - 10https://gerrit.wikimedia.org/r/893473 (https://phabricator.wikimedia.org/T329622) [13:05:18] (03PS1) 10Majavah: P:toolforge::k8s::haproxy: drop standalone jobs ingress [puppet] - 10https://gerrit.wikimedia.org/r/893474 (https://phabricator.wikimedia.org/T329443) [13:08:32] (03PS2) 10Krinkle: build: Change diffConfig to use git-stash instead of git-add [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893088 [13:08:35] (03CR) 10Krinkle: [C: 03+2] build: Change diffConfig to use git-stash instead of git-add [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893088 (owner: 10Krinkle) [13:09:18] (03Merged) 10jenkins-bot: build: Change diffConfig to use git-stash instead of git-add [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893088 (owner: 10Krinkle) [13:09:35] (03PS3) 10Krinkle: Remove config for former Rdbms logging channels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893084 (https://phabricator.wikimedia.org/T320873) [13:09:39] (03CR) 10Krinkle: [C: 03+2] Remove config for former Rdbms logging channels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893084 (https://phabricator.wikimedia.org/T320873) (owner: 10Krinkle) [13:09:59] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-codfw [13:10:05] !log Adding scheduled maintenance for switchover to statuspage - T327920 [13:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:09] T327920: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 [13:10:19] (03Merged) 10jenkins-bot: Remove config for former Rdbms logging channels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893084 (https://phabricator.wikimedia.org/T320873) (owner: 10Krinkle) [13:11:04] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-codfw [13:11:38] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:toolforge::k8s::haproxy: drop standalone jobs ingress [puppet] - 10https://gerrit.wikimedia.org/r/893474 (https://phabricator.wikimedia.org/T329443) (owner: 10Majavah) [13:11:49] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [13:11:50] jouncebot: nowandnext [13:11:50] No deployments scheduled for the next 0 hour(s) and 48 minute(s) [13:11:51] In 0 hour(s) and 48 minute(s): Datacenter Switchover - Mediawiki (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230301T1400) [13:12:10] 48 minutes to go, how exciting [13:12:24] TheresNoTime: I'm pushing some minor config clean up pathces a.t.m. [13:12:32] I can stop though [13:13:21] I was just curious how long there was until the switch, not my call on if you need to stop :D [13:13:22] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [13:13:34] k :) [13:14:11] (03PS2) 10Krinkle: filebackend: Replace stringified class names with ::class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891962 (owner: 10Reedy) [13:14:15] (03CR) 10Krinkle: [C: 03+2] filebackend: Replace stringified class names with ::class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891962 (owner: 10Reedy) [13:14:28] * Krinkle testing on mwdebug2001 [13:14:57] (03Merged) 10jenkins-bot: filebackend: Replace stringified class names with ::class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891962 (owner: 10Reedy) [13:15:06] (03PS3) 10Krinkle: filebackend: Opinionated reformatting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891964 (owner: 10Reedy) [13:15:10] (03CR) 10Krinkle: [C: 03+2] filebackend: 
Opinionated reformatting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891964 (owner: 10Reedy) [13:15:50] (03Merged) 10jenkins-bot: filebackend: Opinionated reformatting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891964 (owner: 10Reedy) [13:16:37] Krinkle: I will be locking scap deployments at 1330UTC [13:17:06] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-eqiad [13:17:09] 10Puppet, 10Infrastructure-Foundations, 10Packaging, 10Patch-For-Review: apt: improve apt failover ochastration - https://phabricator.wikimedia.org/T330849 (10Volans) We should find a standard setup for those use cases, I can see Netbox having exactly the same issue/requirement (some puppet-driver resourc... [13:17:50] I would ask that everybody refrain from running cookbooks or other starting at 1330UTC too [13:18:01] (03CR) 10Majavah: "This could also help drop some local hacks from the deployment-prep puppetmaster." [puppet] - 10https://gerrit.wikimedia.org/r/893468 (https://phabricator.wikimedia.org/T330849) (owner: 10Jbond) [13:18:06] (since I can't lock that down, I'm counting on y'all) [13:18:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-eqiad [13:18:32] ack [13:19:31] !log krinkle@deploy2002 Synchronized wmf-config/: Ie063fbf91d5b41e0 - Remove config for former Rdbms logging (duration: 07m 39s) [13:19:46] (03CR) 10Majavah: kubeadm: update wmcs-k8s-get-cert for certificates/v1 [puppet] - 10https://gerrit.wikimedia.org/r/890502 (https://phabricator.wikimedia.org/T292238) (owner: 10Majavah) [13:19:56] same goes for helm chart deploys, homer runs [13:20:23] yes [13:20:24] ack [13:21:04] I wonder if also netbox changes, maybe a shoutout to dc-ops might be worth [13:21:15] yep. 
doing [13:21:27] (03CR) 10Krinkle: [C: 03+2] "Tested by re-rendering https://en.wikipedia.org/wiki/ImageMagick and by purging thumbs of https://commons.wikimedia.org/wiki/File:ImageMag" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891964 (owner: 10Reedy) [13:23:00] * Krinkle is done [13:23:04] Thanks <3 [13:23:53] err. it's taking af ew more minutes t finish the sync actually, my bad. [13:24:08] I'm done testing the second change, should be done in ~5min [13:24:15] syncs take longer than they used to [13:24:34] It's ok, I should have communicated that I wanted a larger berth for deployments here and not just in -sre [13:25:01] it's ok Krinkle, this way we can use you as scapegoat if the need arises :-P [13:28:03] now *that's* planning ahead :> [13:30:34] Holding until last deployment is done [13:30:54] !log krinkle@deploy2002 Synchronized wmf-config/: I3beefbf4ee3d66 filebackend cleanup (duration: 07m 13s) [13:31:02] right on the clock [13:31:04] !log Locking scap deployments for datacenter switchover - T327920 [13:31:05] * Krinkle is actually done [13:31:06] <_joe_> :) [13:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:09] T327920: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 [13:31:31] <_joe_> action item: add scap locking to the switchdc cookbook [13:32:19] _joe_: I added it to my checklist, which we'll use as base for improvements [13:33:17] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10Marostegui) These hosts are correctly added to the partman recipe regex. [13:34:10] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [13:37:35] I think we're all set, now we wait [13:37:50] jnuche: how the train went this morning? 
;) [13:37:59] !log dcaro@cumin1001 START - Cookbook sre.dns.netbox [13:38:07] <_joe_> claime: you can start with step 0 whenever you want btw [13:38:13] dcaro: wth [13:39:15] hashar: went well, logged an already existing issue not related to 1.40.0-wmf.25 [13:39:18] Starting step 0, everybody good? [13:39:18] other than that logs are quiet [13:39:37] <_joe_> claime: +1 [13:39:46] <_joe_> you can skip the warmup ofc [13:40:10] !log Starting mediawiki datacenter switchover step 0 - T327920 [13:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:15] T327920: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 [13:40:16] !log dcaro@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: moved cloudcephosd1015 to rack F4 - dcaro@cumin1001" [13:40:28] dcaro: Please stop any changes, we are starting with the DC switch [13:40:36] Executing cookbook sre.switchdc.mediawiki with args: ['eqiad', 'codfw'] claime +1 for ARGS :D [13:40:55] <_joe_> +1 here too [13:40:56] * claime deep breaths [13:41:11] <_joe_> 🤠 it [13:41:13] Waiting on puppet.sync-netbox [13:41:22] !log dcaro@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: moved cloudcephosd1015 to rack F4 - dcaro@cumin1001" [13:41:22] !log dcaro@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:41:26] Let's go [13:41:28] \facepalm, ack, will stop doing anything [13:41:33] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet [13:41:35] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) [13:41:36] <_joe_> dcaro: thanks :) [13:41:41] jnuche: excellent :-] [13:41:45] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks [13:41:54] !log cgoubert@cumin1001 END (PASS) - Cookbook 
sre.switchdc.mediawiki.00-downtime-db-readonly-checks (exit_code=0) [13:41:57] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl [13:42:02] Skipping warmup [13:42:46] 5 minutes mandatory wait for TTL change [13:42:59] <_joe_> yes [13:43:18] Will do a GO/NOGO check before disabling maintenance [13:43:25] <_joe_> then the steps that need to happen in sequence start, so yeah [13:43:27] And a final GO/NOGO before entering RO phase [13:43:35] +1 [13:43:35] OK [13:43:37] <_joe_> ack [13:43:54] In any case, I will not enter RO before 1400 [13:43:59] _joe_: do you recall why we don't just nuke the recursors's cache for those records instead of the sleep? [13:44:04] claime: cool [13:44:08] claime: +1 [13:44:10] volans: I vote tech debt [13:44:19] <_joe_> volans: gives a nice breathing room before step 1 [13:44:21] <_joe_> :D [13:44:24] But also yeah [13:44:25] lol [13:44:26] Breather [13:44:43] <_joe_> we did discuss removing it, decided against it [13:45:07] ck [13:45:09] *ack [13:45:11] Getting my pot of tea ready [13:45:19] akosiaris: No gyros? 
[13:45:55] Shrimp and Taramasalata actually today [13:46:10] I had gyros a couple of days ago though [13:46:14] :( [13:46:20] (03CR) 10Jaime Nuche: "Added patch to Puppet deployment window tomorrow Thursday, after train and DC switchover are complete: https://wikitech.wikimedia.org/wiki" [puppet] - 10https://gerrit.wikimedia.org/r/893473 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche) [13:46:21] I had gyros yesterday [13:46:25] Well, kebab [13:46:27] same difference [13:46:30] :p [13:46:34] * _joe_ playing "Ain't no mountain high enough" [13:46:59] <_joe_> (Marvin Gaye and Tammi Terrell, fyi) [13:47:03] recommended listening during the switchover: https://en.wikipedia.org/wiki/Listen_to_Wikipedia [13:47:12] <_joe_> and that yes [13:47:33] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) [13:47:38] TTLs set [13:47:45] GO/NOGO maintenance stop [13:48:16] go [13:48:39] <_joe_> go [13:48:52] Heads up Emperor jbond [13:49:09] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance [13:49:21] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=99) [13:49:34] great [13:49:44] ----- OUTPUT of 'systemctl list-u...t 255 || exit 0'' ----- [13:49:46] <_joe_> claime: do not despair [13:49:46] static [13:49:51] node=mwmaint1002.eqiad.wmnet, rc=124, command='systemctl list-units 'mediawiki_job_*' --no-legend | awk '{print $1}' | xargs -n 1 sh -c 'systemctl is-enabled $0 && exit 255 || exit 0'' [13:50:14] <_joe_> ah we have some failed units [13:50:15] f- [13:50:17] <_joe_> lol [13:51:08] <_joe_> ok so [13:51:13] <_joe_> all timers seem to be down now [13:51:31] fails reset [13:51:37] re-running step [13:51:39] we can re-run it (or all for that matters) as they should be idempotent [13:51:40] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance [13:51:40] !log elukey@cumin1001 END (FAIL) - Cookbook 
sre.hosts.reimage (exit_code=99) for host ml-serve1007.eqiad.wmnet with OS bullseye [13:52:06] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) [13:52:10] yay [13:52:11] There. [13:52:12] <_joe_> cool [13:52:16] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1007.eqiad.wmnet with OS bullseye [13:52:19] Breathing until 1400 [13:52:22] elukey: please stop [13:52:22] <_joe_> elukey: ahem [13:53:00] <_joe_> ok, I'd say we are ok to go personally [13:53:15] <_joe_> should we wait for 15:00 ? [13:53:17] I think so too, but holding until the actual maintenance time. [13:53:19] Yes. [13:53:25] <_joe_> booo :D [13:53:30] I have a planned maintenance scheduled to go up in statuspage [13:53:35] I'd rather respect it [13:53:37] :p [13:53:38] yeah let's wait for 14:00 utc [13:53:40] <_joe_> yes yes I'm joking [13:53:45] ik ik [13:53:52] stick to the plan :P [13:53:57] <_joe_> I was playing on the cowboy theme [13:54:04] yeehaw [13:54:05] <_joe_> but, I was serious on the GO [13:54:09] not the best time ? [13:54:12] <_joe_> I think we're set [13:54:19] el-p [13:54:22] claime: ah snap sorry, I retried since it was failed and realized [13:54:25] akosiaris: it's ok, I'm deep breathing [13:54:28] :P [13:56:07] !log elukey@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1007.eqiad.wmnet with OS bullseye [13:56:13] done :) [13:56:18] ack [13:56:58] is anyone recording listen to wikipedia ? 
[13:57:13] <_joe_> no I'm just listening [13:57:21] <_joe_> it's the best feedback about ro-mode [13:57:26] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) [13:57:28] <_joe_> without needing to actually try an edit [13:57:54] T-3 minutes, final GO/NOGO check before read-only [13:58:03] I am ready [13:58:03] <_joe_> and btw, the trick is to also select one wiki per section, so I added wikidata, itwiki, frwiki, dewiki, eswiki, ruwiki [13:58:39] _joe_: I have enwiki [13:58:45] And commons [13:59:34] Everybody set ? [13:59:37] yep [13:59:39] ready [13:59:57] <_joe_> so the read-only set will take 15-30 seconds to propagate, but we can proceed with setting the dbs readonly [14:00:03] here we go [14:00:04] claime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Datacenter Switchover - Mediawiki deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230301T1400). [14:00:10] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.02-set-readonly [14:00:10] !log cgoubert@cumin1001 MediaWiki read-only period starts at: 2023-03-01 14:00:10.075167 [14:00:22] <_joe_> ah nevermind, it checks itself [14:00:29] silence [14:00:32] <_joe_> silence here too [14:00:39] same [14:00:39] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) [14:00:40] cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [14:00:40] <_joe_> well almost silence [14:00:41] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly [14:00:42] cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[14:00:58] ^ expected [14:01:02] wikitech is on s6 [14:01:06] <_joe_> yes, sigh [14:01:15] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) [14:01:16] <_joe_> we forgot this [14:01:16] cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [14:01:16] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki [14:01:17] cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [14:01:29] <_joe_> switching mediawiki [14:01:42] 10Puppet, 10Infrastructure-Foundations, 10Packaging, 10Patch-For-Review: apt: improve apt failover orchestration - https://phabricator.wikimedia.org/T330849 (10Aklapper) [14:01:56] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) [14:01:57] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite [14:01:57] cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [14:01:59] cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [14:02:00] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) [14:02:01] cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [14:02:02] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite [14:02:03] cgoubert@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[14:02:09] sound [14:02:09] !log cgoubert@cumin1001 MediaWiki read-only period ends at: 2023-03-01 14:02:09.272468 [14:02:09] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) [14:02:10] sounds [14:02:12] enwiki codfw master receiving writes [14:02:14] Out [14:02:20] * claime breathes [14:02:23] <_joe_> wooo [14:02:24] same with commons [14:02:27] edit went through on eswiki (s7) too [14:02:36] Starting post-RO steps [14:02:37] <_joe_> wikidata too [14:02:43] 👏 👏 👏 [14:02:44] <_joe_> s3 and s5 and s4 too [14:02:44] 119s of RO time [14:02:47] niiiice [14:02:51] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners [14:02:53] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners (exit_code=0) [14:03:00] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance [14:03:02] not finished [14:03:23] anyone monitoring fatals? [14:03:26] <_joe_> jynus: what? 
[14:03:39] _joe_: I mean we are not done and not celebrate early [14:03:52] don't break the mood, we know [14:03:58] :-) [14:04:05] We're out of the hairy part though [14:04:05] <_joe_> fatals are going down [14:04:10] <_joe_> yes we are [14:04:15] <_joe_> and the latency is ok too [14:04:17] _joe_: thanks [14:04:17] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:04:28] <_joe_> uhm [14:04:31] <_joe_> jobrunners [14:04:33] POSTS going up on appservers [14:04:39] <_joe_> let me check the jobrunners for a sec [14:04:43] the dashboard to check just in case https://logstash.wikimedia.org/app/dashboards#/view/mediawiki-errors [14:04:43] _joe_: envoy restarts I bet [14:04:52] wikitech edits work fine, and its weird job running setup works too [14:04:56] Documentation says to expect 500s [14:05:32] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) [14:05:36] It was just a spike of Wikimedia\Rdbms\DBReadOnlyError: Database is read-only: You can't edit now. This is because of maintenance. 
Copy and save your text and try again in a few minutes now gone [14:05:39] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.09-restore-ttl [14:05:39] <_joe_> jobs have moved to codfw [14:05:48] [for later] output of 08-start-maintenance could be improved :-P [14:05:49] <_joe_> jynus: yeah just a bit late [14:05:57] insertation works but I'm not seeing processing yet https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1 [14:06:01] SAL working fine [14:06:09] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.09-restore-ttl (exit_code=0) [14:06:10] Once TTLs are restored, I'll merge the DNS change [14:06:20] (03PS2) 10Clément Goubert: db: Switch dns master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/891552 (https://phabricator.wikimedia.org/T327920) [14:06:31] (03CR) 10Clément Goubert: db: Switch dns master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/891552 (https://phabricator.wikimedia.org/T327920) (owner: 10Clément Goubert) [14:06:37] waiting on jenkings [14:06:39] -g [14:06:57] <_joe_> Amir1: the graphs are broken for some reason [14:06:58] Someone needs to check why SAL works fine but there's no response to !log irc commands [14:07:03] sigh [14:07:13] <_joe_> it's tcpircbot I guess marostegui [14:07:16] I guess hard-coded to eqiad, maybe [14:07:16] tcpircbot probably needs a restart [14:07:19] the job dashboard works if you switch to the codfw dashboard [14:07:20] +1 on taht guess [14:07:21] _joe_: yeah [14:07:27] I 'll handle that [14:07:30] no, that's stashbot expected behaviour to reduce spam here [14:07:31] thanks akosiaris [14:07:31] !log test [14:07:32] <_joe_> thanks akosiaris [14:07:33] that == tcpircbot [14:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:45] <_joe_> ah I see [14:07:47] it shows the success message for humans, but for bots it's errors only [14:07:50] ah, here we are. 
thanks [14:07:57] ah cool, problem solved thanks taavi [14:08:02] (03CR) 10Clément Goubert: [C: 03+2] db: Switch dns master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/891552 (https://phabricator.wikimedia.org/T327920) (owner: 10Clément Goubert) [14:08:18] !log Phase 9.5 Update DNS records for new database masters - T327920 [14:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:24] T327920: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 [14:08:30] claime: I will change pcX later, not important [14:09:02] marostegui: ack [14:09:07] <_joe_> Amir1: can you check logstash for unexpected mw errors? [14:09:13] maybe we should have a script to create the dns change [14:09:15] <_joe_> I'm taking a look at the cluster's health [14:09:16] sure [14:09:17] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:09:19] I will check redlinks, category updates or transcodes for job execution [14:09:56] !log Phase 9.5 DNS records for new database masters updated - T327920 [14:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:04] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters [14:10:08] <_joe_> https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&from=now-30m&to=now&viewPanel=9 lol amazing [14:10:09] transcodes seems to be happening well [14:10:13] _joe_: nothing [14:10:15] I screen captured btw listen to wikipedia, it will be a nice addition to the email :-) [14:10:26] akosiaris: <3 [14:10:34] I'll be able to relisten to it :D [14:10:36] codfw masters they all seem stable [14:10:37] <_joe_> wow no latency increase at all [14:10:39] Nice memory [14:10:47] [for later] the bach_size for the pupper run on DB hosts can be increased from 
the current 5 [14:10:50] <_joe_> thanks, multidc mediawiki [14:10:54] yeah, that is way way better [14:10:59] _joe_: without warmup too [14:11:03] are the "MySQL server has gone away" errors for GrowthExperiments known/expected? [14:11:15] <_joe_> I guess not [14:11:20] <_joe_> taavi: link? [14:11:22] they've been going for longer than the switchover fwiw [14:11:25] taavi: they've been there for a while [14:11:28] <_joe_> ah I see [14:11:31] <_joe_> ok [14:11:38] <_joe_> so working as expected(TM) [14:11:40] https://logstash.wikimedia.org/goto/195ec9c292098639e5fe4884d38fcf53 [14:11:42] recategorizations also working well and fast see no obviou job issue atm [14:11:50] ah [14:12:03] "correctly broken" :) [14:12:05] taavi: We still need to get someone to look at them, but they aren't related to the switch [14:12:22] appserver/api_appserver/parsoid graphs looking healthy [14:12:44] and the thread deadlocks are expected too [14:12:51] like not expected but known [14:13:04] puppet run on db still going btw [14:13:07] <_joe_> marostegui: uhm they appeared after the switch though to be more frequent [14:13:12] I am going to reduce db2122's weight a bit, as it is having a spike of load [14:13:20] IIRC, we used to need some rebalancing of databases after each switchover, is that still a case ? 
[14:13:27] akosiaris: no, because we have multidc [14:13:30] and marostegui was faster than my question :-) [14:13:50] marostegui: your rebalancing begs to differ btw :P [14:13:54] <_joe_> akosiaris: it's possible that some rw-traffic changes slightly things [14:14:09] <_joe_> but we're not in a situation on the cliff [14:14:11] plus job queue backlog [14:14:12] but it should be definitely way way better now (in theory) [14:14:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Reduce db2122 weight', diff saved to https://phabricator.wikimedia.org/P44913 and previous config saved to /var/cache/conftool/dbconfig/20230301-141414-marostegui.json [14:14:25] <_joe_> akosiaris: we don't have ES on fire, for instance [14:14:30] yes [14:14:36] I eagerly await work on T265386 [14:14:36] T265386: Make LoadMonitor server states more up-to-date and respond to outages more quickly - https://phabricator.wikimedia.org/T265386 [14:14:53] don't worry we will have all of this once we plan to repool eqiad after being depooled for a month [14:15:03] _joe_: Also, no action needed for ES [14:15:07] Like, none at all. 
[14:15:11] <_joe_> claime: exactly [14:15:20] <_joe_> in the past they'd be on fire for 10-15 minutes [14:15:26] Amir1: we are pooling it read only in a week [14:15:28] <_joe_> with nice consequences for the appservers [14:15:38] downtime removal in progress [14:15:38] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters (exit_code=0) [14:15:45] 1 week fully on codfw, 7 weeks multidc with eqiad being the secondary, that's the plan [14:15:46] And we're done with the cookbook now [14:15:59] akosiaris: oh clever [14:16:04] !log Removing scap lock - T327920 [14:16:07] now yes, great job [14:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:10] T327920: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 [14:16:19] great work y'all :) [14:16:25] are we already good to resolve the status page "incident"? [14:16:27] Merging https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/892428 [14:16:40] (03CR) 10Clément Goubert: [C: 03+2] debug.json: List primary DC servers first [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892428 (https://phabricator.wikimedia.org/T327920) (owner: 10Clément Goubert) [14:16:50] (03PS1) 10Marostegui: wmnet: Update pcX DNS [dns] - 10https://gerrit.wikimedia.org/r/893479 (https://phabricator.wikimedia.org/T327920) [14:16:53] <_joe_> claime: ah that's a nice touch [14:17:07] <_joe_> now you have to scap it :D [14:17:09] thank legoktm for adding it to the procedure :D [14:17:24] (03Merged) 10jenkins-bot: debug.json: List primary DC servers first [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892428 (https://phabricator.wikimedia.org/T327920) (owner: 10Clément Goubert) [14:17:28] <_joe_> claime: ok with resolving the incident? [14:17:34] Yes.
[14:17:49] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) It happend. The next step, next week: debrief the process. [14:17:51] I'll check your patch Manuel [14:17:54] backporting change [14:17:57] Amir1: thanks, no rush [14:18:24] !log cgoubert@deploy2002 Started scap: Backport for [[gerrit:892428|debug.json: List primary DC servers first (T327920)]] [14:19:04] k8s build is taking a long time *blows through nose* [14:19:08] (03CR) 10Ladsgroup: [C: 03+1] "new values are correct." [dns] - 10https://gerrit.wikimedia.org/r/893479 (https://phabricator.wikimedia.org/T327920) (owner: 10Marostegui) [14:19:10] question, is eqiad depooled? [14:19:17] jynus: at what layer? [14:19:26] because I see almost no tls connections in mysql [14:19:36] It's traffic depooled since yesterday [14:19:39] it's depooled at traffic and mw and multiple service layers [14:19:40] that's expected, eqiad is doing almost nothing right now [14:19:47] It's chillin' [14:19:52] claime: it's celebrating [14:19:53] enjoying its rest [14:20:06] I see, thanks, this is the graph that I saw not going back to normal [14:20:17] jynus: you got 1 week to wreak havoc in whatever you want in eqiad. 
in 1 week we repool it as readonly [14:20:22] 🍾 🎉 [14:20:30] !log cgoubert@deploy2002 cgoubert: Backport for [[gerrit:892428|debug.json: List primary DC servers first (T327920)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [14:20:31] https://grafana.wikimedia.org/goto/9xs6AJxVk?orgId=1 [14:20:41] I always felt like this time is like when a very busy airport is shut down for maintenance, now it's time to do all sorts of crazy [14:20:43] Checking email flow [14:21:01] (03CR) 10Marostegui: [C: 03+2] wmnet: Update pcX DNS [dns] - 10https://gerrit.wikimedia.org/r/893479 (https://phabricator.wikimedia.org/T327920) (owner: 10Marostegui) [14:21:10] I have a couple of schema changes I want to run there [14:21:13] mwhahahahahahah [14:21:59] <_joe_> Amir1: let's wait for tomorrow maybe [14:22:01] Email flowing [14:22:29] eventstreams flowing [14:22:52] scap is finishing helmfile apply, and then we're done [14:23:02] can I re-start a mw maintenance script that I was running? or do you want me to wait a bit? [14:23:14] Give it a sec I'm running a backport [14:24:08] (I know it shouldn't conflict but I'm more comfortable that way <3) [14:24:19] claime: pcX dns changes merged and deployed [14:24:27] marostegui: awesome, thank you <3 [14:25:14] It's restarting php-fpm, 65% done [14:25:51] are cookbook runs good to go or need more time for checks etc? [14:26:18] !log cgoubert@deploy2002 Finished scap: Backport for [[gerrit:892428|debug.json: List primary DC servers first (T327920)]] (duration: 07m 54s) [14:26:23] I'm done. 
[14:26:24] T327920: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 [14:26:35] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [14:27:03] !log End mediawiki datacenter switchover - T327920 [14:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:30] \o/ \o/ \o/ \o/ congrats and nicely done everyone [14:27:36] !log re-start persistRevisionThreadItems.php on itwiki from P44912 after DC switchover T315510 [14:27:37] great work [14:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:41] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [14:28:10] (03PS1) 10Hashar: Revert "ci: Permit ES traffic from jenkins masters to relforge" [puppet] - 10https://gerrit.wikimedia.org/r/893457 [14:28:23] (03CR) 10CI reject: [V: 04-1] Revert "ci: Permit ES traffic from jenkins masters to relforge" [puppet] - 10https://gerrit.wikimedia.org/r/893457 (owner: 10Hashar) [14:28:33] moritzm: taavi you can go ahead [14:28:42] ack, thx [14:29:39] zabe: you probably need to restart your scripts too, from mwmaint200x [14:29:50] great work folks! 
[14:29:55] !log jmm@cumin2002 START - Cookbook sre.aqs.roll-restart-reboot rolling restart_daemons on A:aqs-canary [14:30:28] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve1006.eqiad.wmnet with OS bullseye [14:30:37] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve1008.eqiad.wmnet with OS bullseye [14:30:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.aqs.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:aqs-canary [14:32:47] !log jmm@cumin2002 START - Cookbook sre.aqs.roll-restart-reboot rolling restart_daemons on A:aqs-codfw [14:33:21] (03PS2) 10Hashar: elastic relforge: rm rules for Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/893457 (https://phabricator.wikimedia.org/T78705) [14:33:28] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve1006.eqiad.wmnet with OS bullseye [14:34:52] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=thanos-fe2002.codfw.wmnet,service=thanos-web [14:34:54] (03CR) 10Hashar: "The iptables rules from Jenkins to Relforge Elastic search are no more used. It was a one off experiment back in 2014/2015 :) The rules a" [puppet] - 10https://gerrit.wikimedia.org/r/893457 (https://phabricator.wikimedia.org/T78705) (owner: 10Hashar) [14:37:20] marostegui: okay if I do some schema changes on eqiad masters? [14:40:17] 10SRE: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T330881 (10Serviziperinternet) [14:40:34] Amir1: We said no db maintenance till monday [14:40:56] Amir1: also, eqiad -> codfw replication is still enabled (that's why) [14:42:01] ah I forgot, sorry. 
It wasn't planned to replicate but I see, nothing urgent [14:45:11] !log dcaro@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1005 [14:45:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.aqs.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:aqs-codfw [14:45:24] !log jmm@cumin2002 START - Cookbook sre.aqs.roll-restart-reboot rolling restart_daemons on A:aqs-eqiad [14:47:01] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:47:46] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: scrape pint [puppet] - 10https://gerrit.wikimedia.org/r/893466 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [14:48:21] (03PS1) 10David Caro: harbor: move to epp template for the config file [puppet] - 10https://gerrit.wikimedia.org/r/893480 [14:48:23] (03PS1) 10David Caro: harbor: Add robot accounts info [puppet] - 10https://gerrit.wikimedia.org/r/893481 [14:48:24] Thank you all for helping make this a really smooth switchover <3 [14:49:42] (03PS1) 10Hnowlan: thumbor: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/893482 [14:49:44] jouncebot: nowandnext [14:49:44] For the next 0 hour(s) and 10 minute(s): Datacenter Switchover - Mediawiki (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230301T1400) [14:49:44] In 3 hour(s) and 10 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230301T1800) [14:49:59] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:50:47] (03CR) 10CI reject: [V: 04-1] harbor: Add robot accounts info [puppet] - 10https://gerrit.wikimedia.org/r/893481 (owner: 10David Caro) 
[14:52:21] !log dcaro@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1005 [14:53:09] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:28] FYI if you re-run the wmf-update-known-hosts-production you get the DB master known hosts updated ;) [14:54:51] (03CR) 10Hnowlan: [C: 03+2] thumbor: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/893482 (owner: 10Hnowlan) [14:56:03] (03PS3) 10David Caro: wmcs.ceph: move cloudcephosd1005/1010 to f4 [puppet] - 10https://gerrit.wikimedia.org/r/888663 (https://phabricator.wikimedia.org/T329504) [14:56:39] (03CR) 10David Caro: [C: 03+2] wmcs.ceph: move cloudcephosd1005/1010 to f4 [puppet] - 10https://gerrit.wikimedia.org/r/888663 (https://phabricator.wikimedia.org/T329504) (owner: 10David Caro) [14:57:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.aqs.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:aqs-eqiad [14:58:25] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:59:29] (03Merged) 10jenkins-bot: thumbor: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/893482 (owner: 10Hnowlan) [14:59:40] (03PS1) 10Hashar: contint: manage dsh target from Puppet DB [puppet] - 10https://gerrit.wikimedia.org/r/893483 [14:59:42] (03PS1) 10Hashar: contint: manage jenkins-ci dsh group from Puppet DB [puppet] - 10https://gerrit.wikimedia.org/r/893484 (https://phabricator.wikimedia.org/T328920) [14:59:44] (03PS1) 10Hashar: releases: manage jenkins-rel dsh group from Puppet DB [puppet] - 10https://gerrit.wikimedia.org/r/893485 (https://phabricator.wikimedia.org/T323909) [15:00:37] PROBLEM - Check systemd state on 
an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:55] (03PS1) 10Muehlenhoff: Add a cookbook to roll-restart Restbase [cookbooks] - 10https://gerrit.wikimedia.org/r/893486 [15:01:03] PROBLEM - IPMI Sensor Status on ml-cache1001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:02:11] (03CR) 10Hashar: "I am not sure who can best review this change to how the dsh targets are generated. Giuseppe has introduced the pattern in https://gerrit." [puppet] - 10https://gerrit.wikimedia.org/r/893483 (owner: 10Hashar) [15:02:23] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [15:02:35] (03CR) 10CI reject: [V: 04-1] Add a cookbook to roll-restart Restbase [cookbooks] - 10https://gerrit.wikimedia.org/r/893486 (owner: 10Muehlenhoff) [15:04:08] (03PS2) 10Muehlenhoff: Add a cookbook to roll-restart Restbase [cookbooks] - 10https://gerrit.wikimedia.org/r/893486 [15:04:51] !log root@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1005'] [15:06:10] !log Restarting Apache on Gerrit host [15:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:00] (03PS3) 10Muehlenhoff: Add a cookbook to roll-restart Restbase [cookbooks] - 10https://gerrit.wikimedia.org/r/893486 [15:08:21] (03PS2) 10David Caro: harbor: Add robot accounts info [puppet] - 10https://gerrit.wikimedia.org/r/893481 [15:08:43] 10SRE-tools, 10Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ayounsi) [15:09:05] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve1006.eqiad.wmnet with OS bullseye [15:09:18] !log 
elukey@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve1006.eqiad.wmnet with OS bullseye [15:09:21] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:47] !log root@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1005'] [15:12:23] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [15:12:30] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [15:13:45] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:13:52] 10SRE-tools, 10Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ayounsi) [15:14:02] (03CR) 10Muehlenhoff: "If there's preference for a dedicated/different category other than "misc-clusters", happy to amend." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/893486 (owner: 10Muehlenhoff) [15:17:21] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:15] !log jmm@cumin2002 START - Cookbook sre.elasticsearch.restart-nginx rolling restart_daemons on A:elastic-canary [15:20:14] (03CR) 10JMeybohm: [C: 03+1] Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [15:20:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.elasticsearch.restart-nginx (exit_code=0) rolling restart_daemons on A:elastic-canary [15:21:59] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:22:18] !log root@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1005'] [15:23:29] (03PS1) 10Muehlenhoff: sre.elasticsearch.restart-nginx: Fix typo which breaks aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/893490 [15:26:11] 10SRE, 10SRE-tools, 10Spicerack: Retire sre.aqs.roll-restart cookbook - https://phabricator.wikimedia.org/T330889 (10MoritzMuehlenhoff) [15:26:21] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Retire sre.aqs.roll-restart cookbook - https://phabricator.wikimedia.org/T330889 (10MoritzMuehlenhoff) p:05Triage→03Medium [15:26:49] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:27:50] (03CR) 10Muehlenhoff: [C: 03+2] sre.elasticsearch.restart-nginx: Fix typo which breaks aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/893490 
(owner: 10Muehlenhoff) [15:28:11] !log root@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1005'] [15:30:23] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:32:21] !log jmm@cumin2002 START - Cookbook sre.elasticsearch.restart-nginx rolling restart_daemons on A:elastic-codfw [15:35:36] !log root@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1005'] [15:35:59] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:38:12] 10SRE, 10Citoid: citoid having stability issues - https://phabricator.wikimedia.org/T330768 (10JMeybohm) IIRC it's pretty common for citoid to get OOM killed from time to time and that that is kind of expected. 
[15:39:05] (03PS1) 10Jbond: P:confd: Add support for discovery facts [puppet] - 10https://gerrit.wikimedia.org/r/893496 (https://phabricator.wikimedia.org/T330849) [15:39:21] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: installation failed due to read-only database [15:39:34] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: installation failed due to read-only database [15:39:35] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:41:09] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:41:21] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:34] (03CR) 10Bking: [C: 03+1] elastic relforge: rm rules for Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/893457 (https://phabricator.wikimedia.org/T78705) (owner: 10Hashar) [15:44:45] !log root@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1005'] [15:44:48] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:46:10] 10SRE, 10Maps: Allow Wikimedia Maps usage on wikiness.pro - https://phabricator.wikimedia.org/T330881 
(10JMeybohm) [15:46:47] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:10] 10SRE, 10Maps: Allow Wikimedia Maps usage on wikiness.pro - https://phabricator.wikimedia.org/T330881 (10JMeybohm) Welcome @Serviziperinternet! As of https://wikitech.wikimedia.org/wiki/Maps/External_usage "//maps.wikimedia.org tiles may only be used by Wikimedia wikis, and sites hosted by Wikimedia Affiliates... [15:49:48] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:54:48] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:56:24] (03CR) 10Hashar: "recheck after deployment of https://gerrit.wikimedia.org/r/c/integration/config/+/893416" [deployment-charts] - 10https://gerrit.wikimedia.org/r/886358 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [15:57:18] !log dcaro@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1005'] [15:57:45] !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1005'] [16:00:17] !log dcaro@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1005.eqiad.wmnet with OS bullseye [16:01:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.elasticsearch.restart-nginx (exit_code=0) rolling restart_daemons on A:elastic-codfw [16:02:10] !log cr[23]-esams: manually 
adding brett's ssh-rsa to match https://gerrit.wikimedia.org/r/c/operations/homer/public/+/892551 [16:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:10] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: sync [16:05:45] 10Puppet, 10SRE, 10Infrastructure-Foundations: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jhathaway) @jbond SRV support does look interesting, it appears they did some work to make it more production ready, https://tickets.puppetlabs.com/browse/PUP-7550. There are a couple of open... [16:10:40] 10Puppet, 10SRE, 10Infrastructure-Foundations: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jhathaway) @MoritzMuehlenhoff & @jbond thanks for putting together this plan. I think the plan sounds really sensible. I am particularly curious as to how robust the backward compatibility is... [16:11:29] (03CR) 10JHathaway: "Would love if you could take another look at this when you have a moment." 
[puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [16:12:02] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [16:15:16] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [16:15:42] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [16:16:03] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:10] jouncebot: nowandnext [16:16:10] No deployments scheduled for the next 1 hour(s) and 43 minute(s) [16:16:10] In 1 hour(s) and 43 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230301T1800) [16:16:30] (03PS2) 10Majavah: Set OATHAuthMultipleDevicesMigrationStage to MIGRATION_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891833 (https://phabricator.wikimedia.org/T242031) [16:16:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891833 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [16:17:16] (03PS2) 10Stang: Update logo/wordmark/tagline for Serbian project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892955 (https://phabricator.wikimedia.org/T324545) [16:17:30] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [16:17:32] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [16:17:36] (03Merged) 10jenkins-bot: Set OATHAuthMultipleDevicesMigrationStage to MIGRATION_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891833 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [16:17:43] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [16:17:58] !log 
taavi@deploy2002 Started scap: Backport for [[gerrit:891833|Set OATHAuthMultipleDevicesMigrationStage to MIGRATION_OLD (T242031)]] [16:18:03] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [16:19:39] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [16:19:43] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [16:20:00] !log taavi@deploy2002 taavi: Backport for [[gerrit:891833|Set OATHAuthMultipleDevicesMigrationStage to MIGRATION_OLD (T242031)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [16:20:19] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [16:20:20] seeing T242031 getting work is exciting! [16:20:30] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [16:20:47] PROBLEM - Check systemd state on apt1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-aptrepo-apt2001.wikimedia.org.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:09] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [16:21:11] (03PS3) 10Stang: Update logo/wordmark/tagline for Serbian project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892955 (https://phabricator.wikimedia.org/T324545) [16:21:16] TheresNoTime: if you want to see that moving forward, reviews on https://gerrit.wikimedia.org/r/c/873892 would be very much appreciated [16:21:25] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:24:53] (03CR) 10Ahmon Dancy: [C: 03+1] scap bootstrap: use new installation mechanism [puppet] - 10https://gerrit.wikimedia.org/r/893473 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche) 
[16:25:13] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:21] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:891833|Set OATHAuthMultipleDevicesMigrationStage to MIGRATION_OLD (T242031)]] (duration: 08m 23s) [16:26:30] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [16:26:37] 10SRE, 10Maps: Allow Wikimedia Maps usage on wikiness.pro - https://phabricator.wikimedia.org/T330881 (10Aklapper) 05Open→03Stalled [16:28:26] !log rollback port 80 block in esams - T330683 [16:28:27] !log Remove dns3001 DNS request routing via juniper - T321309 [16:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:35] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [16:30:08] 10SRE, 10Maps: Allow Wikimedia Maps usage on wikiness.pro - https://phabricator.wikimedia.org/T330881 (10Aklapper) Please do not delete templates but fill them out: **Link to site**: ... **Purpose/details about your project**: ... **Wikimedia Affiliate supporting project**: ... 
[16:30:11] (03PS1) 10Jbond: wmflib::discovery::pooled_site: funtion to discover poled sites [puppet] - 10https://gerrit.wikimedia.org/r/893502 [16:36:01] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:36:07] (03PS1) 10Filippo Giunchedi: Address problems found by 'pint' [alerts] - 10https://gerrit.wikimedia.org/r/893504 (https://phabricator.wikimedia.org/T309182) [16:36:09] (03PS1) 10Filippo Giunchedi: Add 'pint' integration [alerts] - 10https://gerrit.wikimedia.org/r/893505 (https://phabricator.wikimedia.org/T309182) [16:37:07] (03CR) 10CI reject: [V: 04-1] Add 'pint' integration [alerts] - 10https://gerrit.wikimedia.org/r/893505 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [16:37:10] 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, 10cloud-services-team: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10bd808) >>! In T330847#8656124, @Marostegui wrote: > would Thursday 9th at 16:00 UTC work for you all? That date and time work for me. [16:39:37] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:42:56] (03CR) 10Elukey: [C: 03+2] kserve: add replicas setting for Deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/893417 (https://phabricator.wikimedia.org/T324542) (owner: 10Elukey) [16:42:58] (03PS2) 10Filippo Giunchedi: Add 'pint' integration [alerts] - 10https://gerrit.wikimedia.org/r/893505 (https://phabricator.wikimedia.org/T309182) [16:45:45] 10SRE, 10Traffic: HTTP URIs do not resolve from NL and DE? 
- https://phabricator.wikimedia.org/T330906 (10Vgutierrez) [16:46:39] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:47:19] 10SRE, 10Traffic: HTTP URIs do not resolve from NL and DE? - https://phabricator.wikimedia.org/T330906 (10hashar) [16:52:25] (03PS2) 10Jbond: wmflib::discovery::pooled_site: funtion to discover poled sites [puppet] - 10https://gerrit.wikimedia.org/r/893502 [16:52:27] (03PS1) 10Jbond: aptrepo: fix linting issues and docs [puppet] - 10https://gerrit.wikimedia.org/r/893510 [16:56:35] !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1005.eqiad.wmnet with OS bullseye [16:57:02] 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, 10cloud-services-team: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10dcaro) cc. @Raymond_Ndibe in case you want to try maintaindbusers at that time (uses labsdbaccounts) [16:57:32] (03PS2) 10Jbond: aptrepo: fix linting issues and docs [puppet] - 10https://gerrit.wikimedia.org/r/893510 [16:58:39] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I have tried to limit the cases where we use confd data to influence puppet runs, because that will propagate with even 30 minutes of dela" [puppet] - 10https://gerrit.wikimedia.org/r/893502 (owner: 10Jbond) [16:59:04] <_joe_> jbond: I might have misunderstood the intentions of your patch, but I hope my reservations are clear enough [16:59:26] <_joe_> basically we have a couple places where puppet runs depend on conftool state and I dread it [16:59:43] <_joe_> I fear that such functions would enable doing it more [17:00:08] _joe_: just about to jump into a meeting, might ping you to chat about it tomorrow, but yes that is exactly what I was trying to do :) [17:01:30] <_joe_> absolutely let's talk tomorrow :) [17:01:36] <_joe_> (I'm also in a
meeting) [17:01:47] cool thanks, I'll ping you tomorrow [17:05:07] !log dcaro@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1005'] [17:06:43] !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1005'] [17:17:11] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:19:10] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/893486 (owner: 10Muehlenhoff) [17:19:49] (03PS1) 10Elukey: admin_ng: set kserve values for ml-serve-{eqiad,codfw} clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/893513 (https://phabricator.wikimedia.org/T324542) [17:19:53] (03PS1) 10Arturo Borrero Gonzalez: labstore1004: allow incoming HTTP connections from cloudcontrol servers [puppet] - 10https://gerrit.wikimedia.org/r/893514 (https://phabricator.wikimedia.org/T330916) [17:20:14] (03CR) 10CI reject: [V: 04-1] labstore1004: allow incoming HTTP connections from cloudcontrol servers [puppet] - 10https://gerrit.wikimedia.org/r/893514 (https://phabricator.wikimedia.org/T330916) (owner: 10Arturo Borrero Gonzalez) [17:21:12] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream: Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10lbowmaker) [17:21:19] (03PS2) 10Arturo Borrero Gonzalez: labstore1004: allow incoming HTTP connections from cloudcontrol servers [puppet] - 10https://gerrit.wikimedia.org/r/893514 (https://phabricator.wikimedia.org/T330916) [17:21:53] (03PS3) 10Jbond: aptrepo: fix linting issues and docs [puppet] - 10https://gerrit.wikimedia.org/r/893510 [17:24:22] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host 
ml-serve1006.eqiad.wmnet with OS bullseye [17:24:47] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1005.eqiad.wmnet with OS bullseye [17:25:05] (03PS4) 10Jbond: aptrepo: fix linting issues and docs [puppet] - 10https://gerrit.wikimedia.org/r/893510 [17:25:49] cdanis: nothing to report (aokoth doesn't appear to be here at the moment) [17:25:53] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/893514/39893/labstore1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/893514 (https://phabricator.wikimedia.org/T330916) (owner: 10Arturo Borrero Gonzalez) [17:25:56] jbond: <3 [17:26:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39894/console" [puppet] - 10https://gerrit.wikimedia.org/r/893510 (owner: 10Jbond) [17:26:12] !log root@cumin1001 END (PASS) - Cookbook sre.k8s.upgrade-cluster (exit_code=0) Upgrade K8s version: Upgrade to k8s 1.23 [17:27:26] (03PS3) 10Arturo Borrero Gonzalez: labstore1004: allow incoming HTTP connections from cloudcontrol servers [puppet] - 10https://gerrit.wikimedia.org/r/893514 (https://phabricator.wikimedia.org/T330916) [17:27:55] (03PS2) 10Elukey: admin_ng: set kserve values for ml-serve-{eqiad,codfw} clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/893513 (https://phabricator.wikimedia.org/T324542) [17:35:35] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:36:07] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@9568478]: Deploy Airflow upgrade branch for analytics_test [17:36:13] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@9568478]: Deploy Airflow upgrade branch for analytics_test (duration: 00m 05s) [17:38:10] (03CR) 10Jbond: [V: 03+1 C: 03+2] 
aptrepo: fix linting issues and docs [puppet] - 10https://gerrit.wikimedia.org/r/893510 (owner: 10Jbond) [17:38:36] (03CR) 10David Caro: "This will stop being applied to labstore1004, as it will stop having maintaindbusers in it no?" [puppet] - 10https://gerrit.wikimedia.org/r/893514 (https://phabricator.wikimedia.org/T330916) (owner: 10Arturo Borrero Gonzalez) [17:38:59] (03PS3) 10Jbond: wmflib::discovery::pooled_site: funtion to discover poled sites [puppet] - 10https://gerrit.wikimedia.org/r/893502 [17:40:20] 10SRE, 10Traffic: HTTP URIs do not resolve from NL and DE? - https://phabricator.wikimedia.org/T330906 (10Ennomeijers) Thanks for the replies! Advising to use HTTPS over HTTP makes sense. But not supporting redirection from HTTP to HTTPS will in my opinion introduce a fundamental problem for using Wikidata a... [17:40:53] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:40:55] (03PS4) 10SBassett: Revert "admin: Add kelhurd to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/890848 [17:41:13] (03CR) 10Raymond Ndibe: "Hello arturo, thanks for helping out with this! it wasn't exactly obvious where this change was to be added. 
I have one small question, th" [puppet] - 10https://gerrit.wikimedia.org/r/893514 (https://phabricator.wikimedia.org/T330916) (owner: 10Arturo Borrero Gonzalez) [17:41:22] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1005.eqiad.wmnet with reason: host reimage [17:41:28] (03CR) 10CI reject: [V: 04-1] Revert "admin: Add kelhurd to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/890848 (owner: 10SBassett) [17:43:12] (03CR) 10David Caro: labstore1004: allow incoming HTTP connections from cloudcontrol servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893514 (https://phabricator.wikimedia.org/T330916) (owner: 10Arturo Borrero Gonzalez) [17:44:26] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1005.eqiad.wmnet with reason: host reimage [17:45:37] (03CR) 10Raymond Ndibe: labstore1004: allow incoming HTTP connections from cloudcontrol servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893514 (https://phabricator.wikimedia.org/T330916) (owner: 10Arturo Borrero Gonzalez) [17:46:40] (03CR) 10David Caro: labstore1004: allow incoming HTTP connections from cloudcontrol servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893514 (https://phabricator.wikimedia.org/T330916) (owner: 10Arturo Borrero Gonzalez) [17:46:55] (03Abandoned) 10Andrew Bogott: OpenStack: rename 'user' role to 'member' [puppet] - 10https://gerrit.wikimedia.org/r/893036 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [17:47:00] (03PS5) 10SBassett: Revert "admin: Add kelhurd to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/890848 [17:48:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:32] 10SRE, 10Traffic: HTTP URIs do not resolve 
from NL and DE? - https://phabricator.wikimedia.org/T330906 (10Nikki) I've noticed in the past few days that when I enter "wikidata.org" on my phone (using Vivaldi), it's sometimes really slow to load, but will load straightaway if I edit the URL to add https://. I do... [17:52:24] (03PS1) 10ArielGlenn: make sure all of dumpsdata1001-7 permit rsync from/to each other [puppet] - 10https://gerrit.wikimedia.org/r/893519 (https://phabricator.wikimedia.org/T330573) [17:57:57] (03PS3) 10Bking: elastic relforge: rm rules for Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/893457 (https://phabricator.wikimedia.org/T78705) (owner: 10Hashar) [17:58:14] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/893457 (https://phabricator.wikimedia.org/T78705) (owner: 10Hashar) [18:00:06] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230301T1800) [18:00:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:01:29] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1003.eqiad.wmnet with OS buster [18:01:38] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmooney@cumin1001" [18:02:35] (03PS1) 10Jbond: P:aptrepo: use new wmflib::discovery::pooled_sites [puppet] - 10https://gerrit.wikimedia.org/r/893522 (https://phabricator.wikimedia.org/T330849) [18:07:27] (03PS2) 10Jbond: P:aptrepo: use new wmflib::discovery::pooled_sites [puppet] - 10https://gerrit.wikimedia.org/r/893522 (https://phabricator.wikimedia.org/T330849) [18:08:41] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:09:31] PROBLEM - Host 
dns3001 is DOWN: PING CRITICAL - Packet loss = 100% [18:10:27] (03PS3) 10Jbond: P:aptrepo: use new wmflib::discovery::pooled_sites [puppet] - 10https://gerrit.wikimedia.org/r/893522 (https://phabricator.wikimedia.org/T330849) [18:11:11] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:11:47] PROBLEM - BFD status on cr3-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:12:07] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:12:14] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmooney@cumin1001" [18:12:19] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1005.eqiad.wmnet with OS bullseye [18:12:23] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:12:27] RECOVERY - Host dns3001 is UP: PING OK - Packet loss = 0%, RTA = 81.03 ms [18:14:26] (03CR) 10Bking: [C: 03+2] elastic relforge: rm rules for Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/893457 (https://phabricator.wikimedia.org/T78705) (owner: 10Hashar) [18:15:31] PROBLEM - Host dns3001 is DOWN: PING CRITICAL - Packet loss = 100% [18:18:47] (03PS4) 10Jbond: wmflib::discovery::pooled_site: funtion to discover poled sites [puppet] - 10https://gerrit.wikimedia.org/r/893502 [18:19:06] (03CR) 10Jbond: [C: 04-1] "adding the -1 back until discussed" [puppet] - 10https://gerrit.wikimedia.org/r/893502 (owner: 10Jbond) [18:19:45] (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:20:19] RECOVERY - Host dns3001 is UP: PING OK - Packet loss = 0%, RTA = 81.06 ms [18:20:37] PROBLEM - Bird Internet Routing Daemon on dns3001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [18:21:21] PROBLEM - Check systemd state on dns3001 is CRITICAL: CRITICAL - degraded: The following units failed: anycast-healthchecker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:21:23] PROBLEM - Check if anycast-healthchecker and all configured threads are running on dns3001 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [18:22:22] (03PS4) 10Jbond: P:aptrepo: use new wmflib::discovery::pooled_sites [puppet] - 10https://gerrit.wikimedia.org/r/893522 (https://phabricator.wikimedia.org/T330849) [18:22:25] RECOVERY - Bird Internet Routing Daemon on dns3001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [18:22:45] RECOVERY - BFD status on cr3-esams is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:22:59] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:23:09] RECOVERY - Check systemd state on dns3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:23:11] RECOVERY - Check if anycast-healthchecker and all configured threads are running on dns3001 is OK: OK: UP (pid=2568) and all threads (2) are running 
https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [18:23:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39899/console" [puppet] - 10https://gerrit.wikimedia.org/r/893522 (https://phabricator.wikimedia.org/T330849) (owner: 10Jbond) [18:23:53] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:24:45] (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:25:30] (03PS5) 10Jbond: P:aptrepo: use new wmflib::discovery::pooled_sites [puppet] - 10https://gerrit.wikimedia.org/r/893522 (https://phabricator.wikimedia.org/T330849) [18:27:42] (03PS6) 10Jbond: P:aptrepo: use new wmflib::discovery::pooled_sites [puppet] - 10https://gerrit.wikimedia.org/r/893522 (https://phabricator.wikimedia.org/T330849) [18:28:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39901/console" [puppet] - 10https://gerrit.wikimedia.org/r/893522 (https://phabricator.wikimedia.org/T330849) (owner: 10Jbond) [18:32:40] (03CR) 10Cwhite: [C: 03+2] logstash: remove SEVERITY_LABEL from syslog messages [puppet] - 10https://gerrit.wikimedia.org/r/890363 (https://phabricator.wikimedia.org/T330267) (owner: 10Cwhite) [18:33:42] (03PS3) 10Cwhite: profile: move phatality resources from module to profile [puppet] - 10https://gerrit.wikimedia.org/r/891392 [18:37:06] (03PS7) 10Jbond: P:aptrepo: use new wmflib::discovery::pooled_sites [puppet] - 10https://gerrit.wikimedia.org/r/893522 (https://phabricator.wikimedia.org/T330849) [18:44:12] (03CR) 10Cwhite: [C: 03+2] profile: move phatality resources from module to 
profile [puppet] - 10https://gerrit.wikimedia.org/r/891392 (owner: 10Cwhite) [18:44:49] (03CR) 10Jbond: P:aptrepo: use new wmflib::discovery::pooled_sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893522 (https://phabricator.wikimedia.org/T330849) (owner: 10Jbond) [18:45:33] (03CR) 10ArielGlenn: "pcc output looks reasonable. https://puppet-compiler.wmflabs.org/output/893519/39903/" [puppet] - 10https://gerrit.wikimedia.org/r/893519 (https://phabricator.wikimedia.org/T330573) (owner: 10ArielGlenn) [18:48:43] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for RMaung - https://phabricator.wikimedia.org/T330335 (10SRamkisson) Approved [18:48:58] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns3001.wikimedia.org with OS bullseye [18:49:09] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns3001.wikimedia.org with OS bullseye [18:50:23] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:53:45] PROBLEM - BFD status on cr3-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:53:55] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:54:53] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:55:45] (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:55:51] (03CR) 10Dzahn: [C: 03+2] define role owner for gerrit role [puppet] - 10https://gerrit.wikimedia.org/r/892587 (owner: 10Dzahn) [18:56:49] PROBLEM - Host 2620:0:862:1:91:198:174:61 is DOWN: PING CRITICAL - Packet loss = 100% [19:00:05] jnuche and hashar: Dear deployers, time to do the Train log triage with CPT deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230301T1900). [19:00:10] (03CR) 10Dzahn: "ooh, I see! Well, I am glad I made an exception and did not try to switch this one without asking first. But let me get back to this soon." [puppet] - 10https://gerrit.wikimedia.org/r/893086 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn) [19:00:45] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:05:06] (03CR) 10Dzahn: "Thank you! I did not expect this and glad I asked. This makes sense to me now." [puppet] - 10https://gerrit.wikimedia.org/r/720078 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [19:05:45] (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:08:57] (03PS2) 10Dzahn: devtools: change gerrit hostname to use wmcloud, not wmflabs [puppet] - 10https://gerrit.wikimedia.org/r/888808 (https://phabricator.wikimedia.org/T329444) [19:09:28] (03CR) 10Dzahn: "This is now waiting for T330312. Once the instance is running again we can merge this and check everything is ok." 
[puppet] - 10https://gerrit.wikimedia.org/r/888808 (https://phabricator.wikimedia.org/T329444) (owner: 10Dzahn) [19:09:39] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns3001.wikimedia.org with reason: host reimage [19:12:46] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns3001.wikimedia.org with reason: host reimage [19:25:38] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:26:00] RECOVERY - BFD status on cr3-esams is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:26:12] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:27:24] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:30:16] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:30:38] PROBLEM - BFD status on cr3-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:30:46] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:31:54] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:32:16] RECOVERY - BFD status on cr3-esams is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:32:26] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:36:55] !log brett@cumin2002 END (PASS) - 
Cookbook sre.hosts.reimage (exit_code=0) for host dns3001.wikimedia.org with OS bullseye [19:37:06] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns3001.wikimedia.org with OS bullseye completed: - dns3001 (**PASS**) - Downtimed on Icinga/Al... [19:39:46] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Install NVMe SSDs into moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (10Jclark-ctr) @LSobanski did you have the final Server figured out? [19:47:48] !log re-adding dns3001 to next-hop routing via juniper - T321309 [19:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:55] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [19:50:55] (03PS1) 10BCornwall: Revert "ntp/esams: set to dns3002" [dns] - 10https://gerrit.wikimedia.org/r/893550 [19:51:00] (03PS2) 10BCornwall: Revert "ntp/esams: set to dns3002" [dns] - 10https://gerrit.wikimedia.org/r/893550 [19:51:36] (03CR) 10Ssingh: [C: 03+1] Revert "ntp/esams: set to dns3002" [dns] - 10https://gerrit.wikimedia.org/r/893550 (owner: 10BCornwall) [19:52:48] (03CR) 10BCornwall: [C: 03+2] Revert "ntp/esams: set to dns3002" [dns] - 10https://gerrit.wikimedia.org/r/893550 (owner: 10BCornwall) [19:54:02] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [20:03:33] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T330930 (10phaultfinder) [20:33:35] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T330930 (10phaultfinder) [20:40:16] (03CR) 10Dzahn: [C: 03+2] "thanks! looks good to me. 
https://puppet-compiler.wmflabs.org/output/887738/39904/contint1002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/887738 (https://phabricator.wikimedia.org/T329056) (owner: 10Hashar) [20:41:58] (03PS6) 10Dzahn: contint: regroup common firewalling rules [puppet] - 10https://gerrit.wikimedia.org/r/887738 (https://phabricator.wikimedia.org/T329056) (owner: 10Hashar) [20:43:12] !log move rev_comment_id migration screens from mwmaint1002 to mwmaint2002 # T275246 [20:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:18] T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246 [20:44:18] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/output/887738/39905/contint1002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/887738 (https://phabricator.wikimedia.org/T329056) (owner: 10Hashar) [20:51:26] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "confirmed this was a noop on contint1002 and contint2002" [puppet] - 10https://gerrit.wikimedia.org/r/887738 (https://phabricator.wikimedia.org/T329056) (owner: 10Hashar) [20:51:57] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Security, 10WMF-NDA: Re-establish DMARC reporting analysis - https://phabricator.wikimedia.org/T330944 (10jhathaway) [20:52:34] (03CR) 10Dzahn: [C: 04-1] "seems like there is still discussion about this on a mailing list" [puppet] - 10https://gerrit.wikimedia.org/r/699493 (https://phabricator.wikimedia.org/T228759) (owner: 10Aklapper) [20:53:19] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Security, 10WMF-NDA: Re-establish DMARC reporting analysis - https://phabricator.wikimedia.org/T330944 (10jhathaway) Quote:{F36887400} [20:54:32] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Security, 10WMF-NDA: Re-establish DMARC reporting analysis - https://phabricator.wikimedia.org/T330944 (10jhathaway) [20:54:36] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Security, 10WMF-NDA: Re-establish DMARC 
reporting analysis - https://phabricator.wikimedia.org/T330944 (10jhathaway) Quote:{F36887400} [20:54:42] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Security, 10WMF-NDA: Re-establish DMARC reporting analysis - https://phabricator.wikimedia.org/T330944 (10taavi) [20:55:24] (03CR) 10Dzahn: "I am aware there might be more discussion waiting for how and where this should be hosted.. but on the other hand.. making this specific D" [dns] - 10https://gerrit.wikimedia.org/r/815376 (https://phabricator.wikimedia.org/T313355) (owner: 10CDanis) [20:55:50] PROBLEM - Host dns2002 is DOWN: PING CRITICAL - Packet loss = 100% [20:57:58] (03CR) 10Dzahn: "let's add John for his opinion on this" [puppet] - 10https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: 10Legoktm) [20:58:24] RECOVERY - Host dns2002 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [20:58:38] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:59:14] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:59:14] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:59:38] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230301T2100). [21:00:05] Superpes: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:14] * TheresNoTime can deploy [21:00:34] Uh I completely forgot about it lol [21:00:38] Thanks TheresNoTime :P [21:01:00] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:01:00] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:01:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893089 (https://phabricator.wikimedia.org/T329047) (owner: 10Superpes15) [21:01:24] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 106, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:02:03] (03Merged) 10jenkins-bot: [trwiki] Reverting logo change for Vector 2022 and Vector legacy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893089 (https://phabricator.wikimedia.org/T329047) (owner: 10Superpes15) [21:02:08] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 181, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:02:21] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns2002.wikimedia.org with OS bullseye [21:02:26] !log samtar@deploy2002 Started scap: Backport for [[gerrit:893089|[trwiki] Reverting logo change for Vector 2022 and Vector legacy (T329047)]] [21:02:30] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns2002.wikimedia.org with OS bullseye [21:02:32] T329047: Temporary logo change for trwiki - https://phabricator.wikimedia.org/T329047 [21:04:13] !log samtar@deploy2002 superpes and samtar: Backport for [[gerrit:893089|[trwiki] Reverting logo change for Vector 2022 and Vector legacy (T329047)]] synced 
to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:04:25] Superpes: can you test? :) [21:04:29] Looking :) [21:05:34] PROBLEM - Host 2620:0:860:4:208:80:153:111 is DOWN: CRITICAL - Destination Unreachable (2620:0:860:4:208:80:153:111) [21:05:37] TheresNoTime It works on both Vector 2022 and Vector legacy :D [21:05:44] syncing :) [21:06:18] Thanks :) [21:06:42] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:07:28] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:08:06] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:08:06] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:08:24] I assume all the alerts is brett with dns2002 [21:08:41] oh balls, forgot to mention, yes [21:09:26] PROBLEM - Recursive DNS on 208.80.153.111 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [21:09:45] (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:09:49] brett: blame dns. 
Dns can be blamed for everything [21:10:15] It shouldn’t be so noisy [21:10:42] https://www.irccloud.com/pastebin/jnZofe81/ [21:11:00] Heh [21:11:06] I might have that framed TheresNoTime [21:11:34] :D [21:11:56] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:893089|[trwiki] Reverting logo change for Vector 2022 and Vector legacy (T329047)]] (duration: 09m 30s) [21:12:02] T329047: Temporary logo change for trwiki - https://phabricator.wikimedia.org/T329047 [21:12:11] Superpes: live, can you confirm? [21:12:48] Yep confirm! Many thanks TheresNoTime :D [21:13:12] o7 [21:14:32] RECOVERY - Host 2620:0:860:4:208:80:153:111 is UP: PING OK - Packet loss = 0%, RTA = 33.39 ms [21:14:42] * TheresNoTime will be around for another 15 minutes or so if there's any other patches needing deployment [21:16:19] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns2002.wikimedia.org with reason: host reimage [21:18:08] PROBLEM - Recursive DNS on 2620:0:860:4:208:80:153:111 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [21:18:48] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns2002.wikimedia.org with reason: host reimage [21:19:45] (JobUnavailable) resolved: Reduced availability for job pdnsrec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:23:53] !log closing UTC late backport window [21:23:53] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/889248/39906/" [puppet] - 10https://gerrit.wikimedia.org/r/889248 (https://phabricator.wikimedia.org/T324149) (owner: 10Brennen Bearnes) [21:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:36] !log mforns@deploy2002 Started deploy [analytics/refinery@d4d723a]: Regular analytics weekly 
train [analytics/refinery@d4d723a]
[21:28:58] RECOVERY - Recursive DNS on 2620:0:860:4:208:80:153:111 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[21:29:20] RECOVERY - Recursive DNS on 208.80.153.111 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[21:30:14] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 106, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:31:02] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 181, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:33:30] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:33:30] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:35:24] (PS1) Ladsgroup: Revert "Revert "mwscript: Switch to use run.php"" [mediawiki-config] - https://gerrit.wikimedia.org/r/893552 (https://phabricator.wikimedia.org/T326800)
[21:35:40] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:36:46] SRE, Maps: Allow Wikimedia Maps usage on wikiness.pro - https://phabricator.wikimedia.org/T330881 (Dzahn) Per the description on the front page, which I translated with Google Translate, the goal of this project is to "highlight the business sector,** large companies**", and "**very high level CEOs** bel...
[21:37:30] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 106, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:37:35] (PS1) Eevans: data-persistence: alert on elevated sessions store error rate (5xx) [alerts] - https://gerrit.wikimedia.org/r/893538 (https://phabricator.wikimedia.org/T327960)
[21:37:39] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns2002.wikimedia.org with OS bullseye
[21:37:50] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns2002.wikimedia.org with OS bullseye completed: - dns2002 (**PASS**) - Downtimed on Icinga/Al...
[21:38:31] !log mforns@deploy2002 Finished deploy [analytics/refinery@d4d723a]: Regular analytics weekly train [analytics/refinery@d4d723a] (duration: 10m 55s)
[21:39:27] !log mforns@deploy2002 Started deploy [analytics/refinery@d4d723a] (thin): Regular analytics weekly train THIN [analytics/refinery@d4d723a]
[21:39:34] !log mforns@deploy2002 Finished deploy [analytics/refinery@d4d723a] (thin): Regular analytics weekly train THIN [analytics/refinery@d4d723a] (duration: 00m 07s)
[21:39:51] !log mforns@deploy2002 Started deploy [analytics/refinery@d4d723a] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d4d723a]
[21:40:49] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[21:41:14] !log mforns@deploy2002 Finished deploy [analytics/refinery@d4d723a] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d4d723a] (duration: 01m 22s)
[21:42:51] (PS1) BCornwall: ntp/codfw: set to dns2002 [dns] - https://gerrit.wikimedia.org/r/893539
[21:43:57] (PS2) BCornwall: ntp/codfw: set to dns2002 [dns] - https://gerrit.wikimedia.org/r/893539
[22:06:21] (PS2)
Cwhite: toil: restart opensearch-dashboards every wednesday [puppet] - https://gerrit.wikimedia.org/r/891394 (https://phabricator.wikimedia.org/T327161)
[22:06:31] (CR) CI reject: [V: -1] toil: restart opensearch-dashboards every wednesday [puppet] - https://gerrit.wikimedia.org/r/891394 (https://phabricator.wikimedia.org/T327161) (owner: Cwhite)
[22:09:05] (PS3) Cwhite: toil: restart opensearch-dashboards every wednesday [puppet] - https://gerrit.wikimedia.org/r/891394 (https://phabricator.wikimedia.org/T327161)
[22:09:26] (CR) CI reject: [V: -1] toil: restart opensearch-dashboards every wednesday [puppet] - https://gerrit.wikimedia.org/r/891394 (https://phabricator.wikimedia.org/T327161) (owner: Cwhite)
[22:11:37] (PS4) Cwhite: toil: restart opensearch-dashboards every wednesday [puppet] - https://gerrit.wikimedia.org/r/891394 (https://phabricator.wikimedia.org/T327161)
[22:14:11] (CR) Cwhite: "PCC: https://puppet-compiler.wmflabs.org/output/891394/39907/" [puppet] - https://gerrit.wikimedia.org/r/891394 (https://phabricator.wikimedia.org/T327161) (owner: Cwhite)
[22:16:32] RECOVERY - Check systemd state on apt1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:25:36] PROBLEM - Check systemd state on apt1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-aptrepo-apt2001.wikimedia.org.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:26:16] Doing some firmware upgrades and then reimaging on dns1002
[22:37:30] (PS1) Nray: Revert "Add static "Cleopatra" page to facilitate synthetic testing of 885362" [mediawiki-config] - https://gerrit.wikimedia.org/r/893542 (https://phabricator.wikimedia.org/T326147)
[22:40:28] PROBLEM - Host dns1002 is DOWN: PING CRITICAL - Packet loss = 100%
[22:42:09] !log mforns@deploy2002 Started deploy [airflow-dags/analytics@51e92b1]: (no justification provided)
[22:42:14] RECOVERY - Host dns1002 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[22:42:31] !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@51e92b1]: (no justification provided) (duration: 00m 21s)
[22:42:39] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns1002.wikimedia.org with OS bullseye
[22:42:50] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns1002.wikimedia.org with OS bullseye
[22:43:04] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:43:08] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:43:16] (CR) Dzahn: "There is the systemd service "rsync-aptrepo-apt2001.wikimedia.org" on apt1001. And it fails because it tries to push from 1001 to 2001 but" [puppet] - https://gerrit.wikimedia.org/r/893409 (https://phabricator.wikimedia.org/T328907) (owner: Jbond)
[22:45:28] !log mforns@deploy2002 Started deploy [airflow-dags/analytics@1fb5c4a]: (no justification provided)
[22:45:36] PROBLEM - Host 2620:0:861:4:208:80:155:108 is DOWN: CRITICAL - Destination Unreachable (2620:0:861:4:208:80:155:108)
[22:45:52] !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@1fb5c4a]: (no justification provided) (duration: 00m 23s)
[22:49:45] (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:50:48] PROBLEM - Recursive DNS on 208.80.155.108 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[22:51:00] SRE, Data-Persistence,
serviceops, Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 (Dzahn) After the switch of the apt servers we are getting alerting about bad systemd status on apt1001. ` <+icinga-wm> PROBLEM - Check systemd state...
[22:51:58] !log apt1001 - systemctl reset-failed T328907
[22:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:52:04] T328907: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907
[22:52:34] RECOVERY - Check systemd state on apt1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:54:45] (JobUnavailable) resolved: Reduced availability for job pdnsrec in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:56:59] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns1002.wikimedia.org with reason: host reimage
[22:57:28] RECOVERY - Host 2620:0:861:4:208:80:155:108 is UP: PING OK - Packet loss = 0%, RTA = 0.16 ms
[22:58:04] SRE, Maps: Allow Wikimedia Maps usage on wikiness.pro - https://phabricator.wikimedia.org/T330881 (Aklapper) Stalled→Declined
[23:01:15] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns1002.wikimedia.org with reason: host reimage
[23:02:24] PROBLEM - Recursive DNS on 2620:0:861:4:208:80:155:108 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[23:10:46] RECOVERY - Recursive DNS on 208.80.155.108 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[23:11:26] RECOVERY - Recursive DNS on 2620:0:861:4:208:80:155:108 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[23:15:50] (PS1) Andrew Bogott: OpenStack: collapse
'user' OpenStack role into 'reader' role [puppet] - https://gerrit.wikimedia.org/r/893545 (https://phabricator.wikimedia.org/T330759)
[23:21:10] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:21:14] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:23:09] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns1002.wikimedia.org with OS bullseye
[23:23:20] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns1002.wikimedia.org with OS bullseye completed: - dns1002 (**PASS**) - Downtimed on Icinga/Al...
[23:26:06] (PS1) BCornwall: ntp/eqiad: set to dns1002 [dns] - https://gerrit.wikimedia.org/r/893566
[23:27:22] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)