[00:09:04] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs2021.codfw.wmnet with OS bullseye [00:39:19] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/929750 [00:39:25] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/929750 (owner: TrainBranchBot) [00:53:12] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:53:20] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:53:44] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 2/4 UP : OSPFv3: 2/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:54:12] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:54:16] PROBLEM - BFD status on cr3-knams is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:57:34] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:58:57] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/929750 (owner: TrainBranchBot) [01:05:28] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2021.codfw.wmnet with OS bullseye [01:18:36] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:18:46] RECOVERY - BFD status on cr3-knams is OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:19:06] RECOVERY - BFD status on cr1-eqiad is
OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:19:12] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:19:14] RECOVERY - Check systemd state on analytics1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:19:16] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:19:42] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:27:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:30:44] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs2021.codfw.wmnet with OS bullseye [01:41:09] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [01:41:38] PROBLEM - WDQS SPARQL on wdqs2022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 683 bytes in 1.187 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [01:41:42] PROBLEM - Query Service HTTP Port on wdqs2022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 649 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [01:41:54] PROBLEM - Check systemd state on wdqs2022 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:43:26] RECOVERY - Check systemd state on wdqs2022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:47:33] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2021.codfw.wmnet with reason: host reimage [01:48:02] PROBLEM - Check systemd state on wdqs2022 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:49:43] (SystemdUnitFailed) firing: (3) wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:50:38] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2021.codfw.wmnet with reason: host reimage [01:51:44] (SystemdUnitCrashLoop) firing: (2) prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2022:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [01:57:23] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 20 days, 0:00:00 on 
wdqs2022.codfw.wmnet with reason: attempting WDQS stack on bullseye [01:57:36] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20 days, 0:00:00 on wdqs2022.codfw.wmnet with reason: attempting WDQS stack on bullseye [02:03:55] RECOVERY - Check systemd state on wdqs2022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:34] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:10:03] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:29:21] RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:32:34] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:03:20] (CR) Tim Starling: "I manually deployed it to mwdebug1001 by doing" [mediawiki-config] - https://gerrit.wikimedia.org/r/925654 (owner: Tim Starling) [03:03:34] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T338333 (phaultfinder) [03:50:41] PROBLEM - Blazegraph Port for wdqs-categories on wdqs2021 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:51:01] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0
processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:51:09] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2021 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:51:09] PROBLEM - Query Service HTTP Port on wdqs2021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 364 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [03:51:19] PROBLEM - WDQS SPARQL on wdqs2021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 398 bytes in 1.181 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:51:43] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:51:45] PROBLEM - Check systemd state on wdqs2021 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wdqs-blazegraph.service,prometheus-blazegraph-exporter-wdqs-categories.service,wdqs-blazegraph.service,wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [04:43:07] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 4 (people1004, ...), Fresh: 120 jobs 
https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:43:27] SRE-swift-storage, CX-deployments, MinT, Language-Team (Language-2023-April-June): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491 (santhosh) This service is also designed in a way that any person or organization interested can just run it in th... [04:47:03] (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:52:03] (ProbeDown) resolved: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:53:34] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T338333 (phaultfinder) [05:11:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:13:42] SRE, ops-eqiad, database-backups: db1139 rebooted - https://phabricator.wikimedia.org/T338766 (jcrespo) Let me create some stale backups from the server and I will let you know when the server is down and ready for servicing. [05:15:23] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000).
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:16:53] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:21:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:26:36] (PS2) KartikMistry: Enable Content and Section Translation for a 2nd group of 9 languages previously lacking machine translation [mediawiki-config] - https://gerrit.wikimedia.org/r/929741 (https://phabricator.wikimedia.org/T337669) [05:43:11] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:44:41] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:53:18] SRE-swift-storage, CX-deployments, MinT, Language-Team (Language-2023-April-June): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491 (elukey) The architecture of MinT is moving towards what we offer for Lift Wing, that is a standardized way to pro...
[06:00:06] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230614T0600) [06:13:38] (PS2) Ayounsi: Replace Capirca with Aerleon [software/homer] - https://gerrit.wikimedia.org/r/929333 (https://phabricator.wikimedia.org/T337082) [06:14:07] (CR) Ayounsi: Replace Capirca with Aerleon (1 comment) [software/homer] - https://gerrit.wikimedia.org/r/929333 (https://phabricator.wikimedia.org/T337082) (owner: Ayounsi) [06:15:08] (CR) Slyngshede: [C: +2] P:IDM Switch production server to MariaDB [puppet] - https://gerrit.wikimedia.org/r/929699 (owner: Slyngshede) [06:21:12] (PS1) Slyngshede: C:idm mising database engine [puppet] - https://gerrit.wikimedia.org/r/929944 [06:21:47] (CR) Slyngshede: [C: +2] C:idm mising database engine [puppet] - https://gerrit.wikimedia.org/r/929944 (owner: Slyngshede) [06:23:21] SRE, DNS, Domains, Traffic: Update DNS records for mastodon.wikimedia.org - https://phabricator.wikimedia.org/T337586 (Mschon) >>! In T337586#8929595, @BCornwall wrote: > Yeah, the document is pretty barren. It sounds like there needs to be a little bit more planning! it looks like the planning...
[06:30:54] (PS1) Marostegui: orchestrator.conf: Add test-s4 to automatic recovery [puppet] - https://gerrit.wikimedia.org/r/929945 (https://phabricator.wikimedia.org/T322993) [06:31:07] (PS1) Ayounsi: Update Homer and Rancid SSH keys to ed25519 [homer/public] - https://gerrit.wikimedia.org/r/929946 (https://phabricator.wikimedia.org/T336769) [06:31:24] (CR) Marostegui: [C: +2] orchestrator.conf: Add test-s4 to automatic recovery [puppet] - https://gerrit.wikimedia.org/r/929945 (https://phabricator.wikimedia.org/T322993) (owner: Marostegui) [06:32:34] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:33:11] (PS3) Jelto: gitlab: make oauth client identifier configurable [puppet] - https://gerrit.wikimedia.org/r/929718 (https://phabricator.wikimedia.org/T320390) [06:37:01] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41713/console" [puppet] - https://gerrit.wikimedia.org/r/929718 (https://phabricator.wikimedia.org/T320390) (owner: Jelto) [06:37:17] (CR) Ayounsi: [C: +2] Update Homer and Rancid SSH keys to ed25519 [homer/public] - https://gerrit.wikimedia.org/r/929946 (https://phabricator.wikimedia.org/T336769) (owner: Ayounsi) [06:38:39] (PS1) Marostegui: orchestrator.conf: Add intermediate master recovery [puppet] - https://gerrit.wikimedia.org/r/929947 (https://phabricator.wikimedia.org/T322993) [06:39:49] (CR) Jelto: [V: +1] gitlab: make oauth client identifier configurable (1 comment) [puppet] - https://gerrit.wikimedia.org/r/929718 (https://phabricator.wikimedia.org/T320390) (owner: Jelto) [06:40:09] (CR) Marostegui: [C: +2] orchestrator.conf: Add intermediate master recovery [puppet]
- https://gerrit.wikimedia.org/r/929947 (https://phabricator.wikimedia.org/T322993) (owner: Marostegui) [06:41:39] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [06:46:39] (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [06:50:00] (PS1) Marostegui: db1124,db1125,db1133: Binlog set to SBR [puppet] - https://gerrit.wikimedia.org/r/929948 (https://phabricator.wikimedia.org/T322993) [06:51:39] (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [06:53:21] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (ayounsi) [06:55:20] (PS1) Gergő Tisza: Structured tasks: Fix toolbar rewriting [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - https://gerrit.wikimedia.org/r/929966 (https://phabricator.wikimedia.org/T338934) [06:55:38] (PS1) Gergő Tisza: Section images: Pass section parameters to VE in add image tasks [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - https://gerrit.wikimedia.org/r/929967 (https://phabricator.wikimedia.org/T339046) [06:56:02] (PS1) Slyngshede: C:idm::deployment add missing packages. [puppet] - https://gerrit.wikimedia.org/r/929951 [06:56:32] (CR) CI reject: [V: -1] C:idm::deployment add missing packages.
[puppet] - https://gerrit.wikimedia.org/r/929951 (owner: Slyngshede) [06:57:06] (PS1) Muehlenhoff: Allow the IDMs on access m5's dbproxy [puppet] - https://gerrit.wikimedia.org/r/929952 (https://phabricator.wikimedia.org/T338008) [06:57:32] (CR) CI reject: [V: -1] Allow the IDMs on access m5's dbproxy [puppet] - https://gerrit.wikimedia.org/r/929952 (https://phabricator.wikimedia.org/T338008) (owner: Muehlenhoff) [07:00:06] Amir1, Urbanecm, and taavi: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230614T0700). Please do the needful. [07:00:06] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:15] (PS2) Muehlenhoff: Allow the IDMs on access m5's dbproxy [puppet] - https://gerrit.wikimedia.org/r/929952 (https://phabricator.wikimedia.org/T338008) [07:00:23] o/ [07:00:32] * taavi assumes kart_ will self-deploy [07:00:39] (CR) CI reject: [V: -1] Allow the IDMs on access m5's dbproxy [puppet] - https://gerrit.wikimedia.org/r/929952 (https://phabricator.wikimedia.org/T338008) (owner: Muehlenhoff) [07:00:41] (PS2) Slyngshede: C:idm::deployment add missing packages. [puppet] - https://gerrit.wikimedia.org/r/929951 [07:00:47] 0/ [07:01:29] is someone looking into the "No space left on device" errors? [07:01:33] (PS3) Muehlenhoff: Allow the IDMs on access m5's dbproxy [puppet] - https://gerrit.wikimedia.org/r/929952 (https://phabricator.wikimedia.org/T338008) [07:01:53] taavi: yeah. I'm going ahead.. [07:02:01] also, are the DBs down? [07:02:10] uh? [07:02:19] tgr_: which errors? which dbs? [07:02:19] Did I hear DBs down?
[07:02:25] that's what wikitech claims when I try to edit, at least [07:02:33] [417bc339-b8d0-4728-9004-37c5bbf06112] Caught exception of type Wikimedia\Rdbms\DBConnectionError [07:02:44] Cannot access the database: Cannot access the database: could not connect to any replica DB server; Connection timed out (es1020) [07:03:09] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/929741 (https://phabricator.wikimedia.org/T337669) (owner: KartikMistry) [07:03:26] tgr_: Can you try again? I just edited wikitech fine [07:03:33] as I said in _security already it's probably a firewall issue, but I don't know what changed that wikitech is suddenly trying to connect there [07:03:57] (Merged) jenkins-bot: Enable Content and Section Translation for a 2nd group of 9 languages previously lacking machine translation [mediawiki-config] - https://gerrit.wikimedia.org/r/929741 (https://phabricator.wikimedia.org/T337669) (owner: KartikMistry) [07:04:01] marostegui: I'm still getting that error [07:04:09] :-/ [07:04:18] tgr_: it seems to be working fine for me [07:04:29] !log Test [07:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:36] !log kartik@deploy1002 Started scap: Backport for [[gerrit:929741|Enable Content and Section Translation for a 2nd group of 9 languages previously lacking machine translation (T337669)]] [07:04:37] That also went through [07:04:39] T337669: Enable MinT, Content and Section Translation for a 2nd group of 10 languages previously lacking machine translation - https://phabricator.wikimedia.org/T337669 [07:04:50] there's also this: https://logstash.wikimedia.org/app/dashboards#/view/mediawiki-errors?_g=h@dcc6a94&_a=h@5347b78 [07:05:20] (unrelated, it seems to be only affecting mwmaint - out of log space?) [07:05:53] marostegui: e.g.
https://wikitech.wikimedia.org/w/index.php?title=Add_Image&oldid=prev&diff=1950552&markasread=864026 [07:06:02] um, indeed [07:06:06] /dev/mapper/mwmaint1002--vg-root 102G 96G 0 100% / [07:06:13] !log kartik@deploy1002 kartik: Backport for [[gerrit:929741|Enable Content and Section Translation for a 2nd group of 9 languages previously lacking machine translation (T337669)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [07:06:29] PROBLEM - Check systemd state on idm1001 is CRITICAL: CRITICAL - degraded: The following units failed: expire_bitu_signups.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:07:13] could someone who can edit wikitech add https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/929966 and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/929967 to the backport window? [07:07:40] Amir1: Can you clean up some of your /home in mwmaint1002? It is 52G at the moment, so half of the partition size :) [07:07:47] I pruned 5G from my ~ to immediately unbreak things [07:08:28] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/929952 (https://phabricator.wikimedia.org/T338008) (owner: Muehlenhoff) [07:09:01] Yeah, wikitech is definitely giving intermittent errors [07:09:23] I don't see anything wrong with the databases, there's not even lag there [07:09:49] marostegui: I'm not sure of the details, but my suspicion is that it has something to do with Amir running moveToExternal.php [07:09:59] the DB errors: https://logstash.wikimedia.org/goto/de234ca07619e6eb18ed3d4aee80aafe [07:10:09] very modest amount, and wikitech only [07:10:25] Do you both feel I should page him or can this wait?
[07:10:39] As I don't know anything about moveToExternal.php :) [07:10:39] unlike other wikis, wikitech hosts are in the public VLAN so they need special firewall rules to access any prod databases. right now they exist for s6 only, I presume it wasn't using es before [07:11:15] I am going to create a task [07:11:30] (PS3) KartikMistry: testwiki: Enable Section Translation for 3 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/929621 (https://phabricator.wikimedia.org/T338123) [07:13:59] Created https://phabricator.wikimedia.org/T339079 [07:14:34] thanks [07:15:23] I think this definitely needs to be looked into soon, but not sure if it warrants paging someone [07:16:20] I will wait till 08:00 AM UTC [07:16:26] And if not I'll page him [07:16:30] (so 45 minutes) [07:16:33] does that sound good? [07:16:51] sgtm [07:16:53] marostegui: right now, it seems to be creating <50K/min (so a few megabytes per hour) so it seems like that 5G will last a while [07:17:12] (of course no idea if it really produces logs at a linear rate) [07:18:12] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:929741|Enable Content and Section Translation for a 2nd group of 9 languages previously lacking machine translation (T337669)]] (duration: 13m 35s) [07:18:16] T337669: Enable MinT, Content and Section Translation for a 2nd group of 10 languages previously lacking machine translation - https://phabricator.wikimedia.org/T337669 [07:18:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:18:41] is T339079 really related to moveToExternal? I thought taavi said that about the disk space issue [07:18:42] T339079: wikitech intermittent DB errors - https://phabricator.wikimedia.org/T339079 [07:19:07] tgr_: Ah, then I got confused. taavi can you confirm?
[07:19:41] I don't see how mwmaint running out of disk space would produce 'could not connect to any replica DB server; Connection timed out' errors on completely separate servers [07:19:41] !log kartik@deploy1002 Backport cancelled. [07:19:56] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/929621 (https://phabricator.wikimedia.org/T338123) (owner: KartikMistry) [07:20:00] so yes, I'm pretty confident that these are two separate errors [07:20:28] removed that line from the task [07:20:31] thank you both [07:21:15] sorry, no I mean that I do think that the ongoing moveToExternal.php stuff is probably causing the wikitech issues, and then the disk space issues are separate [07:21:17] (Merged) jenkins-bot: testwiki: Enable Section Translation for 3 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/929621 (https://phabricator.wikimedia.org/T338123) (owner: KartikMistry) [07:21:36] ok [07:21:47] !log kartik@deploy1002 Started scap: Backport for [[gerrit:929621|testwiki: Enable Section Translation for 3 Wikipedias (T338123)]] [07:21:50] T338123: Enable MinT, Content and Section Translation for a 4th group of languages previously lacking machine translation - https://phabricator.wikimedia.org/T338123 [07:22:08] Yeah, I was super confused about the error related to mwmaint1002 but I was like...maybe I am missing something [07:23:18] (CR) Marostegui: [C: +1] Allow the IDMs on access m5's dbproxy [puppet] - https://gerrit.wikimedia.org/r/929952 (https://phabricator.wikimedia.org/T338008) (owner: Muehlenhoff) [07:23:30] !log kartik@deploy1002 kartik: Backport for [[gerrit:929621|testwiki: Enable Section Translation for 3 Wikipedias (T338123)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [07:23:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE
pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:23:42] the mwmaint error could be caused by moveToExternal producing lots of logs. Could be any other long-running script of course. [07:24:06] oh true [07:24:35] I don't get how moveToExternal would affect whether wikitech can access the DB on a web request. [07:24:54] I am glad I am not the only one confused here :) [07:24:55] wikitech doesn't use ExternalStore at all, does it? [07:25:03] tgr_: it's really fun [07:25:13] tgr_: well running that script causes wikitech suddenly to use externalstore when it did not before [07:25:22] It's starting to use them [07:25:25] and because it did not use it before, the firewall rules and database grants to do so are missing [07:26:02] We shouldn't store blobs of edit content in core tables [07:26:18] Thanks for mwmaint. I compress the logs [07:29:00] (CR) Muehlenhoff: C:idm::deployment add missing packages. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/929951 (owner: Slyngshede) [07:29:56] so I guess whether or not you can edit wikitech depends on which page you happen to try editing? [07:30:31] and the caches [07:31:03] (PS1) Majavah: P:mariadb: allow wikitech to connect to es* [puppet] - https://gerrit.wikimedia.org/r/929956 (https://phabricator.wikimedia.org/T339079) [07:31:35] although I get a DB timeout for any page, it seems [07:31:41] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:929621|testwiki: Enable Section Translation for 3 Wikipedias (T338123)]] (duration: 09m 54s) [07:31:45] T338123: Enable MinT, Content and Section Translation for a 4th group of languages previously lacking machine translation - https://phabricator.wikimedia.org/T338123 [07:32:04] I'm done with my patches..
[07:32:46] !log test [07:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:53] (CR) Majavah: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41714/console" [puppet] - https://gerrit.wikimedia.org/r/929956 (https://phabricator.wikimedia.org/T339079) (owner: Majavah) [07:33:33] I am confused. I can't edit the SAL page but apparently stashbot can. [07:33:53] some kind of IP based DB load balancing and I happen to be on an unlucky IP? [07:35:14] (CR) Ladsgroup: "It looks good to me but before merge, let me check if Manuel is happy with it." [puppet] - https://gerrit.wikimedia.org/r/929956 (https://phabricator.wikimedia.org/T339079) (owner: Majavah) [07:35:29] (CR) Ladsgroup: "I'd deploy the grant changes on es hosts" [puppet] - https://gerrit.wikimedia.org/r/929956 (https://phabricator.wikimedia.org/T339079) (owner: Majavah) [07:36:09] (PS4) Muehlenhoff: Allow the IDMs on access m5's dbproxy [puppet] - https://gerrit.wikimedia.org/r/929952 (https://phabricator.wikimedia.org/T338008) [07:37:48] (CR) Marostegui: "I really don't like this hack (including the one for s6), but I guess we have no other option for now."
[puppet] - https://gerrit.wikimedia.org/r/929956 (https://phabricator.wikimedia.org/T339079) (owner: Majavah) [07:39:27] (CR) Ladsgroup: P:mariadb: allow wikitech to connect to es* (1 comment) [puppet] - https://gerrit.wikimedia.org/r/929956 (https://phabricator.wikimedia.org/T339079) (owner: Majavah) [07:40:12] (CR) Majavah: [V: +1] P:mariadb: allow wikitech to connect to es* (1 comment) [puppet] - https://gerrit.wikimedia.org/r/929956 (https://phabricator.wikimedia.org/T339079) (owner: Majavah) [07:40:35] !log ayounsi@cumin2002 START - Cookbook sre.network.debug for Netbox circuit ID 29 [07:40:42] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 29 [07:42:21] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:44:49] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 124 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:46:52] !log backporting https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/929966 (can't edit wikitech due to DB issues) [07:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:05] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:47:33] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/929952 (https://phabricator.wikimedia.org/T338008) (owner: Muehlenhoff) [07:47:35] (CR) Ladsgroup: "okay, I'm going to deploy this now."
[puppet] - https://gerrit.wikimedia.org/r/929956 (https://phabricator.wikimedia.org/T339079) (owner: Majavah) [07:47:48] (CR) Ladsgroup: [C: +2] P:mariadb: allow wikitech to connect to es* [puppet] - https://gerrit.wikimedia.org/r/929956 (https://phabricator.wikimedia.org/T339079) (owner: Majavah) [07:48:00] (CR) TrainBranchBot: [C: +2] "Approved by tgr@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - https://gerrit.wikimedia.org/r/929966 (https://phabricator.wikimedia.org/T338934) (owner: Gergő Tisza) [07:48:50] (CR) Majavah: [C: -1] Allow the IDMs on access m5's dbproxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/929952 (https://phabricator.wikimedia.org/T338008) (owner: Muehlenhoff) [07:50:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:51:32] (PS1) Slyngshede: Requirements: Cleanup requirements files.
[software/bitu] - 10https://gerrit.wikimedia.org/r/929960 [07:53:53] (03PS1) 10Gehel: query_service: migrate WDQS to profile::java [puppet] - 10https://gerrit.wikimedia.org/r/929962 (https://phabricator.wikimedia.org/T264181) [07:55:01] (03PS5) 10Muehlenhoff: Allow the IDMs to access m5 [puppet] - 10https://gerrit.wikimedia.org/r/929952 (https://phabricator.wikimedia.org/T338008) [07:55:03] (03CR) 10Muehlenhoff: Allow the IDMs to access m5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929952 (https://phabricator.wikimedia.org/T338008) (owner: 10Muehlenhoff) [07:57:24] so I added grants on eqiad es hosts in es4 and es5 [07:57:27] RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:57:33] it should fix it [07:58:03] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:58:09] tgr_: does things work for you now? [07:59:11] (03CR) 10Marostegui: [C: 03+1] hiera: remove ms-be104[0-3] from profile::swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/929730 (https://phabricator.wikimedia.org/T335281) (owner: 10MVernon) [07:59:35] (03PS1) 10Elukey: role::cache::{text,upload}: move ulsfo varnishkafkas to PKI [puppet] - 10https://gerrit.wikimedia.org/r/929963 (https://phabricator.wikimedia.org/T337825) [07:59:40] (03CR) 10MVernon: [C: 03+2] hiera: remove ms-be104[0-3] from profile::swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/929730 (https://phabricator.wikimedia.org/T335281) (owner: 10MVernon) [08:00:06] jnuche and jeena: How many deployers does it take to do MediaWiki train - Utc-0+Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230614T0800). 
[08:00:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:01:14] (03CR) 10Muehlenhoff: [C: 03+2] Allow the IDMs to access m5 [puppet] - 10https://gerrit.wikimedia.org/r/929952 (https://phabricator.wikimedia.org/T338008) (owner: 10Muehlenhoff) [08:01:37] Amir1: thanks, that seems to have fixed it [08:02:08] Emperor: I'll puppet-merge your swift change along? [08:02:11] taavi made the patch [08:02:16] jnuche: jeena: sorry, I'm running over with the backports, we had some unrelated issues [08:02:47] tgr_: no worries, I'll wait until the wikitech/DBs issues are solved [08:03:47] the wikitech issue is fixed I think. But I have two backports which will prevent some features from breaking when the train gets to group1. [08:04:06] I guess I should just force merge to speed things up. [08:04:46] (03CR) 10Gergő Tisza: [V: 03+2] "Speeding things up, we had some issues with the backport window" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929966 (https://phabricator.wikimedia.org/T338934) (owner: 10Gergő Tisza) [08:05:22] tgr_: don't rush things because of the train, it can wait for a bit :) [08:05:29] !log tgr@deploy1002 Started scap: Backport for [[gerrit:929966|Structured tasks: Fix toolbar rewriting (T338934)]] [08:05:33] T338934: [betalabs] Duplicate Publish button for Structured tasks - https://phabricator.wikimedia.org/T338934 [08:06:24] the CI for those patches takes 20-30 minutes usually. 
[08:07:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10elukey) 05Resolved→03Open [08:07:38] !log tgr@deploy1002 tgr: Backport for [[gerrit:929966|Structured tasks: Fix toolbar rewriting (T338934)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [08:07:45] the train window is 2h, I think it should work [08:07:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10elukey) Remaining steps: reimage the node [08:08:25] and there's an another blocker (T338927) which doesn't even have a fix in master yet [08:08:25] T338927: RecentChanges missing arrow to expand collapsed entries - https://phabricator.wikimedia.org/T338927 [08:09:43] taavi: is someone working on that? [08:10:00] not according to the task [08:10:37] let's just revert it then [08:15:06] (03PS1) 10Gergő Tisza: Revert "jquery.makeCollapsible: Use `unset: all` on buttons" [core] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929969 (https://phabricator.wikimedia.org/T333357) [08:15:25] (03PS17) 10Slyngshede: P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) [08:15:31] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [08:16:25] (03CR) 10Gergő Tisza: [C: 03+2] "backporting, train blocker" [core] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929969 (https://phabricator.wikimedia.org/T333357) (owner: 10Gergő Tisza) [08:16:51] RECOVERY - Disk space on mwmaint1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwmaint1002&var-datasource=eqiad+prometheus/ops [08:18:21] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:929966|Structured tasks: Fix toolbar rewriting (T338934)]] (duration: 12m 52s) [08:18:25] T338934: [betalabs] Duplicate Publish button for Structured tasks - https://phabricator.wikimedia.org/T338934 [08:18:42] (03CR) 10CI reject: [V: 04-1] P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [08:19:03] (03CR) 10Alexandros Kosiaris: [C: 03+1] modules: Add preStop sleep and draining to mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/928791 (https://phabricator.wikimedia.org/T338210) (owner: 10Clément Goubert) [08:20:10] moritzm: sorry, yes [08:20:49] PROBLEM - Host ps1-a4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [08:20:59] PROBLEM - Host ps1-c6-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [08:22:09] ack, done [08:23:15] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [08:24:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy1002 using scap backport" [core] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929969 (https://phabricator.wikimedia.org/T333357) (owner: 10Gergő Tisza) [08:25:00] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Pretty good overall, thanks! Couple of minor pedantic comments and we should be good to go." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/929706 (https://phabricator.wikimedia.org/T338210) (owner: 10Clément Goubert) [08:31:26] (03PS18) 10Slyngshede: P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) [08:33:06] (03PS3) 10Stevemunene: analytics: Remove analytics58_60 from the HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/928479 (https://phabricator.wikimedia.org/T338408) [08:33:16] (03CR) 10Clément Goubert: [C: 03+2] kubernetes: Bump envoy image version to 1.18.3-2-s2 [puppet] - 10https://gerrit.wikimedia.org/r/929678 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [08:33:44] (03CR) 10CI reject: [V: 04-1] P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [08:35:15] (03PS19) 10Slyngshede: P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) [08:35:49] (03CR) 10CI reject: [V: 04-1] P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [08:36:39] (KeyholderUnarmed) resolved: 2 unarmed Keyholder key(s) on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [08:37:03] (03PS20) 10Slyngshede: P:hive::client move beeline script to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) [08:37:40] (03CR) 10Clément Goubert: "Thanks for the review 😊" [deployment-charts] - 10https://gerrit.wikimedia.org/r/929706 (https://phabricator.wikimedia.org/T338210) (owner: 10Clément Goubert) [08:37:42] (03PS4) 10Stevemunene: analytics: Remove analytics58_60 from the HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/928479 (https://phabricator.wikimedia.org/T338408) [08:38:43] (03CR) 10Jbond: [C: 03+1] "not tested but lgtm" [software/homer] - 10https://gerrit.wikimedia.org/r/929333 (https://phabricator.wikimedia.org/T337082) (owner: 10Ayounsi) [08:39:39] (03Merged) 10jenkins-bot: Revert "jquery.makeCollapsible: Use `unset: all` on buttons" [core] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929969 (https://phabricator.wikimedia.org/T333357) (owner: 10Gergő Tisza) [08:40:06] !log tgr@deploy1002 Started scap: Backport for [[gerrit:929969|Revert "jquery.makeCollapsible: Use `unset: all` on buttons" (T333357 T338927)]] [08:40:11] T338927: RecentChanges missing arrow to expand collapsed entries - https://phabricator.wikimedia.org/T338927 [08:40:12] T333357: Please add role=button in Collapsible Elements to a.mw-collapsible-text - https://phabricator.wikimedia.org/T333357 [08:41:28] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/929718 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [08:41:51] !log tgr@deploy1002 tgr: Backport for [[gerrit:929969|Revert "jquery.makeCollapsible: Use `unset: all` on buttons" (T333357 T338927)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [08:45:21] (03CR) 10Slyngshede: P:hive::client move beeline script to files. 
(036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [08:45:27] PROBLEM - Check systemd state on ms-be1044 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:46:56] (03CR) 10Jbond: Allow the IDMs to access m5 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/929952 (https://phabricator.wikimedia.org/T338008) (owner: 10Muehlenhoff) [08:47:58] 10SRE, 10SRE-swift-storage, 10Community-Tech, 10MediaWiki-Parser, and 3 others: Show SVGs in page language if available - https://phabricator.wikimedia.org/T205040 (10Winston_Sung) [08:48:21] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:929969|Revert "jquery.makeCollapsible: Use `unset: all` on buttons" (T333357 T338927)]] (duration: 08m 14s) [08:48:26] T338927: RecentChanges missing arrow to expand collapsed entries - https://phabricator.wikimedia.org/T338927 [08:48:26] T333357: Please add role=button in Collapsible Elements to a.mw-collapsible-text - https://phabricator.wikimedia.org/T333357 [08:50:06] (03CR) 10Gergő Tisza: [V: 03+2 C: 03+2] "Backport - force merging to speed things up" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929967 (https://phabricator.wikimedia.org/T339046) (owner: 10Gergő Tisza) [08:50:42] (03CR) 10Jbond: "See inline" [puppet] - 10https://gerrit.wikimedia.org/r/929787 (https://phabricator.wikimedia.org/T338279) (owner: 10JHathaway) [08:51:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:51:08] (03PS1) 10Fabfur: hiera: Consolidate http redirection directive across all DCs [puppet] - 
10https://gerrit.wikimedia.org/r/929989 (https://phabricator.wikimedia.org/T323557) [08:51:12] !log Restarting Zuul to apply config change for T309376 [08:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:16] T309376: Zuul jenkins-bot user holding open SSH sessions - https://phabricator.wikimedia.org/T309376 [08:51:22] !log tgr@deploy1002 Started scap: Backport for [[gerrit:929967|Section images: Pass section parameters to VE in add image tasks (T339046)]] [08:51:26] T339046: Rejecting a GrowthExperiments image recommendation breaks the editor - https://phabricator.wikimedia.org/T339046 [08:51:32] (03CR) 10CI reject: [V: 04-1] hiera: Consolidate http redirection directive across all DCs [puppet] - 10https://gerrit.wikimedia.org/r/929989 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [08:51:39] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/929989 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [08:52:29] (03PS2) 10Fabfur: hiera: Consolidate http redirection directive across all DCs [puppet] - 10https://gerrit.wikimedia.org/r/929989 (https://phabricator.wikimedia.org/T323557) [08:53:05] !log tgr@deploy1002 tgr: Backport for [[gerrit:929967|Section images: Pass section parameters to VE in add image tasks (T339046)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [08:54:13] and rolling back [08:54:29] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/929989 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [08:55:37] (03CR) 10Elukey: [C: 03+1] analytics: Remove analytics58_60 from the HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/928479 (https://phabricator.wikimedia.org/T338408) (owner: 10Stevemunene) [08:56:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:56:07] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [08:56:49] (03PS21) 10Slyngshede: P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) [08:57:00] (03CR) 10CI reject: [V: 04-1] P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [08:58:11] !log Rolling back Zuul config change and restarting Zuul to clear ssh connections [08:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:18] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:929967|Section images: Pass section parameters to VE in add image tasks (T339046)]] (duration: 07m 55s) [08:59:22] T339046: Rejecting a GrowthExperiments image recommendation breaks the editor - https://phabricator.wikimedia.org/T339046 [09:00:08] jnuche: done, thanks for your patience. [09:00:25] !log UTC morning deploys done [09:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:31] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: make oauth client identifier configurable [puppet] - 10https://gerrit.wikimedia.org/r/929718 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [09:02:14] tgr_: thanks a lot for those backports [09:03:49] hashar: is zuul healthy atm? I don't see anything in the queue [09:04:10] nope :( [09:04:50] damn :( will wait a bit longer for the train then [09:04:52] I broke it somehow but no idea how [09:04:56] OH man the train [09:05:02] can I help with anything? 
[09:05:06] (03CR) 10Jbond: [C: 04-1] Create a CDN host reboot cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) (owner: 10BCornwall) [09:05:16] (03PS22) 10Slyngshede: P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) [09:05:22] I rolled back the integration/config patch which sould be sufficient [09:05:53] I guess the zuul config needs a rollback as well [09:06:33] 10SRE, 10ops-eqiad, 10database-backups: db1139 rebooted - https://phabricator.wikimedia.org/T338766 (10jcrespo) @Jclark-ctr The server is down and alerts disabled, in case you need to open it or do any servicing. [09:07:02] manually rolled back and restarting Zuul again [09:07:49] jnuche: it is back! [09:07:54] (03PS10) 10Muehlenhoff: Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) [09:08:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10elukey) @Jclark-ctr I tried to reimage the node but `ipmitool` didn't work, so I tried to reset the bmc locally but still no luck. I can access to the mgmt console and run racadm commands,... [09:09:01] (03CR) 10Vgutierrez: "looks good but hosts list for PCC is not enough.. it should cover one server per cluster and DC" [puppet] - 10https://gerrit.wikimedia.org/r/929989 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [09:09:21] (03PS1) 10Hashar: Revert "zuul: add a gerrit-reporter gerrit connection" [puppet] - 10https://gerrit.wikimedia.org/r/929971 (https://phabricator.wikimedia.org/T309376) [09:09:32] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to analytics-product-users for KCVelaga (WMF) - https://phabricator.wikimedia.org/T337766 (10KCVelaga_WMF) 05Open→03Resolved It worked! 
Thank you so much @BTullis @cmooney and @mpopov [09:09:49] hashar: neat! thanks :) [09:09:55] rolling out train in a minute [09:10:29] jnuche: I ended up rolling back all my Zuul related changes :-\ [09:10:44] I tested it with test/gerrit-ping.git and it seems to behave properly again [09:10:47] (03CR) 10Muehlenhoff: Add a define to declare an nftables set in Puppet (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [09:11:09] !log zuul: rolled back config changes for T309376 and restarted Zuul. CI is back up. [09:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:14] T309376: Zuul jenkins-bot user holding open SSH sessions - https://phabricator.wikimedia.org/T309376 [09:11:35] sry to read that, it's too bad we don't have a testing env for gerrit [09:12:25] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1044 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:12:38] (03CR) 10Jbond: [C: 04-1] Create a CDN host reboot cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) (owner: 10BCornwall) [09:12:46] (03PS23) 10Slyngshede: P:hive::client move beeline script to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) [09:13:15] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929993 (https://phabricator.wikimedia.org/T337527) [09:13:17] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929993 (https://phabricator.wikimedia.org/T337527) (owner: 10TrainBranchBot) [09:14:02] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929993 (https://phabricator.wikimedia.org/T337527) (owner: 10TrainBranchBot) [09:15:40] (03CR) 10Clément Goubert: [C: 03+2] Revert "zuul: add a gerrit-reporter gerrit connection" [puppet] - 10https://gerrit.wikimedia.org/r/929971 (https://phabricator.wikimedia.org/T309376) (owner: 10Hashar) [09:18:43] (03CR) 10Slyngshede: P:hive::client move beeline script to files. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [09:21:01] (03PS1) 10Vgutierrez: admin: Update vgutierrez@yubikey5 key [puppet] - 10https://gerrit.wikimedia.org/r/929994 (https://phabricator.wikimedia.org/T336769) [09:21:05] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:21:15] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.13 refs T337527 [09:21:20] T337527: 1.41.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T337527 [09:21:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:21:36] 
!log installing php7.4 security updates [09:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:11] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:22:14] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) >>! In T307357#8928451, @cmooney wrote: > @aborrero I discussed the idea of a [[ https://wikitech... [09:24:11] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:26:35] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:28:13] !log jnuche@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.13 refs T337527 (duration: 06m 56s) [09:28:17] T337527: 1.41.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T337527 [09:28:43] (03CR) 10Jbond: Add a define to declare an nftables set in Puppet (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [09:28:46] (03PS3) 10Fabfur: hiera: Consolidate http redirection directive across all DCs [puppet] - 10https://gerrit.wikimedia.org/r/929989 (https://phabricator.wikimedia.org/T323557) [09:28:53] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10cmooney) >>! 
In T307357#8930600, @aborrero wrote: > I think that's the `query-local-address`option. Upstre... [09:29:10] (03CR) 10CI reject: [V: 04-1] hiera: Consolidate http redirection directive across all DCs [puppet] - 10https://gerrit.wikimedia.org/r/929989 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [09:31:50] (03PS4) 10Fabfur: hiera: Consolidate http redirection directive across all DCs [puppet] - 10https://gerrit.wikimedia.org/r/929989 (https://phabricator.wikimedia.org/T323557) [09:32:31] (03PS2) 10AikoChou: Declare mediawiki.page_outlink_topic_prediction_change.v1 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923571 (https://phabricator.wikimedia.org/T328899) [09:32:43] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:38:11] (03CR) 10Marostegui: Allow the IDMs to access m5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929952 (https://phabricator.wikimedia.org/T338008) (owner: 10Muehlenhoff) [09:46:08] 10SRE-Sprint-Week-Sustainability-March2023, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10ayounsi) [09:46:33] 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (10ayounsi) 05In progress→03Resolved a:03ayounsi With row D upgraded, I couldn... [09:46:49] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) >>! In T307357#8930604, @cmooney wrote: > >> Could you describe the setup you have in mind? Woul... 
[09:47:23] (03PS1) 10Elukey: Revert "ml-services: remove falcon llm deployment" [deployment-charts] - 10https://gerrit.wikimedia.org/r/929978 [09:47:54] (03PS2) 10Elukey: Revert "ml-services: remove falcon llm deployment" [deployment-charts] - 10https://gerrit.wikimedia.org/r/929978 [09:48:53] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/929989 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [09:49:15] (03PS2) 10Elukey: role::cache::{text,upload}: move ulsfo varnishkafkas to PKI [puppet] - 10https://gerrit.wikimedia.org/r/929963 (https://phabricator.wikimedia.org/T337825) [09:51:21] (03CR) 10Elukey: [C: 03+2] Revert "ml-services: remove falcon llm deployment" [deployment-charts] - 10https://gerrit.wikimedia.org/r/929978 (owner: 10Elukey) [09:52:28] (03PS1) 10Zabe: Stop setting $wgCommentTempTableSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929997 (https://phabricator.wikimedia.org/T299954) [09:52:43] (03PS1) 10Vgutierrez: users: Replace vgutierrez RSA key with an ed25519 one [homer/public] - 10https://gerrit.wikimedia.org/r/929998 (https://phabricator.wikimedia.org/T336769) [09:52:55] (03CR) 10Zabe: [C: 04-2] "wmf.13 needs to be everywhere first" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929997 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [09:52:56] I'm rolling back the train due to https://phabricator.wikimedia.org/T339094 [09:54:13] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[09:55:04] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929999 (https://phabricator.wikimedia.org/T337527) [09:55:06] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929999 (https://phabricator.wikimedia.org/T337527) (owner: 10TrainBranchBot) [09:55:49] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929999 (https://phabricator.wikimedia.org/T337527) (owner: 10TrainBranchBot) [09:58:45] (03PS1) 10AikoChou: ml-services: update outlink stream name and config autoscaling [deployment-charts] - 10https://gerrit.wikimedia.org/r/930000 (https://phabricator.wikimedia.org/T328899) [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230614T1000) [10:00:20] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [10:01:41] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [10:02:14] (03PS1) 10Hnowlan: handler.images: await poolcounter release [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930001 (https://phabricator.wikimedia.org/T337649) [10:03:26] !log mvernon@cumin1001 START - Cookbook sre.hosts.decommission for hosts ms-be[1040-1043].eqiad.wmnet [10:03:35] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.13 refs T337527 [10:03:39] T337527: 1.41.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T337527 [10:05:00] (03PS1) 10AikoChou: changeprop: add outlink stream to changeprop prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/930002 (https://phabricator.wikimedia.org/T328899) [10:06:24] 10SRE, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10hnowlan) In some cases I think this is a manifestation of T337649 
as many of the files on officewiki are PDFs and the like, but there is something else at work here. I somewhat suspect poolcounter... [10:06:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:07:11] group1 wikis are back to wmf.12, I need to step out for a bit [10:08:08] (03CR) 10CI reject: [V: 04-1] handler.images: await poolcounter release [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930001 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [10:09:00] (03CR) 10Kamila Součková: [C: 04-1] "You cannot use await in a synchronous function, you have make the parent function `async` too and bubble it up..." [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930001 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [10:11:05] !log disable cr1<->row D link for link migration - T313463 [10:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:09] T313463: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 [10:11:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:12:47] (03CR) 10TheDJ: "This looks pretty good as an intermediate solution I think. 
If zh-Hans or zh-Hant are particularly common (I could imagine), then we could" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/923368 (https://phabricator.wikimedia.org/T337139) (owner: 10Hnowlan) [10:22:01] (03PS1) 10Slyngshede: Allow the IDMs on access m5's dbproxy [puppet] - 10https://gerrit.wikimedia.org/r/930004 [10:22:29] !log mvernon@cumin1001 START - Cookbook sre.dns.netbox [10:22:48] 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission ms-be104[0-3].eqiad.wmnet - https://phabricator.wikimedia.org/T339100 (10MatthewVernon) [10:23:06] (03PS2) 10Slyngshede: Allow the IDMs on access m5's dbproxy [puppet] - 10https://gerrit.wikimedia.org/r/930004 [10:24:44] (03PS3) 10Slyngshede: Allow the IDMs on access m5's dbproxy [puppet] - 10https://gerrit.wikimedia.org/r/930004 [10:24:52] !log mvernon@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ms-be[1040-1043].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin1001" [10:27:13] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41722/console" [puppet] - 10https://gerrit.wikimedia.org/r/929989 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [10:29:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:30:36] !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ms-be[1040-1043].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin1001" [10:30:37] !log mvernon@cumin1001 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [10:30:37] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-be[1040-1043].eqiad.wmnet [10:30:47] 10SRE-swift-storage: Drain and then decommission ms-be10[40-43] - https://phabricator.wikimedia.org/T335281 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin1001 for hosts: `ms-be[1040-1043].eqiad.wmnet` - ms-be1040.eqiad.wmnet (**WARN**) - Downtimed host on Icinga/Alertmanager... [10:32:34] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:32:37] 10SRE-swift-storage: Q4 ms backend refresh work (KR) - https://phabricator.wikimedia.org/T335270 (10MatthewVernon) [10:32:40] 10SRE-swift-storage: Drain and then decommission ms-be10[40-43] - https://phabricator.wikimedia.org/T335281 (10MatthewVernon) 05Open→03Resolved Nodes decommissioned, handover to DC-Ops in T339100. [10:33:11] 10SRE-swift-storage: Q4 ms backend refresh work (KR) - https://phabricator.wikimedia.org/T335270 (10MatthewVernon) 05Open→03Resolved All done :-) [10:34:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:34:50] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10serviceops-collab, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) GitLab replica service urls seem to be allowed for OIDC admin login now. However I get the same fronted error `Application Not A... 
[10:34:55] 10SRE-Sprint-Week-Sustainability-March2023, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10ayounsi) cr1<->row D is now operational on the new 40G link @Jclark-ctr Those 4 SMF cables can now be remo... [10:34:59] 10SRE, 10Maps: Allow Wikimedia Maps usage on vikidia.org - https://phabricator.wikimedia.org/T339102 (10MSantos) [10:35:44] (03CR) 10Marostegui: [C: 03+1] "I think this could work" [puppet] - 10https://gerrit.wikimedia.org/r/930004 (owner: 10Slyngshede) [10:39:21] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/930004 (owner: 10Slyngshede) [10:40:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:40:16] !log eqiad row D, move VRRP primary to cr1 - T313463 [10:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:20] T313463: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 [10:41:01] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog-Deprecated, 10Traffic, 10Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10MSantos) @Galessandroni and @Elitre per https://wikitech.wikimedia.org/wiki/Maps/External_usage I filled T339102 wh... [10:41:37] (03PS4) 10Slyngshede: Allow the IDMs on access m5's dbproxy [puppet] - 10https://gerrit.wikimedia.org/r/930004 [10:42:13] (03CR) 10Daniel Kinzler: [C: 03+1] "The new code makes a lot of sense to me, but I was unable to test this locally." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/925654 (owner: 10Tim Starling) [10:42:15] (03PS5) 10Clément Goubert: mediawiki: Gracefully handle termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/929706 (https://phabricator.wikimedia.org/T338210) [10:44:31] (03CR) 10Slyngshede: [C: 03+2] Allow the IDMs on access m5's dbproxy [puppet] - 10https://gerrit.wikimedia.org/r/930004 (owner: 10Slyngshede) [10:45:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:46:07] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/929751 [10:49:42] 10SRE, 10Maps: Allow Wikimedia Maps usage on vikidia.org - https://phabricator.wikimedia.org/T339102 (10Elitre) @valerio.bozzolan may provide the affiliate support statement. [10:51:44] (03CR) 10Klausman: [C: 03+1] ml-services: update outlink stream name and config autoscaling [deployment-charts] - 10https://gerrit.wikimedia.org/r/930000 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [10:51:46] (03CR) 10Klausman: [C: 03+1] changeprop: add outlink stream to changeprop prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/930002 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [10:58:04] (03CR) 10Ladsgroup: [C: 03+1] "Thanks. It's good to go." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/925654 (owner: 10Tim Starling) [11:00:17] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: update outlink stream name and config autoscaling [deployment-charts] - 10https://gerrit.wikimedia.org/r/930000 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [11:00:27] (03CR) 10Ilias Sarantopoulos: [C: 03+1] changeprop: add outlink stream to changeprop prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/930002 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [11:00:49] !log disable cr2<->row D link for link migration - T313463 [11:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:53] T313463: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 [11:03:55] (03CR) 10Ladsgroup: [C: 04-2] "I prefer wmf.15 being deployed and stable everywhere first. There is no rush." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929997 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [11:04:32] 10SRE, 10Maps: Allow Wikimedia Maps usage on vikidia.org - https://phabricator.wikimedia.org/T339102 (10valerio.bozzolan) I find this request fully legitimate, since Vikidia is a somewhat famous wiki-family, at least in Italy where I have more focus. The project is oriented to young people, in an area of contr... 
[11:05:20] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "fix mgmt - jbond@cumin1001" [11:06:23] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "fix mgmt - jbond@cumin1001" [11:07:14] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 202, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:08:46] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:09:02] 10SRE, 10Maps: Allow Wikimedia Maps usage on vikidia.org - https://phabricator.wikimedia.org/T339102 (10valerio.bozzolan) As additional note, Wikimedia Italy officially endorsed Vikidia's activities in some occasions, also with economic support, like this one: https://wiki.wikimedia.it/wiki/Associazione:Delib... [11:10:31] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41725/console" [puppet] - 10https://gerrit.wikimedia.org/r/929963 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [11:11:04] 10SRE, 10Maps: Allow Wikimedia Maps usage on vikidia.org - https://phabricator.wikimedia.org/T339102 (10TheDJ) Were they a pre-existing user (in that case, are there examples), or would they be a new user ? I think I would consider this one to be 'within the mission', although I guess by the technical defini... [11:11:14] 10SRE, 10Maps: Allow Wikimedia Maps usage on vikidia.org - https://phabricator.wikimedia.org/T339102 (10valerio.bozzolan) Small tip for @Galessandroni: to make SRE more happy, try adding a cute Task description here in Purpose/details. 
[11:11:23] !log eqiad row D, move VRRP primary back to cr2 - T313463 [11:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:28] T313463: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 [11:11:42] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) >>! In T307357#8930653, @aborrero wrote: > > I think I'm proposing this: > Talking with @taavi... [11:12:07] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10cmooney) >>! In T307357#8930653, @aborrero wrote: > My point is that we could go with the 2 public IPv4 add... [11:13:49] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) Ok, I think we are in the same page! [11:14:04] 10SRE-Sprint-Week-Sustainability-March2023, 10ops-eqiad, 10Infrastructure-Foundations: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10ayounsi) cr2<->row D is operational as well on the 40G link. @Jclark-ctr that brings to 8 total the SMF to be removed.... 
[11:14:21] (03CR) 10Hnowlan: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/929676 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [11:14:48] (03PS1) 10Hnowlan: thumbor: include poolcounter.failure metric [deployment-charts] - 10https://gerrit.wikimedia.org/r/930158 (https://phabricator.wikimedia.org/T337649) [11:16:10] (03CR) 10Hnowlan: [C: 03+2] thumbor: split expensive format poolcounter buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/929676 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [11:16:54] (03Merged) 10jenkins-bot: thumbor: split expensive format poolcounter buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/929676 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [11:17:09] (03PS11) 10Muehlenhoff: Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) [11:17:16] 10SRE-Sprint-Week-Sustainability-March2023, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10ayounsi) [11:17:45] (03CR) 10CI reject: [V: 04-1] Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:20:35] (03CR) 10AikoChou: [C: 03+2] ml-services: update outlink stream name and config autoscaling [deployment-charts] - 10https://gerrit.wikimedia.org/r/930000 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [11:21:31] (03Merged) 10jenkins-bot: ml-services: update outlink stream name and config autoscaling [deployment-charts] - 10https://gerrit.wikimedia.org/r/930000 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [11:22:54] (03CR) 10AikoChou: [C: 03+2] changeprop: add outlink stream to changeprop prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/930002 
(https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [11:22:55] 10SRE, 10Maps: Allow Wikimedia Maps usage on vikidia.org - https://phabricator.wikimedia.org/T339102 (10valerio.bozzolan) [11:23:55] (03Merged) 10jenkins-bot: changeprop: add outlink stream to changeprop prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/930002 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [11:24:26] (03PS4) 10Arturo Borrero Gonzalez: openstack: codfw1dev: designate: listen-on only the new address [puppet] - 10https://gerrit.wikimedia.org/r/929740 (https://phabricator.wikimedia.org/T338938) [11:28:35] (03CR) 10Cathal Mooney: "This is the change we should ultimately make, however we can't merge this until the NS entries for codfw1dev.wikimedia.cloud point to the " [puppet] - 10https://gerrit.wikimedia.org/r/929740 (https://phabricator.wikimedia.org/T338938) (owner: 10Arturo Borrero Gonzalez) [11:29:18] PROBLEM - MariaDB Replica Lag: s4 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 950.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:29:42] Amir1: ^ [11:30:06] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] openstack: codfw1dev: designate: listen-on only the new address (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929740 (https://phabricator.wikimedia.org/T338938) (owner: 10Arturo Borrero Gonzalez) [11:39:45] (03CR) 10Muehlenhoff: Add a define to declare an nftables set in Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:41:11] (03PS12) 10Muehlenhoff: Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) [11:41:35] (03CR) 10CI reject: [V: 04-1] Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) 
(owner: 10Muehlenhoff) [11:42:16] (03CR) 10AikoChou: Declare mediawiki.page_outlink_topic_prediction_change.v1 stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923571 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [11:48:08] (03PS13) 10Muehlenhoff: Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) [11:48:32] (03CR) 10CI reject: [V: 04-1] Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:55:50] (03PS1) 10Ladsgroup: Fix cases of LogicException in $update->getParserOutputForMetaData() [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929980 (https://phabricator.wikimedia.org/T339094) [11:58:49] jouncebot: nowandnext [11:58:49] No deployments scheduled for the next 1 hour(s) and 1 minute(s) [11:58:49] In 1 hour(s) and 1 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230614T1300) [11:58:58] marostegui: sorry I missed the ping [11:59:01] I downtime it [11:59:07] thanks [11:59:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb1021.eqiad.wmnet with reason: T337961 [11:59:40] T337961: Clean up clouddb1021 - https://phabricator.wikimedia.org/T337961 [11:59:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1021.eqiad.wmnet with reason: T337961 [12:00:11] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "fix mgmt for cloudservices2004-dev - jbond@cumin1001" [12:00:38] (03CR) 10Ladsgroup: [C: 03+2] Fix cases of LogicException in $update->getParserOutputForMetaData() [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929980 
(https://phabricator.wikimedia.org/T339094) (owner: 10Ladsgroup) [12:01:28] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "fix mgmt for cloudservices2004-dev - jbond@cumin1001" [12:05:18] 10SRE-tools, 10Infrastructure-Foundations, 10netbox: netbox: decided how to deal with blank mgmt dns_names - https://phabricator.wikimedia.org/T339121 (10jbond) p:05Triage→03Medium [12:06:41] (03PS1) 10Slyngshede: Allow IDM mariadb access, ferm rules. [puppet] - 10https://gerrit.wikimedia.org/r/930164 [12:07:55] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw1492.eqiad.wmnet [12:10:18] (03CR) 10Slyngshede: "I believe the firewall rules may have failed because of the wrong host names I entered originally." [puppet] - 10https://gerrit.wikimedia.org/r/930164 (owner: 10Slyngshede) [12:10:41] 10SRE-tools, 10Infrastructure-Foundations, 10netbox: netbox: decided how to deal with blank mgmt dns_names - https://phabricator.wikimedia.org/T339121 (10cmooney) I'm somewhat assuming what happened here but I think this is correct. The background here is that the decom cookbook removes the DNS name against... [12:12:28] (03PS1) 10Jbond: sre.puppet.sync-netbox-hiera: also skip failed hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/930165 (https://phabricator.wikimedia.org/T339121) [12:13:19] (03CR) 10Ladsgroup: handler.images: await poolcounter release (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930001 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [12:13:53] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/930165 (https://phabricator.wikimedia.org/T339121) (owner: 10Jbond) [12:14:18] (03PS14) 10Muehlenhoff: Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) [12:14:42] (03CR) 10CI reject: [V: 04-1] Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:16:13] (03PS15) 10Muehlenhoff: Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) [12:16:43] 10SRE-tools, 10Infrastructure-Foundations, 10netbox, 10Patch-For-Review: netbox: decided how to deal with blank mgmt dns_names - https://phabricator.wikimedia.org/T339121 (10jbond) > As discussed on irc the answer might be for the sync cookbook to ignore devices in 'failed' i have sent a change to do this... 
[12:17:43] (03PS2) 10Jbond: sre.puppet.sync-netbox-hiera: also skip failed hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/930165 (https://phabricator.wikimedia.org/T339121) [12:18:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929980 (https://phabricator.wikimedia.org/T339094) (owner: 10Ladsgroup) [12:19:27] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [12:20:54] (03Merged) 10jenkins-bot: Fix cases of LogicException in $update->getParserOutputForMetaData() [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929980 (https://phabricator.wikimedia.org/T339094) (owner: 10Ladsgroup) [12:21:22] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:929980|Fix cases of LogicException in $update->getParserOutputForMetaData() (T339094)]] [12:21:25] jnuche: the backport will go forward in a minute, wanna roll forward afterwards? 
[12:21:26] T339094: LogicException: Must call prepareContent() or prepareUpdate() before calling MediaWiki\Storage\DerivedPageDataUpdater::getRenderedRevision - https://phabricator.wikimedia.org/T339094 [12:21:51] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts mw1492.eqiad.wmnet [12:22:36] Amir1: will do, thanks a lot [12:22:54] ^_^ [12:23:09] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:929980|Fix cases of LogicException in $update->getParserOutputForMetaData() (T339094)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [12:23:35] (03CR) 10Ayounsi: [C: 03+1] sre.puppet.sync-netbox-hiera: also skip failed hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/930165 (https://phabricator.wikimedia.org/T339121) (owner: 10Jbond) [12:23:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10Jclark-ctr) @elukey ` racadm>>racadm config ERROR: RAC1281: Unable to run the command because an invalid command is entered. The command "racadm config" entered is not supported o... [12:24:01] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM. I slightly worry that host['ip_addresses'][0] won't always be the mgmt/primary IP (some servers like LVS don't have dns name on all" [cookbooks] - 10https://gerrit.wikimedia.org/r/930165 (https://phabricator.wikimedia.org/T339121) (owner: 10Jbond) [12:24:09] (03CR) 10Jbond: "LGTM FYI i think moritz has a similar cr out" [puppet] - 10https://gerrit.wikimedia.org/r/930164 (owner: 10Slyngshede) [12:25:55] (03CR) 10Jbond: [C: 03+2] sre.puppet.sync-netbox-hiera: also skip failed hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/930165 (https://phabricator.wikimedia.org/T339121) (owner: 10Jbond) [12:26:35] (03PS2) 10Slyngshede: Allow IDM mariadb access, ferm rules. 
[puppet] - 10https://gerrit.wikimedia.org/r/930164 [12:26:58] 10SRE, 10procurement: add contract end dates to the ops maint & contract gcal - https://phabricator.wikimedia.org/T84585 (10ayounsi) [12:27:23] (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:28:55] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/930164 (owner: 10Slyngshede) [12:29:00] (03CR) 10Slyngshede: Allow IDM mariadb access, ferm rules. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/930164 (owner: 10Slyngshede) [12:29:44] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:929980|Fix cases of LogicException in $update->getParserOutputForMetaData() (T339094)]] (duration: 08m 21s) [12:29:48] T339094: LogicException: Must call prepareContent() or prepareUpdate() before calling MediaWiki\Storage\DerivedPageDataUpdater::getRenderedRevision - https://phabricator.wikimedia.org/T339094 [12:31:46] going to roll the train forward in a couple mins [12:32:08] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/930164 (owner: 10Slyngshede) [12:33:22] (03CR) 10Slyngshede: [C: 03+2] Allow IDM mariadb access, ferm rules. 
[puppet] - 10https://gerrit.wikimedia.org/r/930164 (owner: 10Slyngshede) [12:33:54] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930168 (https://phabricator.wikimedia.org/T337527) [12:33:56] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930168 (https://phabricator.wikimedia.org/T337527) (owner: 10TrainBranchBot) [12:34:45] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930168 (https://phabricator.wikimedia.org/T337527) (owner: 10TrainBranchBot) [12:36:21] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] Add a define to declare an nftables set in Puppet (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:41:16] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] Add a define to declare an nftables set in Puppet (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:41:39] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.13 refs T337527 [12:41:44] T337527: 1.41.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T337527 [12:42:17] 10SRE, 10SRE-tools, 10Infrastructure-Foundations: reimage cookbook should exit cleanly if no puppet role is applied to a node - https://phabricator.wikimedia.org/T338990 (10jbond) p:05Triage→03Medium [12:45:43] 10SRE, 10Maps: Allow Wikimedia Maps usage on vikidia.org - https://phabricator.wikimedia.org/T339102 (10valerio.bozzolan) [12:47:50] !log jnuche@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.13 refs T337527 (duration: 06m 10s) [12:47:50] 10SRE, 10SRE-tools, 10Infrastructure-Foundations: reimage cookbook should exit cleanly if no puppet role is applied to a node - https://phabricator.wikimedia.org/T338990 (10jbond) The 
ultimate failure here is that the first puppet run failed. i belive that @Volans has looked at this in the past and its not... [12:47:54] T337527: 1.41.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T337527 [12:50:40] 10SRE, 10ops-eqiad, 10database-backups: db1139 rebooted - https://phabricator.wikimedia.org/T338766 (10Jclark-ctr) @jcrespo replaced Dimm 8 on processor 2 with comparable 32gb hp Dimm [12:50:44] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] Add a define to declare an nftables set in Puppet (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:50:45] 10SRE, 10Maps: Allow Wikimedia Maps usage on vikidia.org - https://phabricator.wikimedia.org/T339102 (10valerio.bozzolan) I would like to help @TheDJ but I didn't understand the "new user" in what. [12:54:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10elukey) @Jclark-ctr What I meant was if we have an alternative, or if this is the first time that the issue comes up :) If it is we'll need to find an alternative command for iDRAC 4.40+, i... [12:56:57] (03CR) 10Ottomata: Declare mediawiki.page_outlink_topic_prediction_change.v1 stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923571 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [12:57:00] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:57:13] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230614T1300). 
[13:00:05] No Gerrit patches in the queue for this window AFAICS. [13:00:24] yup, nothing to deploy [13:00:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:00:38] (also, I think the train was a bit delayed so it’s probably good to leave a bit of a break after it anyways) [13:01:09] (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: services: add support for BGP public IPv4 addresses [puppet] - 10https://gerrit.wikimedia.org/r/930170 (https://phabricator.wikimedia.org/T307357) [13:01:42] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:04:17] 10SRE, 10Infrastructure-Foundations, 10serviceops-radar, 10Patch-For-Review: Deal with archival of Stretch on Debian mirrors - https://phabricator.wikimedia.org/T335282 (10hashar) Karthoterian was still using `docker-registry.wikimedia.org/nodejs10-slim` and `docker-registry.wikimedia.org/nodejs10-devel` a... [13:04:52] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/930170 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [13:05:05] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[13:05:09] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41726/console" [puppet] - 10https://gerrit.wikimedia.org/r/929963 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [13:05:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:07:08] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:08:04] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:08:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:08:40] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:12:31] (03CR) 10Stevemunene: [C: 03+2] analytics: Remove analytics58_60 from the HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/928479 (https://phabricator.wikimedia.org/T338408) (owner: 10Stevemunene) [13:13:37] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:14:42] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:18:45] (03PS1) 10Alexandros Kosiaris: service::catalog: Depuplicate search service IPs [puppet] - 10https://gerrit.wikimedia.org/r/930175 [13:23:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10Jclark-ctr) @Papaul have you run into this before? i do see ipmi is enabled for idrac [13:23:36] (03PS3) 10AikoChou: Declare mediawiki.page_outlink_topic_prediction_change.v1 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923571 (https://phabricator.wikimedia.org/T328899) [13:27:25] (03PS1) 10RobH: adding no controller sku to disallowed [software] - 10https://gerrit.wikimedia.org/r/930176 [13:28:52] (03CR) 10Dzahn: "This is an example why I didn't like a merge without doing the restart right after." 
[puppet] - 10https://gerrit.wikimedia.org/r/929971 (https://phabricator.wikimedia.org/T309376) (owner: 10Hashar) [13:29:01] (03CR) 10RobH: [C: 03+2] adding no controller sku to disallowed [software] - 10https://gerrit.wikimedia.org/r/930176 (owner: 10RobH) [13:30:07] (03CR) 10AikoChou: Declare mediawiki.page_outlink_topic_prediction_change.v1 stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923571 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [13:32:08] (03CR) 10Jelto: [C: 03+2] miscweb: add transparencyreport release to miscweb staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/929667 (https://phabricator.wikimedia.org/T338781) (owner: 10Jelto) [13:32:55] (03Merged) 10jenkins-bot: miscweb: add transparencyreport release to miscweb staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/929667 (https://phabricator.wikimedia.org/T338781) (owner: 10Jelto) [13:32:57] (03PS1) 10Slyngshede: Correctly locate firewall type for IDM. [puppet] - 10https://gerrit.wikimedia.org/r/930177 [13:34:21] (03PS1) 10Alexandros Kosiaris: recommendation-api: Remove 84% of assigned capacity [deployment-charts] - 10https://gerrit.wikimedia.org/r/930178 (https://phabricator.wikimedia.org/T338471) [13:34:31] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10Jhancock.wm) @aborrero the patch changes have been made and the server is currently connected from eno1 to cloudsw ge-0/0/11 [13:36:08] (03PS16) 10Muehlenhoff: Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) [13:36:16] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:37:01] (03CR) 10Ottomata: "Naw, it's because they were created before we implemented canary events:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923571 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [13:37:46] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:37:47] (03CR) 10Elukey: [C: 03+1] recommendation-api: Remove 84% of assigned capacity [deployment-charts] - 10https://gerrit.wikimedia.org/r/930178 (https://phabricator.wikimedia.org/T338471) (owner: 10Alexandros Kosiaris) [13:38:24] (03CR) 10Muehlenhoff: Add a define to declare an nftables set in Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:38:52] (03CR) 10Slyngshede: "The dbproxies for M5 are in marked as cloud+lists in Puppet, so we need the IDM rules to be applied to that set of servers as well." [puppet] - 10https://gerrit.wikimedia.org/r/930177 (owner: 10Slyngshede) [13:39:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:40:59] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41728/console" [puppet] - 10https://gerrit.wikimedia.org/r/930177 (owner: 10Slyngshede) [13:41:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10Papaul) @Jclark-ctr check firmware version if old upgrade. 
If you can not access the IDRAC to check the firmware, reset the IDRAC first [13:42:05] !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [13:42:47] !log adjusting port buffer partition asw-a-codfw T284592 [13:42:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10RobH) 05Open→03Stalled IRC Update Summary: * Papaul attempted to install hosts, OS won't see the disks. * Disks show in bios, Rob pinged to dou... [13:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:50] T284592: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 [13:44:06] !log imported jenkins 2.401.1 to thirdparty/ci for buster-wikimedia [13:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:46:01] !log adjusting port buffer partition asw-b-codfw T284592 [13:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:33] !log adjusting port buffer partition asw-c-codfw T284592 [13:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:36] T284592: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 [13:50:44] (03CR) 10Marostegui: [C: 03+1] Correctly locate firewall type for IDM. 
[puppet] - 10https://gerrit.wikimedia.org/r/930177 (owner: 10Slyngshede) [13:52:36] !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [13:53:11] !log adjusting port buffer partition asw-d-codfw T284592 [13:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:24] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/930178 (https://phabricator.wikimedia.org/T338471) (owner: 10Alexandros Kosiaris) [13:56:06] (03Merged) 10jenkins-bot: recommendation-api: Remove 84% of assigned capacity [deployment-charts] - 10https://gerrit.wikimedia.org/r/930178 (https://phabricator.wikimedia.org/T338471) (owner: 10Alexandros Kosiaris) [13:57:24] !log adjusting port buffer partition asw2-ulsfo T284592 [13:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:27] T284592: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 [13:57:39] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/recommendation-api: sync [13:57:55] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: sync [13:58:21] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/recommendation-api: sync [13:58:24] !log adjusting port buffer partition asw1-eqsin T284592 [13:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:47] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/recommendation-api: sync [13:59:38] !log adjusting port buffer partition asw2-esams T284592 [13:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:03] (03PS1) 10Jelto: misweb: use different host http header for readiness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/930180 (https://phabricator.wikimedia.org/T338781) [14:03:48] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL 
- Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:04:04] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:05:38] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:07:34] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:08:10] (03CR) 10Ottomata: [C: 03+2] Declare mediawiki.page_outlink_topic_prediction_change.v1 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923571 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [14:08:39] (03CR) 10Cathal Mooney: [C: 03+2] Remove optional var to set COS buffers for QFX/EX switches [homer/public] - 10https://gerrit.wikimedia.org/r/929689 (https://phabricator.wikimedia.org/T284592) (owner: 10Cathal Mooney) [14:09:02] (03Merged) 10jenkins-bot: Declare mediawiki.page_outlink_topic_prediction_change.v1 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923571 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [14:09:22] (03Merged) 10jenkins-bot: Remove optional var to set COS buffers for QFX/EX switches [homer/public] - 10https://gerrit.wikimedia.org/r/929689 (https://phabricator.wikimedia.org/T284592) (owner: 10Cathal Mooney) [14:10:43] (03PS1) 10EoghanGaffney: gitlab: Add locking to backups [puppet] - 10https://gerrit.wikimedia.org/r/930182 [14:11:52] (03CR) 10CI reject: [V: 04-1] gitlab: Add locking to backups [puppet] - 10https://gerrit.wikimedia.org/r/930182 (owner: 10EoghanGaffney) [14:12:26] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply 
[14:12:37] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [14:12:58] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50135 bytes in 0.425 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:13:05] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [14:13:14] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.270 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:13:25] (03CR) 10Jelto: [C: 03+2] misweb: use different host http header for readiness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/930180 (https://phabricator.wikimedia.org/T338781) (owner: 10Jelto) [14:14:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10Jclark-ctr) @papaul firmware was a very old version i did update. idrac is reachable and has been the entire time. the issue is ipmitool is not working [14:14:29] (03Merged) 10jenkins-bot: misweb: use different host http header for readiness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/930180 (https://phabricator.wikimedia.org/T338781) (owner: 10Jelto) [14:14:43] 10SRE, 10Infrastructure-Foundations, 10netops: test_matching_vlan() function crashig in Netbox network report - https://phabricator.wikimedia.org/T339133 (10cmooney) p:05Triage→03Low [14:14:46] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:15:18] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [14:15:41] !log dns2006: updating gdnsd package [14:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:30] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [14:17:15] !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [14:17:34] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:41] !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [14:17:43] (03PS17) 10Muehlenhoff: Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) [14:18:10] (03CR) 10Hnowlan: [C: 03+2] thumbor: include poolcounter.failure metric [deployment-charts] - 10https://gerrit.wikimedia.org/r/930158 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [14:19:01] (03Merged) 10jenkins-bot: thumbor: include poolcounter.failure metric [deployment-charts] - 10https://gerrit.wikimedia.org/r/930158 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [14:19:21] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [14:19:47] (03CR) 10Muehlenhoff: Add a define to declare an nftables set in Puppet (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:19:58] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:20:23] 10SRE, 10Traffic: Package and 
deploy ATS 9.2.1 - https://phabricator.wikimedia.org/T339134 (10ssingh) [14:21:34] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:21:41] (03PS1) 10Jbond: tlsproxy::envoy: update support for profile::tlsproxy::envoy::services [puppet] - 10https://gerrit.wikimedia.org/r/930184 (https://phabricator.wikimedia.org/T326657) [14:21:43] (03PS1) 10Jbond: promethus: switch to using cfssl [puppet] - 10https://gerrit.wikimedia.org/r/930185 (https://phabricator.wikimedia.org/T326657) [14:22:36] !log otto@deploy1002 Synchronized wmf-config/ext-EventStreamConfig.php: EventStreamConfig - Declare mediawiki.page_outlink_topic_prediction_change.v1 stream - T328899 (duration: 10m 25s) [14:22:39] T328899: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 [14:23:01] (03CR) 10Arturo Borrero Gonzalez: Add a define to declare an nftables set in Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:23:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Add a define to declare an nftables set in Puppet (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:23:57] (03CR) 10CI reject: [V: 04-1] tlsproxy::envoy: update support for profile::tlsproxy::envoy::services [puppet] - 10https://gerrit.wikimedia.org/r/930184 (https://phabricator.wikimedia.org/T326657) (owner: 10Jbond) [14:26:34] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:27:11] (03PS1) 10Jbond: promethus: switch to using cfssl [puppet] - 10https://gerrit.wikimedia.org/r/930187 
(https://phabricator.wikimedia.org/T326657) [14:27:30] (03PS2) 10Jbond: promethus: switch to using cfssl [puppet] - 10https://gerrit.wikimedia.org/r/930187 (https://phabricator.wikimedia.org/T326657) [14:29:28] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [14:29:40] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [14:30:18] (03PS2) 10Jbond: tlsproxy::envoy: update support for profile::tlsproxy::envoy::services [puppet] - 10https://gerrit.wikimedia.org/r/930184 (https://phabricator.wikimedia.org/T326657) [14:30:20] (03PS2) 10Jbond: promethus: switch to using cfssl [puppet] - 10https://gerrit.wikimedia.org/r/930185 (https://phabricator.wikimedia.org/T326657) [14:30:38] 10SRE, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10hnowlan) Thumb.php hides the error because these are private wikis, but ultimately this results in `Too many thumbnail requests` which is another poolcounter throttle even for inexpensive formats. 
[14:30:50] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [14:32:17] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [14:32:59] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [14:33:30] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [14:35:18] (03CR) 10Jbond: Add a define to declare an nftables set in Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:35:27] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41731/console" [puppet] - 10https://gerrit.wikimedia.org/r/930185 (https://phabricator.wikimedia.org/T326657) (owner: 10Jbond) [14:35:44] (03PS1) 10Jelto: miscweb: add transparencyreport release to eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/930188 (https://phabricator.wikimedia.org/T338781) [14:36:44] !log mwscript findBadBlobs.php --wiki=nlwiki --revisions 880583,880584,880585,880586,880587,880588,880589,880590,880591,880592,880593,880594,880595,880596,880597,880598,880599,880600,880601,880602 --mark "T128154" [14:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:48] T128154: Migrate all old DB rows from windows-1252 to UTF-8 on nlwiki - https://phabricator.wikimedia.org/T128154 [14:36:53] 10SRE, 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10Jhancock.wm) [14:37:16] (03PS1) 10Albertoleoncio: Enable Extension:Translate on pt.wikisource.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930189 (https://phabricator.wikimedia.org/T339139) [14:37:55] 10SRE, 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10Jhancock.wm) 05In 
progress→03Resolved servers have been removed from the racks but left in the hot aisle of row D. they will be moved to storage after the recy... [14:38:04] (03PS18) 10Muehlenhoff: Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) [14:38:28] (03CR) 10CI reject: [V: 04-1] Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:39:18] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:38] jouncebot: nowandnext [14:39:38] No deployments scheduled for the next 2 hour(s) and 20 minute(s) [14:39:39] In 2 hour(s) and 20 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230614T1700) [14:39:45] (03PS19) 10Muehlenhoff: Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) [14:40:17] I’ll deploy a security patch in a few minutes if no one stops me [14:40:50] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:11] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10SRE Observability (FY2022/2023-Q4): Logstash SLO excursion on 2023-02-11 - https://phabricator.wikimedia.org/T331461 (10lmata) > I propose we define a different SLO that measures the availability of Logstash or the user-facing components. This seemed... 
[14:43:01] (03PS1) 10Gehel: query_service: align all hiera configuration to the same order [puppet] - 10https://gerrit.wikimedia.org/r/930191 [14:44:29] (03PS1) 10Ladsgroup: Remove nlwiki from windows-1252 encoding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930192 (https://phabricator.wikimedia.org/T128154) [14:44:43] going ahead [14:45:43] (03PS1) 10Jbond: nftables: add spec test [puppet] - 10https://gerrit.wikimedia.org/r/930193 [14:46:16] (03CR) 10CI reject: [V: 04-1] nftables: add spec test [puppet] - 10https://gerrit.wikimedia.org/r/930193 (owner: 10Jbond) [14:47:33] (03PS2) 10Hnowlan: handler.images: await poolcounter release [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930001 (https://phabricator.wikimedia.org/T337649) [14:48:24] (03PS2) 10Gehel: query_service: align all hiera configuration to the same order [puppet] - 10https://gerrit.wikimedia.org/r/930191 [14:48:33] (03CR) 10Hnowlan: handler.images: await poolcounter release (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930001 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [14:48:53] (03CR) 10Gehel: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/929962 (https://phabricator.wikimedia.org/T264181) (owner: 10Gehel) [14:48:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10Papaul) @Jclark-ctr can you check that it is enable in the idrac [14:49:04] (03CR) 10Gehel: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/930191 (owner: 10Gehel) [14:49:20] (03CR) 10CI reject: [V: 04-1] handler.images: await poolcounter release [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930001 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [14:49:50] (03CR) 10Muehlenhoff: Add a define to declare an nftables set in Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) 
(owner: 10Muehlenhoff) [14:50:04] 10SRE, 10DNS, 10Domains, 10Traffic: Update DNS records for mastodon.wikimedia.org - https://phabricator.wikimedia.org/T337586 (10BCornwall) 05In progress→03Invalid Thanks for that extra bit of information, @Mschon. I'm going to close this as INVALID until there's something here to service. [14:50:18] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:50:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10Jclark-ctr) @Papaul it is enabled in idrac [14:52:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:53:17] !log lucaswerkmeister-wmde: Deployed security patch for T250720 [14:54:04] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:55:05] (03PS3) 10Hnowlan: handler.images: remove async from poolcounter release [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930001 (https://phabricator.wikimedia.org/T337649) [14:55:25] (03PS3) 10Gehel: query_service: align all hiera configuration to the same order [puppet] - 10https://gerrit.wikimedia.org/r/930191 [14:56:35] (03CR) 10CI reject: [V: 04-1] handler.images: remove async from poolcounter release [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930001 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [14:57:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:58:03] (03PS4) 10Hnowlan: handler.images: remove async from poolcounter release [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930001 (https://phabricator.wikimedia.org/T337649) [14:59:03] (03CR) 10Gehel: "still multiple issues according to PCC" [puppet] - 10https://gerrit.wikimedia.org/r/930191 (owner: 10Gehel) [14:59:21] (03CR) 10Gehel: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/930191 (owner: 10Gehel) [14:59:24] (03CR) 10CI reject: [V: 04-1] handler.images: remove async from poolcounter release [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930001 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [15:00:06] (03PS1) 10Jaime Nuche: jenkins: add secret with doc rsync pass to releases instance [labs/private] - 10https://gerrit.wikimedia.org/r/930195 (https://phabricator.wikimedia.org/T336168) [15:00:20] !log lucaswerkmeister-wmde: Deployed security patch for T250720 [15:01:29] (03CR) 10Gehel: "PCC seems reasonable. It would make sense to disable puppet and test on a single node first, including a restart of blazegraph and the upd" [puppet] - 10https://gerrit.wikimedia.org/r/929962 (https://phabricator.wikimedia.org/T264181) (owner: 10Gehel) [15:01:52] 10SRE, 10ops-eqiad, 10database-backups: db1139 rebooted - https://phabricator.wikimedia.org/T338766 (10jcrespo) Thank you, @Jclark-ctr , that was quick! I am going to put it back into service test it is ok, and I will resolve the issue then. 
[15:03:40] (03PS2) 10EoghanGaffney: gitlab: Add locking to backups [puppet] - 10https://gerrit.wikimedia.org/r/930182 [15:04:04] * Lucas_WMDE done [15:04:04] (03CR) 10CI reject: [V: 04-1] gitlab: Add locking to backups [puppet] - 10https://gerrit.wikimedia.org/r/930182 (owner: 10EoghanGaffney) [15:05:13] (03PS3) 10EoghanGaffney: gitlab: Add locking to backups [puppet] - 10https://gerrit.wikimedia.org/r/930182 [15:05:42] (03CR) 10CI reject: [V: 04-1] gitlab: Add locking to backups [puppet] - 10https://gerrit.wikimedia.org/r/930182 (owner: 10EoghanGaffney) [15:06:33] (03PS4) 10Gehel: query_service: align all hiera configuration to the same order [puppet] - 10https://gerrit.wikimedia.org/r/930191 [15:07:49] (03PS4) 10EoghanGaffney: gitlab: Add locking to backups [puppet] - 10https://gerrit.wikimedia.org/r/930182 [15:08:17] (03CR) 10CI reject: [V: 04-1] gitlab: Add locking to backups [puppet] - 10https://gerrit.wikimedia.org/r/930182 (owner: 10EoghanGaffney) [15:09:38] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:10:03] (03PS3) 10JHathaway: apt: ensure profile::apt is applied before packages. [puppet] - 10https://gerrit.wikimedia.org/r/929787 (https://phabricator.wikimedia.org/T338279) [15:12:25] (03PS5) 10EoghanGaffney: gitlab: Add locking to backups [puppet] - 10https://gerrit.wikimedia.org/r/930182 [15:12:27] (03CR) 10CI reject: [V: 04-1] apt: ensure profile::apt is applied before packages. [puppet] - 10https://gerrit.wikimedia.org/r/929787 (https://phabricator.wikimedia.org/T338279) (owner: 10JHathaway) [15:12:35] (03PS4) 10JHathaway: apt: ensure profile::apt is applied before packages. [puppet] - 10https://gerrit.wikimedia.org/r/929787 (https://phabricator.wikimedia.org/T338279) [15:16:17] (03CR) 10JHathaway: apt: ensure profile::apt is applied before packages. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929787 (https://phabricator.wikimedia.org/T338279) (owner: 10JHathaway) [15:17:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10Papaul) @elukey racadm config no longer works you need to use racadm set [15:18:59] (03PS5) 10Hnowlan: handler.images: remove async from poolcounter release [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930001 (https://phabricator.wikimedia.org/T337649) [15:21:48] (03PS9) 10BCornwall: Create a CDN host reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) [15:22:13] (03CR) 10BCornwall: Create a CDN host reboot cookbook (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) (owner: 10BCornwall) [15:23:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10elukey) @Papaul yep yep, I was wondering if dcops had any sussgestion about racadm set, never used it and I don't find an alternative command to use.. [15:24:23] (03PS1) 10Elukey: ml-services: set readinessProbe settings for falcon-7b [deployment-charts] - 10https://gerrit.wikimedia.org/r/930200 [15:25:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10Papaul) @elukey what was it you was trying to do? I see you last comment said re-image failed and you was going to reset the password on the node I am right? 
[15:25:28] (03CR) 10CI reject: [V: 04-1] Create a CDN host reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) (owner: 10BCornwall) [15:26:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10elukey) @Papaul basically try https://wikitech.wikimedia.org/wiki/Management_Interfaces#Did_you_do_a_reset_but_still_getting_IPMI_connection_failed_(when_using_the_reimage_cookbook) to see... [15:28:18] (03CR) 10Daimona Eaytoy: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/925921 (owner: 10Daimona Eaytoy) [15:28:31] (03PS10) 10BCornwall: Create a CDN host reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) [15:36:18] (03PS1) 10JHathaway: Revert "dev env: don't manage resolv.conf" [puppet] - 10https://gerrit.wikimedia.org/r/929983 [15:39:07] (03CR) 10JHathaway: [C: 03+2] Revert "dev env: don't manage resolv.conf" [puppet] - 10https://gerrit.wikimedia.org/r/929983 (owner: 10JHathaway) [15:40:21] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@0c82f2d] (releasing): (no justification provided) [15:42:25] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@0c82f2d] (releasing): (no justification provided) (duration: 02m 03s) [15:42:34] (03PS20) 10Jbond: Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [15:44:29] (03PS21) 10Jbond: Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [15:45:22] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [15:46:37] (03CR) 10CI reject: [V: 04-1] Add a define to declare an nftables set in Puppet 
[puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [15:51:16] PROBLEM - Host mw1492 is DOWN: PING CRITICAL - Packet loss = 100% [15:52:54] RECOVERY - Host mw1492 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [15:54:07] 1492 seems to have been problematic the last few days [15:54:31] (03CR) 10Arturo Borrero Gonzalez: Add a define to declare an nftables set in Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [15:54:31] brett: the reboot was papaul [15:54:40] well then [15:54:57] thanks for the clarification :^) [15:54:58] (03PS11) 10Jbond: Create a CDN host reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) (owner: 10BCornwall) [15:55:02] (03PS1) 10Jbond: sre.__init__: add min_grace_sleep class param [cookbooks] - 10https://gerrit.wikimedia.org/r/930205 [15:55:03] See #wikimedia-dcops if you're in [15:55:30] But it was depooled already so elukey told him he was fine to go ahead and work on it. No new issues. [15:55:54] (03CR) 10BCornwall: [C: 03+1] "Thanks for this!" [cookbooks] - 10https://gerrit.wikimedia.org/r/930205 (owner: 10Jbond) [15:56:33] (03PS1) 10Stang: simplewiki: Remove "changetags" from registered user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930206 (https://phabricator.wikimedia.org/T339124) [15:57:11] (03PS2) 10Stang: simplewiki: Remove "changetags" from registered user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930206 (https://phabricator.wikimedia.org/T339124) [15:57:52] 10SRE-tools, 10Infrastructure-Foundations: Add --depool-sleep runtime argument when using SRELBBatchRunner class - https://phabricator.wikimedia.org/T339151 (10BCornwall) [16:00:13] 10SRE, 10ops-eqiad, 10database-backups: db1139 rebooted - https://phabricator.wikimedia.org/T338766 (10jcrespo) Looking good, about to resolve the ticket. 
@Ladsgroup As I told you on IRC, the data on this host (s1 and s2) was rolled back to midnight, 10 of June and may need reapplication of some maintenance... [16:00:41] (03PS1) 10Jcrespo: Revert "mariadb: Disable notifications for db1139 after crash" [puppet] - 10https://gerrit.wikimedia.org/r/929984 [16:00:55] (03PS22) 10Jbond: Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [16:02:54] (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb: Disable notifications for db1139 after crash" [puppet] - 10https://gerrit.wikimedia.org/r/929984 (owner: 10Jcrespo) [16:07:02] 10SRE, 10ops-eqiad, 10database-backups: db1139 rebooted - https://phabricator.wikimedia.org/T338766 (10jcrespo) 05Open→03Resolved [16:07:08] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC looks right: https://puppet-compiler.wmflabs.org/output/930170/41732/" [puppet] - 10https://gerrit.wikimedia.org/r/930170 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [16:07:31] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] openstack: codfw1dev: services: add support for BGP public IPv4 addresses [puppet] - 10https://gerrit.wikimedia.org/r/930170 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [16:09:14] (03PS2) 10Elukey: ml-services: set readinessProbe settings for falcon-7b [deployment-charts] - 10https://gerrit.wikimedia.org/r/930200 (https://phabricator.wikimedia.org/T334583) [16:09:16] (03PS1) 10Elukey: kserve-inference: refactor the predictor's container settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/930209 (https://phabricator.wikimedia.org/T334583) [16:09:32] (03PS12) 10BCornwall: Create a CDN host reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) [16:10:05] (03PS2) 10JHathaway: profile::base: add param to not manage resolv.conf [puppet] - 
10https://gerrit.wikimedia.org/r/929737 (https://phabricator.wikimedia.org/T337972) (owner: 10Jbond) [16:11:25] (03CR) 10JHathaway: "John, I updated the patch a bit, if you could review that would be great." [puppet] - 10https://gerrit.wikimedia.org/r/929737 (https://phabricator.wikimedia.org/T337972) (owner: 10Jbond) [16:11:56] (03PS13) 10BCornwall: Create a CDN host reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) [16:14:04] (03Abandoned) 10JHathaway: dev env: don't pull firewall rules from etcd [puppet] - 10https://gerrit.wikimedia.org/r/928662 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:15:39] 10SRE, 10ops-eqiad, 10database-backups: db1139 rebooted - https://phabricator.wikimedia.org/T338766 (10Ladsgroup) Thanks I will re-run the schema changes on it ASAP [16:16:25] (03PS1) 10Arturo Borrero Gonzalez: cloudservices2005-dev: fix DNS address [puppet] - 10https://gerrit.wikimedia.org/r/930210 (https://phabricator.wikimedia.org/T307357) [16:18:11] (03CR) 10BCornwall: sre.__init__: add min_grace_sleep class param (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/930205 (owner: 10Jbond) [16:18:16] (03CR) 10BCornwall: [C: 04-1] sre.__init__: add min_grace_sleep class param [cookbooks] - 10https://gerrit.wikimedia.org/r/930205 (owner: 10Jbond) [16:20:58] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10aborrero) a:05Papaul→03aborrero [16:23:38] (03CR) 10Elukey: "This change needs the following one, I'll merge both at the same time :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/930209 (https://phabricator.wikimedia.org/T334583) (owner: 10Elukey) [16:31:44] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 
(10aborrero) a:05aborrero→03Jhancock.wm Hey @Jhancock.wm and @Papaul : https://netbox.wikimedia.org/dcim/devices/4143/interface... [16:34:24] (03PS2) 10JHathaway: dev env: Add param to disable puppet agent timer [puppet] - 10https://gerrit.wikimedia.org/r/928665 (https://phabricator.wikimedia.org/T337972) [16:35:12] (03CR) 10JHathaway: "reworked patch, kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/928665 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:35:41] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM. No need to do it that way though, if leaving .25 and ns0 on cloudservices2005-dev means less downtime we can keep it that way. Pha" [puppet] - 10https://gerrit.wikimedia.org/r/930210 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [16:36:22] (03Abandoned) 10JHathaway: dev env: allow setting $site via an env var [puppet] - 10https://gerrit.wikimedia.org/r/928663 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:36:52] (03PS1) 10Arturo Borrero Gonzalez: cloudservices2004-dev: put into service with new setup [puppet] - 10https://gerrit.wikimedia.org/r/930212 (https://phabricator.wikimedia.org/T338778) [16:37:15] (03PS2) 10Arturo Borrero Gonzalez: cloudservices2005-dev: fix DNS address [puppet] - 10https://gerrit.wikimedia.org/r/930210 (https://phabricator.wikimedia.org/T307357) [16:39:51] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10aborrero) a:05Jhancock.wm→03aborrero >>! In T338778#8932260, @aborrero wrote: > Hey @Jhancock.wm and @Papaul : > > https://n... 
[16:43:32] (03PS1) 10Hnowlan: api-gateway: add device-analytics service [deployment-charts] - 10https://gerrit.wikimedia.org/r/930214 (https://phabricator.wikimedia.org/T338916) [16:46:59] (03PS1) 10Hnowlan: trafficserver: add route for device-analytics service [puppet] - 10https://gerrit.wikimedia.org/r/930216 (https://phabricator.wikimedia.org/T338916) [16:47:30] (03PS1) 10Krinkle: webperf: Enable `logToSyslogCee` option for Excimer UI [puppet] - 10https://gerrit.wikimedia.org/r/930217 (https://phabricator.wikimedia.org/T339137) [16:47:44] (03PS2) 10Krinkle: webperf: Enable `logToSyslogCee` option for Excimer UI [puppet] - 10https://gerrit.wikimedia.org/r/930217 (https://phabricator.wikimedia.org/T339137) [16:48:01] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/930217 (https://phabricator.wikimedia.org/T339137) (owner: 10Krinkle) [16:48:10] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw1492.mgmt.eqiad.wmnet with reboot policy FORCED [16:48:34] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T338333 (10phaultfinder) [16:51:02] (03PS1) 10Effie Mouzeli: iPoid: add network rules to m5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/930218 (https://phabricator.wikimedia.org/T336163) [16:51:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw1492.mgmt.eqiad.wmnet with reboot policy FORCED [16:52:59] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2021.codfw.wmnet with OS bullseye [16:53:11] (03CR) 10Krinkle: "PCC confirms boolean indeed encodes as expected, wasn't sure whether it'd take bools correctly or not." [puppet] - 10https://gerrit.wikimedia.org/r/930217 (https://phabricator.wikimedia.org/T339137) (owner: 10Krinkle) [16:53:35] (03CR) 10Krinkle: "Underlying handling of this option isn't deployed yet, but harmless to roll out anytime ahead of that." 
[puppet] - 10https://gerrit.wikimedia.org/r/930217 (https://phabricator.wikimedia.org/T339137) (owner: 10Krinkle) [16:56:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10Papaul) @elukey @Jclark-ctr this is fixed now. I set the IDRAC to factory defaults and ran the provision cookbook. You can close the task if nothing else is left to do. Thanks ` pt1979@cumin1001:~$... [16:59:11] (03PS1) 10Ottomata: Bump primary schema repo version to get new page/change schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/930219 (https://phabricator.wikimedia.org/T325315) [16:59:30] (03CR) 10Ottomata: [C: 03+2] Bump primary schema repo version to get new page/change schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/930219 (https://phabricator.wikimedia.org/T325315) (owner: 10Ottomata) [17:00:06] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230614T1700) [17:01:28] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [17:01:57] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [17:02:16] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [17:02:54] (03CR) 10EoghanGaffney: [C: 03+1] "Credential added to private.git" [labs/private] - 10https://gerrit.wikimedia.org/r/930195 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [17:03:00] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [17:03:27] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [17:04:10] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [17:06:00] (03CR) 10Ilias Sarantopoulos: [C: 03+1] kserve-inference: refactor the predictor's container settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/930209 
(https://phabricator.wikimedia.org/T334583) (owner: 10Elukey) [17:06:56] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: set readinessProbe settings for falcon-7b (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/930200 (https://phabricator.wikimedia.org/T334583) (owner: 10Elukey) [17:11:45] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10cmooney) >>! In T338778#8932283, @aborrero wrote: >>>! In T338778#8932260, @aborrero wrote: >> Hey @Jhancock.wm and @Papaul : >>... [17:16:12] 10SRE, 10Infrastructure-Foundations, 10netops: Packet Drops on Eqiad ASW -> CR uplinks - https://phabricator.wikimedia.org/T291627 (10cmooney) [17:16:36] 10SRE, 10Infrastructure-Foundations, 10netops: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10cmooney) 05Open→03Resolved a:03cmooney Change is now live on all relevant Juniper devices. 
[17:21:43] (03PS1) 10Cathal Mooney: Remove cloudsw2-c8-eqiad from Homer static YAML definitions [homer/public] - 10https://gerrit.wikimedia.org/r/930220 (https://phabricator.wikimedia.org/T338459) [17:23:28] (03CR) 10Cathal Mooney: [C: 03+2] Remove cloudsw2-c8-eqiad from Homer static YAML definitions [homer/public] - 10https://gerrit.wikimedia.org/r/930220 (https://phabricator.wikimedia.org/T338459) (owner: 10Cathal Mooney) [17:24:07] (03Merged) 10jenkins-bot: Remove cloudsw2-c8-eqiad from Homer static YAML definitions [homer/public] - 10https://gerrit.wikimedia.org/r/930220 (https://phabricator.wikimedia.org/T338459) (owner: 10Cathal Mooney) [17:33:32] (03PS1) 10Cathal Mooney: Modify network report to get prefixes for all vlans before checks [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/930222 (https://phabricator.wikimedia.org/T321704) [17:34:18] (03CR) 10CI reject: [V: 04-1] Modify network report to get prefixes for all vlans before checks [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/930222 (https://phabricator.wikimedia.org/T321704) (owner: 10Cathal Mooney) [17:36:58] (03PS2) 10Cathal Mooney: Modify network report to get prefixes for all vlans before checks [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/930222 (https://phabricator.wikimedia.org/T321704) [17:37:41] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:37:58] (03CR) 10Jaime Nuche: [C: 03+2] jenkins: add secret with doc rsync pass to releases instance [labs/private] - 10https://gerrit.wikimedia.org/r/930195 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [17:42:17] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* 
blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:42:25] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:42:45] (03CR) 10Jaime Nuche: [V: 03+2] jenkins: add secret with doc rsync pass to releases instance [labs/private] - 10https://gerrit.wikimedia.org/r/930195 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [17:42:59] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:43:09] (03CR) 10Jaime Nuche: [V: 03+2 C: 03+2] jenkins: add secret with doc rsync pass to releases instance [labs/private] - 10https://gerrit.wikimedia.org/r/930195 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [17:46:59] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:47:35] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:50:01] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:50:39] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:54:15] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:00:07] jnuche and jeena: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Train log triage with CPT . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230614T1800). 
[18:00:07] jnuche and jeena: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230614T1800) [18:11:06] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T330930 (10Papaul) 05Open→03Resolved @Jclark-ctr this is fix i setup the second interface of cloudswift1001 to xe-0/0/18. so closing this task ` papaul@cloudsw1-c8-eqiad> show interfaces descriptions xe-0/0/18 Interf... [18:11:40] !log installing libssh security updates on buster [18:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:34] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:21:59] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:29:11] !log milimetric@deploy1002 Started deploy [airflow-dags/analytics@3d6caed]: Deploying mostly to rerun druid loading for mediawiki history reduced [18:29:20] !log milimetric@deploy1002 Finished deploy [airflow-dags/analytics@3d6caed]: Deploying mostly to rerun druid loading for mediawiki history reduced (duration: 00m 09s) [18:36:57] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:37:37] PROBLEM - SSH on wdqs1012 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:39:11] RECOVERY - SSH on wdqs1012 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring 
[18:41:39] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:41:44] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Traffic: Abstract LVS restart using cookbook - https://phabricator.wikimedia.org/T334166 (10BCornwall) Is this to say that the existing cookbook already suffices? If so, it sounds like the actionable here is to update the documentation to reflect Clement'... [18:43:57] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:56:23] (03CR) 10Herron: Add missing build dependencies for the Debian package (031 comment) [software/librenms] - 10https://gerrit.wikimedia.org/r/928659 (https://phabricator.wikimedia.org/T278309) (owner: 10Andrea Denisse) [19:03:13] (03PS1) 10Ssingh: Release 9.2.1-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/930230 (https://phabricator.wikimedia.org/T339134) [19:08:00] (03CR) 10Ladsgroup: [C: 03+1] "le sigh." 
[software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930001 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [19:20:48] (03PS1) 10Effie Mouzeli: iPoid: helmfile.d updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/930233 (https://phabricator.wikimedia.org/T336163) [19:21:41] (03CR) 10Effie Mouzeli: [C: 03+2] iPoid: add network rules to m5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/930218 (https://phabricator.wikimedia.org/T336163) (owner: 10Effie Mouzeli) [19:22:34] (03Merged) 10jenkins-bot: iPoid: add network rules to m5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/930218 (https://phabricator.wikimedia.org/T336163) (owner: 10Effie Mouzeli) [19:23:29] (03PS2) 10Effie Mouzeli: iPoid: helmfile.d updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/930233 (https://phabricator.wikimedia.org/T336163) [19:26:34] (03CR) 10CI reject: [V: 04-1] Release 9.2.1-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/930230 (https://phabricator.wikimedia.org/T339134) (owner: 10Ssingh) [19:30:20] (03CR) 10Effie Mouzeli: [C: 03+2] iPoid: helmfile.d updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/930233 (https://phabricator.wikimedia.org/T336163) (owner: 10Effie Mouzeli) [19:31:23] (03Merged) 10jenkins-bot: iPoid: helmfile.d updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/930233 (https://phabricator.wikimedia.org/T336163) (owner: 10Effie Mouzeli) [19:32:27] PROBLEM - SSH on wdqs1012 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:33:57] RECOVERY - SSH on wdqs1012 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:38:11] 10SRE, 10DNS, 10Domains, 10Traffic: Update DNS records for mastodon.wikimedia.org - https://phabricator.wikimedia.org/T337586 (10NMariano-WMF) 05Invalid→03Open [19:38:31] PROBLEM - SSH on wdqs1012 is CRITICAL: Server answer: 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:42:38] (03PS3) 10Herron: profile::pyrra::filesystem: add profile [puppet] - 10https://gerrit.wikimedia.org/r/929731 (https://phabricator.wikimedia.org/T302995) [19:44:37] RECOVERY - SSH on wdqs1012 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:46:13] 10SRE, 10DNS, 10Domains, 10Traffic: Update DNS records for mastodon.wikimedia.org - https://phabricator.wikimedia.org/T337586 (10NMariano-WMF) Adding @Abit @Bmueller & @AAlikhan for visibility into this and to answer any questions or concerns that may come up. The Comms team is requesting a subdomain of... [19:49:09] PROBLEM - SSH on wdqs1012 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:50:51] 10SRE, 10DNS, 10Domains, 10Traffic: Update DNS records for mastodon.wikimedia.org - https://phabricator.wikimedia.org/T337586 (10Dzahn) Per above, I suggest to rename this ticket to buy a domain like wikimedia.social. That would be via the Legal team. Then once you have that you can let us know and we (SRE... [19:51:15] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Mike_Peel) Still happening. "The MediaWiki error backend-fail-internal occured: An unknown error occurre... [19:51:26] (03CR) 10JHathaway: apt: ensure profile::apt is applied before packages. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929787 (https://phabricator.wikimedia.org/T338279) (owner: 10JHathaway) [19:53:41] RECOVERY - SSH on wdqs1012 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:58:13] PROBLEM - SSH on wdqs1012 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: Dear deployers, time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230614T2000). [20:00:05] koi and albertoleoncio: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:15] hi [20:00:26] hi [20:00:47] I can deploy [20:01:29] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on otrs1001.eqiad.wmnet with reason: Replacing Host [20:01:41] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on otrs1001.eqiad.wmnet with reason: Replacing Host [20:02:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930206 (https://phabricator.wikimedia.org/T339124) (owner: 10Stang) [20:02:29] RECOVERY - Check systemd state on wdqs2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:02:32] (03CR) 10Ssingh: "unit_tests/test_proxy_hdrs-unit_test_main.o: 'No space left on device' 😐" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/930230 (https://phabricator.wikimedia.org/T339134) (owner: 10Ssingh) [20:02:39] (03CR) 10Ssingh: "recheck" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/930230 (https://phabricator.wikimedia.org/T339134) 
(owner: 10Ssingh) [20:02:46] (03PS6) 10Herron: profile::pyrra::api: create profile [puppet] - 10https://gerrit.wikimedia.org/r/929729 (https://phabricator.wikimedia.org/T302995) [20:02:47] RECOVERY - SSH on wdqs1012 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:03:24] (03Merged) 10jenkins-bot: simplewiki: Remove "changetags" from registered user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930206 (https://phabricator.wikimedia.org/T339124) (owner: 10Stang) [20:03:53] (03CR) 10CI reject: [V: 04-1] profile::pyrra::api: create profile [puppet] - 10https://gerrit.wikimedia.org/r/929729 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [20:03:54] !log taavi@deploy1002 Started scap: Backport for [[gerrit:930206|simplewiki: Remove "changetags" from registered user (T339124)]] [20:03:57] T339124: Remove changetags permission for registered users on the Simple English Wikipedia - https://phabricator.wikimedia.org/T339124 [20:04:33] (03PS7) 10Herron: profile::pyrra::api: create profile [puppet] - 10https://gerrit.wikimedia.org/r/929729 (https://phabricator.wikimedia.org/T302995) [20:06:02] !log taavi@deploy1002 taavi and stang: Backport for [[gerrit:930206|simplewiki: Remove "changetags" from registered user (T339124)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:06:05] koi: please test your patch [20:06:34] taavi, tested Special:Listgrouprights and LGTM [20:06:39] syncing [20:07:23] PROBLEM - SSH on wdqs1012 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:07:33] (03PS1) 10AOkoth: vrts: failing over to vrts1001 [puppet] - 10https://gerrit.wikimedia.org/r/930245 (https://phabricator.wikimedia.org/T338418) [20:08:20] (03PS4) 10Herron: profile::pyrra::filesystem: add profile [puppet] - 10https://gerrit.wikimedia.org/r/929731 (https://phabricator.wikimedia.org/T302995) 
[20:08:48] 10SRE, 10Infrastructure-Foundations, 10Znuny, 10serviceops-collab, 10Patch-For-Review: VRTS eqiad replacement - https://phabricator.wikimedia.org/T338418 (10Arnoldokoth) [20:08:51] RECOVERY - SSH on wdqs1012 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:09:00] (03CR) 10Dzahn: [C: 03+1] vrts: failing over to vrts1001 [puppet] - 10https://gerrit.wikimedia.org/r/930245 (https://phabricator.wikimedia.org/T338418) (owner: 10AOkoth) [20:10:08] (03CR) 10AOkoth: [C: 03+2] vrts: failing over to vrts1001 [puppet] - 10https://gerrit.wikimedia.org/r/930245 (https://phabricator.wikimedia.org/T338418) (owner: 10AOkoth) [20:10:56] !log https://ticket.wikimedia.org down for migration [20:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:49] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:930206|simplewiki: Remove "changetags" from registered user (T339124)]] (duration: 08m 55s) [20:12:52] T339124: Remove changetags permission for registered users on the Simple English Wikipedia - https://phabricator.wikimedia.org/T339124 [20:13:16] koi: yours are done [20:13:22] albertoleoncio: looking at your patches now [20:13:23] PROBLEM - SSH on wdqs1012 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:13:29] ty [20:13:35] ok [20:13:51] taavi: is it too late for me to add something to the list? 
[20:14:06] arlolra: nope, please add it to wikitech [20:14:29] thanks [20:14:35] (03CR) 10Majavah: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929742 (https://phabricator.wikimedia.org/T338974) (owner: 10Albertoleoncio) [20:14:55] RECOVERY - SSH on wdqs1012 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:15:01] (03PS8) 10Herron: profile::pyrra::api: create profile [puppet] - 10https://gerrit.wikimedia.org/r/929729 (https://phabricator.wikimedia.org/T302995) [20:15:28] (03CR) 10CI reject: [V: 04-1] profile::pyrra::api: create profile [puppet] - 10https://gerrit.wikimedia.org/r/929729 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [20:16:14] (03PS1) 10Arlolra: Fix thumb styling on file description page [core] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930249 (https://phabricator.wikimedia.org/T337804) [20:16:39] albertoleoncio: I'm going to deploy the mobile talk page tab patch, but I'm not comfortable deploying the Translate one as I'm not familiar with the extension and can't check if the configuration you're proposing is fine or not. I'm going to add some reviewers to the patch so it can be deployed later [20:17:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929742 (https://phabricator.wikimedia.org/T338974) (owner: 10Albertoleoncio) [20:17:17] Sure! Not a problem [20:17:25] albertoleoncio: do you have the WikimediaDebug browser extension installed? [20:17:34] Yep [20:18:15] great! 
it'll take a minute or two for the patch to be available for testing, I'll let you know when it's ready [20:18:29] (03CR) 10Majavah: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930189 (https://phabricator.wikimedia.org/T339139) (owner: 10Albertoleoncio) [20:18:51] (03PS9) 10Herron: profile::pyrra::api: create profile [puppet] - 10https://gerrit.wikimedia.org/r/929729 (https://phabricator.wikimedia.org/T302995) [20:18:53] (03Merged) 10jenkins-bot: Enable mobile page tabs for everyone in ptwikisource. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929742 (https://phabricator.wikimedia.org/T338974) (owner: 10Albertoleoncio) [20:19:16] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T339168 (10phaultfinder) [20:19:19] !log taavi@deploy1002 Started scap: Backport for [[gerrit:929742|Enable mobile page tabs for everyone in ptwikisource. (T338974)]] [20:19:23] T338974: Enable tabs for talk pages (and pagination of transcripts) for non-logged in users on mobile version of ptwikisource - https://phabricator.wikimedia.org/T338974 [20:19:51] (03PS1) 10AOkoth: ticket: otrs1001 -> vrts1001 [dns] - 10https://gerrit.wikimedia.org/r/930251 (https://phabricator.wikimedia.org/T338418) [20:20:19] (03CR) 10Dzahn: [C: 03+1] ticket: otrs1001 -> vrts1001 [dns] - 10https://gerrit.wikimedia.org/r/930251 (https://phabricator.wikimedia.org/T338418) (owner: 10AOkoth) [20:21:17] !log taavi@deploy1002 taavi and albertoleoncio: Backport for [[gerrit:929742|Enable mobile page tabs for everyone in ptwikisource. 
(T338974)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:21:26] (03CR) 10AOkoth: [C: 03+2] ticket: otrs1001 -> vrts1001 [dns] - 10https://gerrit.wikimedia.org/r/930251 (https://phabricator.wikimedia.org/T338418) (owner: 10AOkoth) [20:21:36] (03PS5) 10Herron: profile::pyrra::filesystem: add profile [puppet] - 10https://gerrit.wikimedia.org/r/929731 (https://phabricator.wikimedia.org/T302995) [20:21:51] albertoleoncio: please test your patch on a debug server [20:22:35] Its working, tabs showing logged and unlogged [20:23:11] great! I will sync your patch to the entire cluster next [20:24:16] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T339168 (10phaultfinder) [20:29:30] (03PS1) 10Ottomata: Remove outdated docs from eventgate chart README [deployment-charts] - 10https://gerrit.wikimedia.org/r/930253 [20:29:43] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:929742|Enable mobile page tabs for everyone in ptwikisource. (T338974)]] (duration: 10m 23s) [20:29:47] T338974: Enable tabs for talk pages (and pagination of transcripts) for non-logged in users on mobile version of ptwikisource - https://phabricator.wikimedia.org/T338974 [20:30:28] albertoleoncio: your patch is live! [20:31:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [core] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930249 (https://phabricator.wikimedia.org/T337804) (owner: 10Arlolra) [20:31:58] arlolra: sorry I didn't realize you had a core backport earlier, otherwise I'd have +2'd it before to save some time. 
oh well [20:32:06] (03PS9) 10Ottomata: charts/eventgate - Use mesh.networkpolicy.egress and base.networkpolicy.egress.kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/684855 (https://phabricator.wikimedia.org/T335024) (owner: 10Giuseppe Lavagetto) [20:32:15] 10SRE, 10Infrastructure-Foundations, 10Znuny, 10serviceops-collab: VRTS eqiad replacement - https://phabricator.wikimedia.org/T338418 (10Arnoldokoth) [20:32:16] I think its not yet... hehehehe [20:32:25] (03CR) 10Ottomata: [C: 03+2] Remove outdated docs from eventgate chart README [deployment-charts] - 10https://gerrit.wikimedia.org/r/930253 (owner: 10Ottomata) [20:32:35] taavi: I've got nowhere to be [20:32:39] albertoleoncio: how so? [20:33:24] The tabs is not showing yet out of the debug server [20:33:48] Cache? [20:34:15] presumably yes, did you try purging the page? [20:34:29] (03CR) 10Ssingh: "CI failure due to "No space left on device", time to look into this." [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/930230 (https://phabricator.wikimedia.org/T339134) (owner: 10Ssingh) [20:34:49] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on otrs1001.eqiad.wmnet with reason: Replacing Host [20:34:51] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on otrs1001.eqiad.wmnet with reason: Replacing Host [20:35:47] I can't purge unlogged on mobile [20:36:28] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.079 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:36:48] if you're logged out, you're definitely just hitting the caches [20:36:54] they'll clear eventually [20:37:42] Ok, I'll let it go by itself. Thank you! [20:37:48] happy to help! 
[20:41:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [20:42:04] RECOVERY - Check systemd state on ms-be1044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:42:16] 10SRE, 10Infrastructure-Foundations, 10Znuny, 10serviceops-collab: VRTS eqiad replacement - https://phabricator.wikimedia.org/T338418 (10Arnoldokoth) [20:47:42] (03CR) 10Ssingh: "Reported in T339171." [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/930230 (https://phabricator.wikimedia.org/T339134) (owner: 10Ssingh) [20:51:38] (03Merged) 10jenkins-bot: Fix thumb styling on file description page [core] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930249 (https://phabricator.wikimedia.org/T337804) (owner: 10Arlolra) [20:51:57] there we go [20:51:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [20:52:07] !log taavi@deploy1002 Started scap: Backport for [[gerrit:930249|Fix thumb styling on file description page (T337804)]] [20:52:11] T337804: Thumb styling on the file description page and other callers of makeThumbLinkObj - https://phabricator.wikimedia.org/T337804 [20:53:46] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1044 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:54:25] !log 
taavi@deploy1002 arlolra and taavi: Backport for [[gerrit:930249|Fix thumb styling on file description page (T337804)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[20:54:32] arlolra: please test
[20:54:43] ok
[20:55:31] (PS1) Dzahn: peopleweb: sync home dirs from people1003 to people1004 [puppet] - https://gerrit.wikimedia.org/r/930257 (https://phabricator.wikimedia.org/T338827)
[20:55:48] (CR) CI reject: [V: -1] peopleweb: sync home dirs from people1003 to people1004 [puppet] - https://gerrit.wikimedia.org/r/930257 (https://phabricator.wikimedia.org/T338827) (owner: Dzahn)
[20:55:53] (PS2) Dzahn: peopleweb: sync home dirs from people1003 to people1004 [puppet] - https://gerrit.wikimedia.org/r/930257 (https://phabricator.wikimedia.org/T338827)
[20:56:17] taavi: looks good, please proceed
[20:56:30] SRE, Infrastructure-Foundations, Znuny, serviceops-collab: VRTS eqiad replacement - https://phabricator.wikimedia.org/T338418 (Arnoldokoth) Thanks again @Dzahn @Eoghan Seems to be working fine for now. Test ticket: https://ticket.wikimedia.org/otrs/index.pl?Action=AgentTicketZoom;TicketID=1284...
[20:56:43] syncing
[20:59:34] SRE, Infrastructure-Foundations, Znuny, serviceops-collab: VRTS eqiad replacement - https://phabricator.wikimedia.org/T338418 (Dzahn) @Arnoldokoth Great that all worked smoothly! Thank you. There is this other ticket (T295416) that was about upgrading OTRS (all hosts) to bullseye. Seems like t...
[20:59:53] the window will overflow by a few minutes, but that should not be a problem since there's nothing directly afterwards
[21:00:56] SRE, Infrastructure-Foundations, Znuny, serviceops-collab: upgrade/replace VRTS (formerly ORTS) buster to bullseye - https://phabricator.wikimedia.org/T295416 (Dzahn) In T338418 otrs1001 was replaced by vrts1001 in production today. Details are over there.
This means we now have a bullseye VRTS...
[21:02:52] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:930249|Fix thumb styling on file description page (T337804)]] (duration: 10m 44s)
[21:02:56] T337804: Thumb styling on the file description page and other callers of makeThumbLinkObj - https://phabricator.wikimedia.org/T337804
[21:02:58] all done
[21:03:07] taavi: thank you
[21:05:21] (PS1) Ottomata: eventgate-* - use kafka egress and service mesh [deployment-charts] - https://gerrit.wikimedia.org/r/930259 (https://phabricator.wikimedia.org/T253058)
[21:05:37] SRE, LDAP-Access-Requests, Release-Engineering-Team (Radar): Grant Access to gerritadmin for Aklapper - https://phabricator.wikimedia.org/T339173 (thcipriani)
[21:06:31] (CR) CI reject: [V: -1] eventgate-* - use kafka egress and service mesh [deployment-charts] - https://gerrit.wikimedia.org/r/930259 (https://phabricator.wikimedia.org/T253058) (owner: Ottomata)
[21:11:35] (PS7) Ottomata: eventgate-* - use kafka egress and service mesh [deployment-charts] - https://gerrit.wikimedia.org/r/684856 (https://phabricator.wikimedia.org/T253058) (owner: Giuseppe Lavagetto)
[21:12:00] (Abandoned) Ottomata: charts/eventgate - Use mesh.networkpolicy.egress and base.networkpolicy.egress.kafka [deployment-charts] - https://gerrit.wikimedia.org/r/684855 (https://phabricator.wikimedia.org/T335024) (owner: Giuseppe Lavagetto)
[21:12:14] (Restored) Ottomata: charts/eventgate - Use mesh.networkpolicy.egress and base.networkpolicy.egress.kafka [deployment-charts] - https://gerrit.wikimedia.org/r/684855 (https://phabricator.wikimedia.org/T335024) (owner: Giuseppe Lavagetto)
[21:12:41] (CR) CI reject: [V: -1] eventgate-* - use kafka egress and service mesh [deployment-charts] - https://gerrit.wikimedia.org/r/684856 (https://phabricator.wikimedia.org/T253058) (owner: Giuseppe Lavagetto)
[21:12:43] (Abandoned) Ottomata: eventgate-* - use kafka egress and service
mesh [deployment-charts] - https://gerrit.wikimedia.org/r/930259 (https://phabricator.wikimedia.org/T253058) (owner: Ottomata)
[21:12:58] (CR) Ottomata: "(oops, abandoned the wrong patch, restored.)" [deployment-charts] - https://gerrit.wikimedia.org/r/684855 (https://phabricator.wikimedia.org/T335024) (owner: Giuseppe Lavagetto)
[21:14:21] (PS8) Ottomata: eventgate-* - use kafka egress and service mesh [deployment-charts] - https://gerrit.wikimedia.org/r/684856 (https://phabricator.wikimedia.org/T253058) (owner: Giuseppe Lavagetto)
[21:15:17] (CR) CI reject: [V: -1] eventgate-* - use kafka egress and service mesh [deployment-charts] - https://gerrit.wikimedia.org/r/684856 (https://phabricator.wikimedia.org/T253058) (owner: Giuseppe Lavagetto)
[21:22:12] (PS1) Eevans: cassandra: do not hardcode monitoring ports [puppet] - https://gerrit.wikimedia.org/r/930262 (https://phabricator.wikimedia.org/T338639)
[21:23:25] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/930262 (https://phabricator.wikimedia.org/T338639) (owner: Eevans)
[21:32:52] SRE-Access-Requests, Phabricator: phabricator: make jnuche and dancy admins, remove dzahn - https://phabricator.wikimedia.org/T339174 (Dzahn)
[21:35:24] (PS1) Cathal Mooney: Validate port block speed combo in server provision script for QFX5120 [software/netbox-extras] - https://gerrit.wikimedia.org/r/930264 (https://phabricator.wikimedia.org/T303529)
[21:38:40] !log phabricator - made dancy (https://phabricator.wikimedia.org/people/manage/25411/) an administrator (T339174)
[21:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:44] T339174: phabricator: make jnuche and dancy admins, remove dzahn - https://phabricator.wikimedia.org/T339174
[21:39:23] (PS2) Cathal Mooney: Validate port block speed combo in server provision script for QFX5120 [software/netbox-extras] - https://gerrit.wikimedia.org/r/930264
(https://phabricator.wikimedia.org/T303529)
[21:47:34] (CR) BCornwall: "This change is ready for review." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/929768 (https://phabricator.wikimedia.org/T301944) (owner: BCornwall)
[21:47:36] (PS7) BCornwall: prometheus: Disable SNI support in Envoy tlsproxy [puppet] - https://gerrit.wikimedia.org/r/929768 (https://phabricator.wikimedia.org/T301944)
[21:52:17] SRE, SRE-tools, Infrastructure-Foundations, Traffic: Write a cookbook to roll reboot cache hosts - https://phabricator.wikimedia.org/T338783 (BCornwall)
[21:52:23] SRE, Traffic: Create a cookbook to reboot CDN hosts - https://phabricator.wikimedia.org/T338813 (BCornwall)
[21:57:30] (PS14) BCornwall: Create a CDN host reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835)
[21:59:54] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:02:58] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:10:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:11:06] (CR) Dzahn: [C: +2] peopleweb: sync home dirs from people1003 to people1004 [puppet] - https://gerrit.wikimedia.org/r/930257 (https://phabricator.wikimedia.org/T338827) (owner: Dzahn)
[22:15:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:17:34] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:57:50] (PS1) Dzahn: peoplweb: sync home dirs from people2002 to people1004 [puppet] - https://gerrit.wikimedia.org/r/930272 (https://phabricator.wikimedia.org/T338827)
[23:05:01] (CR) Dzahn: [C: +2] peoplweb: sync home dirs from people2002 to people1004 [puppet] - https://gerrit.wikimedia.org/r/930272 (https://phabricator.wikimedia.org/T338827) (owner: Dzahn)
[23:05:11] (PS2) Dzahn: peoplweb: sync home dirs from people2002 to people1004 [puppet] - https://gerrit.wikimedia.org/r/930272 (https://phabricator.wikimedia.org/T338827)
[23:05:55] (PS3) Dzahn: peopleweb: sync home dirs from people2002 to people1004 [puppet] - https://gerrit.wikimedia.org/r/930272 (https://phabricator.wikimedia.org/T338827)
[23:07:06] RECOVERY -
Blazegraph process -wdqs-blazegraph- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[23:11:40] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[23:15:35] (CR) Dzahn: [C: +1] miscweb: add transparencyreport release to eqiad and codfw [deployment-charts] - https://gerrit.wikimedia.org/r/930188 (https://phabricator.wikimedia.org/T338781) (owner: Jelto)
[23:20:42] (PS1) Dzahn: people.wikimedia.org: switch backend from people2002 to people1004 [dns] - https://gerrit.wikimedia.org/r/930274 (https://phabricator.wikimedia.org/T338827)
[23:23:39] (PS2) Dzahn: people.wikimedia.org: switch backend from people2002 to people1004 [dns] - https://gerrit.wikimedia.org/r/930274 (https://phabricator.wikimedia.org/T338827)
[23:25:49] (CR) Dzahn: "by the way, this is the way to get the technical answer to what the current active DC is:" [dns] - https://gerrit.wikimedia.org/r/930274 (https://phabricator.wikimedia.org/T338827) (owner: Dzahn)
[23:26:00] (CR) Cwhite: [C: +1] "LGTM!"
[puppet] - https://gerrit.wikimedia.org/r/929719 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[23:26:31] (CR) Dzahn: [C: +2] people.wikimedia.org: switch backend from people2002 to people1004 [dns] - https://gerrit.wikimedia.org/r/930274 (https://phabricator.wikimedia.org/T338827) (owner: Dzahn)
[23:29:17] (CR) Cwhite: [C: +1] profile::pyrra::api: create profile [puppet] - https://gerrit.wikimedia.org/r/929729 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[23:30:33] (CR) Cwhite: [C: +1] profile::pyrra::filesystem: add profile [puppet] - https://gerrit.wikimedia.org/r/929731 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[23:30:35] !log mvernon@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe
[23:31:23] (PS1) Dzahn: peopleweb: make people1004 new source, people2003 new destination [puppet] - https://gerrit.wikimedia.org/r/930275 (https://phabricator.wikimedia.org/T338827)
[23:33:05] (PS2) Dzahn: peopleweb: make people1004 new source, people2003 new destination [puppet] - https://gerrit.wikimedia.org/r/930275 (https://phabricator.wikimedia.org/T338827)
[23:33:39] (CR) Dzahn: [C: +2] peopleweb: make people1004 new source, people2003 new destination [puppet] - https://gerrit.wikimedia.org/r/930275 (https://phabricator.wikimedia.org/T338827) (owner: Dzahn)
[23:38:35] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T338333 (phaultfinder)
[23:38:58] !log mvernon@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe
[23:40:48] (CR) Dzahn: [C: +2] "Also we have tests for this now and I used them:" [dns] - https://gerrit.wikimedia.org/r/930274 (https://phabricator.wikimedia.org/T338827) (owner: Dzahn)