[00:18:10] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:28] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:39:40] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/922536 [00:39:46] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/922536 (owner: 10TrainBranchBot) [00:41:48] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:41:50] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:43:00] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:47:50] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:58:14] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:58:30] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/922536 (owner: 10TrainBranchBot) [01:03:10] 
RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:03:30] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:05:30] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:05:46] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:26:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:19:36] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hdfs_rsync_analytics_hadoop_published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:22:26] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML 
fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [04:23:52] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [04:30:28] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:44:15] (03PS2) 10KartikMistry: Update cxserver to 2023-05-24-115506-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/922826 (https://phabricator.wikimedia.org/T337290) [05:11:16] PROBLEM - MariaDB Replica SQL: s1 on db1154 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table enwiki.user_properties: Cant find record in user_properties, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1196-bin.001099, end_log_pos 654625806 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:11:44] PROBLEM - MariaDB Replica SQL: s2 on db1155 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table plwiki.user_properties: Cant find record in user_properties, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1156-bin.003729, end_log_pos 633898246 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:16:56] PROBLEM - Host an-worker1125 is DOWN: PING CRITICAL - Packet loss = 100% [05:17:50] PROBLEM - Host db2110 #page is DOWN: PING CRITICAL - Packet loss = 100% [05:18:18] checking [05:19:02] PROBLEM - MariaDB Replica SQL: s7 on db1155 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table arwiki.pagelinks: Cant find record in pagelinks, Error_code: 1032: handler error 
HA_ERR_KEY_NOT_FOUND: the events master log db1158-bin.004706, end_log_pos 225760117 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:19:18] RECOVERY - Host db2110 #page is UP: PING WARNING - Packet loss = 66%, RTA = 31.64 ms [05:19:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2110', diff saved to https://phabricator.wikimedia.org/P48503 and previous config saved to /var/cache/conftool/dbconfig/20230525-051923-root.json [05:20:12] PROBLEM - MariaDB Replica SQL: s5 on db1154 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table dewiki.flaggedpage_pending: Cant find record in flaggedpage_pending, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1161-bin.001646, end_log_pos 385492288 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:20:14] marostegui: Can I deploy cxserver? [05:20:19] kart_: yes [05:20:35] marostegui: Thanks [05:20:42] There's something wrong also with sanitarium db1154 [05:21:07] I deal with db1154 [05:21:14] I think I know what's going on [05:21:44] (03PS1) 10Marostegui: db2110: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/922958 [05:21:48] Amir1: what is going on? 
[05:22:00] PROBLEM - MariaDB Replica Lag: s1 on db1154 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 920.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:22:11] (03CR) 10Marostegui: [C: 03+2] db2110: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/922958 (owner: 10Marostegui) [05:22:17] I think it's flaggedrevs schema drift in sanitarium [05:22:19] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-05-24-115506-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/922826 (https://phabricator.wikimedia.org/T337290) (owner: 10KartikMistry) [05:22:28] PROBLEM - MariaDB Replica Lag: s2 on clouddb1018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 909.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:22:32] PROBLEM - MariaDB Replica SQL: s4 on db2110 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:22:40] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 959.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:22:42] Amir1: what I briefly saw on db1154:3311 was related to enwiki.user_properties [05:22:44] PROBLEM - mysqld processes on db2110 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [05:22:44] PROBLEM - MariaDB Replica Lag: s1 on clouddb1013 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 963.98 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:22:48] PROBLEM - MariaDB Replica Lag: s1 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 968.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:22:58] PROBLEM - MariaDB read only s4 on db2110 is CRITICAL: 
Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [05:23:00] (03Merged) 10jenkins-bot: Update cxserver to 2023-05-24-115506-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/922826 (https://phabricator.wikimedia.org/T337290) (owner: 10KartikMistry) [05:23:06] PROBLEM - MariaDB Replica Lag: s2 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 945.57 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:23:08] PROBLEM - MariaDB Replica Lag: s2 on clouddb1014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 947.51 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:23:27] I got this PROBLEM - MariaDB Replica SQL: s5 on db1154 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table dewiki.flaggedpage_pending: Cant find record in flaggedpage_pending, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1161-bin.001646, end_log_pos 385492288 [05:23:30] PROBLEM - MariaDB Replica Lag: s2 on db1155 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 970.80 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:24:01] PROBLEM - MariaDB read only s4 on db2110 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [05:24:29] I am with db2110 [05:24:32] https://phabricator.wikimedia.org/T337445 [05:24:47] okay, it's quite noisy sigh [05:24:52] I just downtimed [05:25:06] (03PS1) 10KartikMistry: Revert "Update cxserver to 2023-05-24-115506-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/922857 [05:25:36] !incidents [05:25:36] 3678 (ACKED) Host db2110 (paged) - PING - Packet loss = 100% [05:25:53] (03CR) 10KartikMistry: [C: 03+2] Revert "Update 
cxserver to 2023-05-24-115506-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/922857 (owner: 10KartikMistry) [05:26:33] (03Merged) 10jenkins-bot: Revert "Update cxserver to 2023-05-24-115506-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/922857 (owner: 10KartikMistry) [05:26:33] so back to db1154/db1155 all of them seem to be trying to delete a row and not finding it [05:27:07] not one row, a different row in each section, different tables too, s7 is pagelinks [05:27:32] my guess is that somehow it got corrupted [05:28:06] yes [05:28:18] I rebooted them yesterday for the kernel thing, which doesn't explain any of this [05:28:27] But it is the most probable cause (still doesn't make sense) [05:28:43] They probably need to be entirely rebuilt [05:29:54] (03PS1) 10KartikMistry: Update cxserver to 2023-05-24-115506-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/922959 (https://phabricator.wikimedia.org/T337290) [05:30:02] I can try to fix those missing rows, but I am sure there will be more [05:30:21] I think it is better to rebuild them and hope that clouddb* hosts are ok [05:30:37] Can you create a task? 
[05:31:08] PROBLEM - MariaDB Replica Lag: s7 on clouddb1018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 924.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:31:08] PROBLEM - MariaDB Replica Lag: s7 on db1155 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 924.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:31:20] PROBLEM - MariaDB Replica Lag: s5 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 917.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:31:28] PROBLEM - MariaDB Replica Lag: s7 on clouddb1014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 944.52 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:31:36] PROBLEM - MariaDB Replica Lag: s7 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 951.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:31:37] sure [05:32:02] PROBLEM - MariaDB Replica Lag: s5 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 958.74 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:32:12] PROBLEM - MariaDB Replica Lag: s5 on db1154 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 968.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:32:24] PROBLEM - MariaDB Replica Lag: s5 on clouddb1020 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 980.83 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:32:34] on it [05:33:06] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:33:50] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-05-24-115506-production [deployment-charts] - 
10https://gerrit.wikimedia.org/r/922959 (https://phabricator.wikimedia.org/T337290) (owner: 10KartikMistry) [05:34:31] (03Merged) 10jenkins-bot: Update cxserver to 2023-05-24-115506-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/922959 (https://phabricator.wikimedia.org/T337290) (owner: 10KartikMistry) [05:35:08] RECOVERY - mysqld processes on db2110 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [05:35:24] RECOVERY - MariaDB read only s4 on db2110 is OK: Version 10.4.26-MariaDB-log, Uptime 46s, read_only: True, event_scheduler: True, 2635.16 QPS, connection latency: 0.004890s, query latency: 0.000560s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [05:36:00] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [05:36:20] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [05:36:28] RECOVERY - MariaDB Replica SQL: s4 on db2110 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:36:59] marostegui: T337446 also it might be the case that somehow replication was re-played twice? 
[05:36:59] T337446: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 [05:37:03] I go get coffee [05:37:48] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:41:09] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [05:41:39] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [05:46:41] <_joe_> jouncebot: nowandnext [05:46:41] No deployments scheduled for the next 0 hour(s) and 13 minute(s) [05:46:41] In 0 hour(s) and 13 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T0600) [05:46:41] In 0 hour(s) and 13 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T0600) [05:48:22] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [05:48:39] <_joe_> Amir1, marostegui can I steal your window if you're not doing switchovers? 
[05:48:47] <_joe_> I have something structural for mw on k8s [05:48:51] yep [05:48:55] <_joe_> thanks [05:48:59] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [05:49:24] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: update modules, enable named ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/919058 (https://phabricator.wikimedia.org/T271822) (owner: 10Giuseppe Lavagetto) [05:49:34] (03CR) 10CI reject: [V: 04-1] mediawiki: update modules, enable named ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/919058 (https://phabricator.wikimedia.org/T271822) (owner: 10Giuseppe Lavagetto) [05:49:42] PROBLEM - MariaDB Replica SQL: s5 on db1154 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table dewiki.flaggedpage_pending: Cant find record in flaggedpage_pending, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1161-bin.001646, end_log_pos 385621020 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:50:43] (03CR) 10Ladsgroup: [C: 03+1] "LGTM, we can merge it as is" [puppet] - 10https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705) (owner: 10Giuseppe Lavagetto) [05:51:06] PROBLEM - MariaDB Replica SQL: s5 on clouddb1016 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Write_rows_v1 event on table dewiki.flaggedpage_pending: Duplicate entry 1225932-0 for key PRIMARY, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db1154-bin.001684, end_log_pos 803 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:51:06] PROBLEM - MariaDB Replica SQL: s5 on clouddb1020 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Write_rows_v1 event on table dewiki.flaggedpage_pending: Duplicate entry 1225932-0 for 
key PRIMARY, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db1154-bin.001684, end_log_pos 803 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:51:09] Amir1: can you downtime all wikireplicas? [05:52:18] PROBLEM - MariaDB Replica SQL: s5 on clouddb1021 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Write_rows_v1 event on table dewiki.flaggedpage_pending: Duplicate entry 1225932-0 for key PRIMARY, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db1154-bin.001684, end_log_pos 803 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:52:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1161', diff saved to https://phabricator.wikimedia.org/P48504 and previous config saved to /var/cache/conftool/dbconfig/20230525-055236-root.json [05:53:21] (03PS1) 10Marostegui: mariadb: Decommission db1154, db1161 [puppet] - 10https://gerrit.wikimedia.org/r/923154 [05:54:25] marostegui: sure on it [05:54:58] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db1154, db1161 [puppet] - 10https://gerrit.wikimedia.org/r/923154 (owner: 10Marostegui) [05:55:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 9 hosts with reason: T337446 [05:55:39] T337446: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 [05:55:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 9 hosts with reason: T337446 [05:55:59] done [05:57:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1156', diff saved to https://phabricator.wikimedia.org/P48506 and previous config saved to /var/cache/conftool/dbconfig/20230525-055734-root.json [06:00:06] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T0600) [06:00:06] kormat, marostegui, and 
Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T0600). [06:01:12] (03PS3) 10Giuseppe Lavagetto: mediawiki: enable named ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/919058 (https://phabricator.wikimedia.org/T271822) [06:05:09] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: enable named ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/919058 (https://phabricator.wikimedia.org/T271822) (owner: 10Giuseppe Lavagetto) [06:06:05] (03Merged) 10jenkins-bot: mediawiki: enable named ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/919058 (https://phabricator.wikimedia.org/T271822) (owner: 10Giuseppe Lavagetto) [06:09:01] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [06:19:07] RECOVERY - MariaDB Replica SQL: s1 on db1154 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:41:56] RECOVERY - MariaDB Replica Lag: s5 on clouddb1021 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:42:26] RECOVERY - MariaDB Replica Lag: s5 on clouddb1020 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:42:32] RECOVERY - MariaDB Replica Lag: s5 on clouddb1016 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:42:42] RECOVERY - MariaDB Replica SQL: s5 on clouddb1020 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:42:42] RECOVERY - MariaDB Replica SQL: s5 on clouddb1016 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:42:42] RECOVERY - MariaDB 
Replica SQL: s5 on clouddb1021 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:44:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1196', diff saved to https://phabricator.wikimedia.org/P48509 and previous config saved to /var/cache/conftool/dbconfig/20230525-064418-root.json [06:44:39] (03PS11) 10Gmodena: mw-page-content-change-enrich: enable checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) [06:46:47] (03PS1) 10Marostegui: db1196: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/923159 [06:50:04] (03CR) 10Marostegui: [C: 03+2] db1196: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/923159 (owner: 10Marostegui) [07:00:06] Amir1, apergos, and jnuche: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T0700). [07:00:06] matthiasmullie: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:11] morning! there are no trainees signed up today and one developer with two patchsets in the window for deployment. matthiasmullie do you usually self-deploy or should we deploy for you? Sorry that I ask this every time... [07:01:00] o/ [07:01:07] I can self-deploy [07:01:11] ok [07:01:18] the first patch seems straightforward enough [07:01:30] I have a couple questions about the second one since it touches a bunch of files [07:01:38] Sure! [07:02:08] does the maintenance script that is being changed run periodically, and will the deploy be during a time when it's not liable to start running? 
[07:02:18] (03PS2) 10Matthias Mullie: [WikibaseMediaInfo] Add 'main subject of' property [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921561 [07:02:23] matthiasmullie: re the 2nd script, does it need to be on wmf.10 too? [07:02:50] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by mlitn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921561 (owner: 10Matthias Mullie) [07:03:28] (03Merged) 10jenkins-bot: [WikibaseMediaInfo] Add 'main subject of' property [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921561 (owner: 10Matthias Mullie) [07:04:20] !log mlitn@deploy1002 Started scap: Backport for [[gerrit:921561|[WikibaseMediaInfo] Add 'main subject of' property]] [07:04:36] apergos: it runs weekly, on Wed morning; it is not running and will not until next Wed (see https://gerrit.wikimedia.org/g/operations/puppet/+/production/modules/profile/manifests/mediawiki/maintenance/image_suggestions.pp) [07:05:02] gotcha [07:05:22] and do you have a good method to test it on the mwdebug hosts as well as after the scap completes on the production cluster? [07:06:01] !log mlitn@deploy1002 mlitn: Backport for [[gerrit:921561|[WikibaseMediaInfo] Add 'main subject of' property]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [07:06:15] <_joe_> matthiasmullie: can you stop there for a sec? [07:06:25] _joe_: yes [07:06:26] <_joe_> I think I need to unlock k8s deployments for you [07:06:31] <_joe_> give me 3-4 minutes [07:06:39] sure! 
[07:07:03] RhinosF1: wmf.10 is not urgent; we simply need to "test" it on prod data [07:07:38] <_joe_> matthiasmullie: you can run your script manually on the mwdebug servers using mwscript IIRC [07:07:54] matthiasmullie: but by next Wednesday, wmf.10 is already going to be everywhere when it actually runs [07:08:39] apergos: I was planning to run the script manually; there are params that will output all relevant data (--verbose) while being a no-op (--quiet) [07:08:57] great! that does it for me [07:10:06] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [07:10:46] RhinosF1: yes, that is fine; the code currently on wmf.9 and wmf.10 is fine, and the new patch only changes how it works internally (process things via job queue) - I want to test the new patch (on wmf.9). If/once it appears it all is working well, I can either submit another backport for wmf.10, or skip that backport altogether (because the current code will still be fine) [07:10:56] <_joe_> matthiasmullie: please proceed [07:11:00] _joe_: rgr, thanks [07:12:15] matthiasmullie: makes sense [07:15:51] RECOVERY - MariaDB Replica Lag: s2 on clouddb1021 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:15:57] RECOVERY - MariaDB Replica Lag: s2 on clouddb1014 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:16:17] RECOVERY - MariaDB Replica Lag: s2 on clouddb1018 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:16:41] RECOVERY - MariaDB Replica SQL: s2 on db1155 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:16:45] RECOVERY - MariaDB Replica Lag: s2 on db1155 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica 
[07:16:52] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: Add support for scraping php applications to the kubernetes prometheus scraper - https://phabricator.wikimedia.org/T271822 (10Joe) 05Open→03Resolved [07:16:55] 10SRE, 10Patch-For-Review, 10Service-deployment-requests: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423 (10Joe) [07:17:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1158', diff saved to https://phabricator.wikimedia.org/P48511 and previous config saved to /var/cache/conftool/dbconfig/20230525-071719-root.json [07:17:54] (03PS1) 10Marostegui: db1158: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/923243 [07:18:21] (03CR) 10Marostegui: [C: 03+2] db1158: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/923243 (owner: 10Marostegui) [07:18:23] !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:921561|[WikibaseMediaInfo] Add 'main subject of' property]] (duration: 14m 02s) [07:19:51] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by mlitn@deploy1002 using scap backport" [extensions/ImageSuggestions] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/922853 (https://phabricator.wikimedia.org/T322872) (owner: 10Matthias Mullie) [07:25:29] (03PS1) 10Marostegui: db1155: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/923244 [07:26:05] (03CR) 10Marostegui: [C: 03+2] db1155: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/923244 (owner: 10Marostegui) [07:34:03] (03PS1) 10Slyngshede: C:IDM Ensure service restart on git update [puppet] - 10https://gerrit.wikimedia.org/r/923245 [07:34:21] (03CR) 10Volans: [C: 04-1] "There's a typo. I've left also a suggestion and a question inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705) (owner: 10Giuseppe Lavagetto) [07:35:15] (03Merged) 10jenkins-bot: Change maint script to do work via jobs [extensions/ImageSuggestions] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/922853 (https://phabricator.wikimedia.org/T322872) (owner: 10Matthias Mullie) [07:35:45] !log mlitn@deploy1002 Started scap: Backport for [[gerrit:922853|Change maint script to do work via jobs (T322872)]] [07:35:50] T322872: [L] Change how we send image-suggestions notifications to experienced users - https://phabricator.wikimedia.org/T322872 [07:37:16] !log mlitn@deploy1002 mlitn: Backport for [[gerrit:922853|Change maint script to do work via jobs (T322872)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [07:45:09] (03PS2) 10Slyngshede: C:IDM Ensure service restart on git update [puppet] - 10https://gerrit.wikimedia.org/r/923245 [07:51:56] (03PS1) 10Giuseppe Lavagetto: mediawiki: fix deployment annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/923247 [07:51:57] !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:922853|Change maint script to do work via jobs (T322872)]] (duration: 16m 12s) [07:52:02] T322872: [L] Change how we send image-suggestions notifications to experienced users - https://phabricator.wikimedia.org/T322872 [07:52:41] !log UTC morning backports done [07:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:26] ah, good on production too? great! 
[07:55:59] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1014 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:57:34] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T337451 (10phaultfinder) [08:02:14] (03CR) 10Jelto: [C: 03+1] "Makes sense to remove the blackbox check from the legacy puppet code for now. According to the prometheus logs blackbox monitor still conn" [puppet] - 10https://gerrit.wikimedia.org/r/922918 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [08:03:10] (03CR) 10Jelto: [C: 03+2] microsites: remove http blackbox monitor for 15.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/922918 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [08:03:24] (03PS1) 10Ayounsi: Add local config files to .gitignore [software/spicerack] - 10https://gerrit.wikimedia.org/r/923249 [08:04:27] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1014 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:06:51] (03CR) 10Volans: Add local config files to .gitignore (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/923249 (owner: 10Ayounsi) [08:08:35] (03PS4) 10Fabfur: Add a new cookbook that allows to run puppet configuration while restarting Varnish [cookbooks] - 10https://gerrit.wikimedia.org/r/922844 (https://phabricator.wikimedia.org/T323557) [08:11:21] 10SRE, 10Traffic, 10envoy, 10serviceops: Remove tls_minimum_protocol_version from envoy config - https://phabricator.wikimedia.org/T337453 (10JMeybohm) [08:11:37] 10SRE, 10Traffic, 10envoy, 10serviceops: Remove tls_minimum_protocol_version from envoy config - https://phabricator.wikimedia.org/T337453 (10JMeybohm) p:05Triage→03Low [08:13:37] (03PS2) 10Ayounsi: Add local config files to .gitignore 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/923249 [08:14:04] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/923249 (owner: 10Ayounsi) [08:14:25] (03CR) 10Fabfur: Add a new cookbook that allows to run puppet configuration while restarting Varnish (0311 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/922844 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [08:15:29] (03CR) 10Fabfur: Add a new cookbook that allows to run puppet configuration while restarting Varnish (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/922844 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [08:16:31] (03PS1) 10Ayounsi: Add the plugins directory to .gitignore [software/homer] - 10https://gerrit.wikimedia.org/r/923251 [08:17:09] (03PS3) 10Slyngshede: C:IDM Ensure service restart on git update [puppet] - 10https://gerrit.wikimedia.org/r/923245 [08:17:38] (03CR) 10Volans: [C: 04-1] "It's already there as homer_plugins ;)" [software/homer] - 10https://gerrit.wikimedia.org/r/923251 (owner: 10Ayounsi) [08:18:01] (03CR) 10Ayounsi: [C: 03+2] Add local config files to .gitignore (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/923249 (owner: 10Ayounsi) [08:18:18] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41331/console" [puppet] - 10https://gerrit.wikimedia.org/r/923245 (owner: 10Slyngshede) [08:18:39] (03Abandoned) 10Ayounsi: Add the plugins directory to .gitignore [software/homer] - 10https://gerrit.wikimedia.org/r/923251 (owner: 10Ayounsi) [08:19:46] (03PS1) 10Matthias Mullie: Change maint script to do work via jobs [extensions/ImageSuggestions] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923252 [08:22:17] (03Merged) 10jenkins-bot: Add local config files to .gitignore [software/spicerack] - 10https://gerrit.wikimedia.org/r/923249 (owner: 10Ayounsi) [08:27:08] (03CR) 10Filippo 
Giunchedi: [C: 03+2] profile: ensure varnish-aggregate-client-status-codes absent [puppet] - 10https://gerrit.wikimedia.org/r/922534 (https://phabricator.wikimedia.org/T288196) (owner: 10Cwhite) [08:27:18] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/922534 (https://phabricator.wikimedia.org/T288196) (owner: 10Cwhite) [08:32:01] !log revoke kafka_mirror_maker TLS cert (cergen based), remove old cergen certs from puppet private - T337248 [08:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:06] T337248: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 [08:42:53] (03PS3) 10Cathal Mooney: Add class-of-service parent interface shaper for sub-rated services [homer/public] - 10https://gerrit.wikimedia.org/r/922603 (https://phabricator.wikimedia.org/T337220) [08:43:50] (03PS4) 10Slyngshede: C:IDM Ensure service restart on git update [puppet] - 10https://gerrit.wikimedia.org/r/923245 [08:45:06] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Add network devices fingerprints to known_hosts - https://phabricator.wikimedia.org/T327643 (10jbond) > Netbox would be better. +1 this would also allow use to have them in the netbox-hiera pipeline which in turn makes it easier to add them all to... [08:47:30] (03CR) 10Slyngshede: "Rename a service to align with how be name similar services in other projects." 
[puppet] - 10https://gerrit.wikimedia.org/r/923245 (owner: 10Slyngshede) [08:48:29] (03PS1) 10Marostegui: Revert "db1196: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/923267 [08:48:58] (03CR) 10Marostegui: [C: 03+2] Revert "db1196: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/923267 (owner: 10Marostegui) [08:49:10] (03PS4) 10Cathal Mooney: Add class-of-service parent interface shaper for sub-rated services [homer/public] - 10https://gerrit.wikimedia.org/r/922603 (https://phabricator.wikimedia.org/T337220) [08:49:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48513 and previous config saved to /var/cache/conftool/dbconfig/20230525-084912-root.json [08:49:28] (03PS2) 10Jelto: trafficserver: switch annual.wikimedia.org backend [puppet] - 10https://gerrit.wikimedia.org/r/922791 (https://phabricator.wikimedia.org/T337041) [08:49:54] 10SRE, 10Traffic, 10envoy, 10serviceops: Remove tls_minimum_protocol_version from envoy config - https://phabricator.wikimedia.org/T337453 (10Joe) It would be great if envoy fixed the TLS 1.3 to work well when two envoys talk to each other - we should check if that's been solved in the latest versions. 
[08:52:47] (03CR) 10Jbond: puppetmaster: add new function to check for local files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922877 (https://phabricator.wikimedia.org/T268344) (owner: 10Jbond) [08:53:13] (03PS1) 10Elukey: profile::kafka::{broker,mirror}: simplify dependencies [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248) [08:53:33] (03CR) 10Jelto: [C: 03+2] trafficserver: switch annual.wikimedia.org backend [puppet] - 10https://gerrit.wikimedia.org/r/922791 (https://phabricator.wikimedia.org/T337041) (owner: 10Jelto) [08:53:36] (03CR) 10CI reject: [V: 04-1] profile::kafka::{broker,mirror}: simplify dependencies [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey) [08:57:31] (03PS1) 10KartikMistry: Show Contribute menu item in main menu when Special:Contribute is enabled [skins/MinervaNeue] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/923268 (https://phabricator.wikimedia.org/T336838) [08:58:15] (03PS1) 10KartikMistry: Show Contribute menu item in main menu when Special:Contribute is enabled [skins/MinervaNeue] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923269 (https://phabricator.wikimedia.org/T336838) [08:59:06] (ProbeDown) resolved: Service miscweb2003:443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:59:26] (03PS1) 10Btullis: Revert "Re-enable an-test-worker1001 in the analytics_test_cluster" [puppet] - 10https://gerrit.wikimedia.org/r/923270 [09:00:02] (03CR) 10Btullis: [C: 03+2] Revert "Re-enable an-test-worker1001 in the analytics_test_cluster" [puppet] - 10https://gerrit.wikimedia.org/r/923270 (owner: 10Btullis) [09:04:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 
3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48514 and previous config saved to /var/cache/conftool/dbconfig/20230525-090417-root.json [09:06:19] (03PS2) 10Elukey: profile::kafka::{broker,mirror}: simplify dependencies [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248) [09:06:41] (03CR) 10CI reject: [V: 04-1] profile::kafka::{broker,mirror}: simplify dependencies [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey) [09:08:54] (03PS1) 10Marostegui: mariadb: Make db2179 candidate master for s4 [puppet] - 10https://gerrit.wikimedia.org/r/923261 (https://phabricator.wikimedia.org/T337445) [09:09:20] (03CR) 10Jbond: "see comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/923245 (owner: 10Slyngshede) [09:10:36] !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host dbproxy1026.mgmt.eqiad.wmnet with reboot policy FORCED [09:11:02] (03CR) 10Marostegui: [C: 03+2] mariadb: Make db2179 candidate master for s4 [puppet] - 10https://gerrit.wikimedia.org/r/923261 (https://phabricator.wikimedia.org/T337445) (owner: 10Marostegui) [09:11:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2179', diff saved to https://phabricator.wikimedia.org/P48515 and previous config saved to /var/cache/conftool/dbconfig/20230525-091132-root.json [09:14:53] (03CR) 10Jbond: "thanks see inline" [puppet] - 10https://gerrit.wikimedia.org/r/922915 (https://phabricator.wikimedia.org/T248872) (owner: 10Jbond) [09:17:02] (03PS3) 10Jbond: admin: Re-enable hghani [puppet] - 10https://gerrit.wikimedia.org/r/922799 (https://phabricator.wikimedia.org/T322145) [09:17:35] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:17:35] PROBLEM - Router interfaces on 
cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:17:46] (03CR) 10CI reject: [V: 04-1] admin: Re-enable hghani [puppet] - 10https://gerrit.wikimedia.org/r/922799 (https://phabricator.wikimedia.org/T322145) (owner: 10Jbond) [09:19:06] (03PS4) 10Jbond: admin: Re-enable hghani [puppet] - 10https://gerrit.wikimedia.org/r/922799 (https://phabricator.wikimedia.org/T322145) [09:19:14] (03PS1) 10Jelto: service::catalog add miscweb 15 and annual to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/923263 (https://phabricator.wikimedia.org/T300171) [09:19:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48516 and previous config saved to /var/cache/conftool/dbconfig/20230525-091922-root.json [09:19:23] (03PS1) 10Marostegui: db2172: Remove candidate master [puppet] - 10https://gerrit.wikimedia.org/r/923265 [09:19:36] (03CR) 10CI reject: [V: 04-1] service::catalog add miscweb 15 and annual to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/923263 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [09:19:50] (03CR) 10CI reject: [V: 04-1] admin: Re-enable hghani [puppet] - 10https://gerrit.wikimedia.org/r/922799 (https://phabricator.wikimedia.org/T322145) (owner: 10Jbond) [09:21:05] (03CR) 10ArielGlenn: [C: 03+2] add documentation on commands to run for testing new dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [09:21:14] (03CR) 10Marostegui: [C: 03+2] db2172: Remove candidate master [puppet] - 10https://gerrit.wikimedia.org/r/923265 (owner: 10Marostegui) [09:21:28] (03CR) 10EoghanGaffney: [C: 03+2] doc: allow gitlab runners to publish docs only through `doc-gitlab` [puppet] - 
10https://gerrit.wikimedia.org/r/922834 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [09:21:37] apergos: good to merge your changes? [09:21:41] yes please [09:21:48] done! [09:21:51] ty! [09:22:00] (03PS3) 10Elukey: profile::kafka::{broker,mirror}: simplify dependencies [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248) [09:23:28] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41332/console" [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey) [09:23:30] (03PS5) 10Slyngshede: C:IDM Ensure service restart on git update [puppet] - 10https://gerrit.wikimedia.org/r/923245 [09:23:50] (03CR) 10CI reject: [V: 04-1] C:IDM Ensure service restart on git update [puppet] - 10https://gerrit.wikimedia.org/r/923245 (owner: 10Slyngshede) [09:24:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48517 and previous config saved to /var/cache/conftool/dbconfig/20230525-092413-root.json [09:24:24] (03PS2) 10Jelto: service::catalog add miscweb 15 and annual to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/923263 (https://phabricator.wikimedia.org/T300171) [09:24:53] (03PS6) 10Slyngshede: C:IDM Ensure service restart on git update [puppet] - 10https://gerrit.wikimedia.org/r/923245 [09:25:35] (03PS1) 10ArielGlenn: Dumps: move the nfs share test conf to the right location [puppet] - 10https://gerrit.wikimedia.org/r/923289 (https://phabricator.wikimedia.org/T325232) [09:26:02] (03PS3) 10Giuseppe Lavagetto: profile::configmaster: dump a json data structure of the pools [puppet] - 10https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705) [09:26:12] (03CR) 10Giuseppe Lavagetto: profile::configmaster: dump a json data structure of the pools (033 
comments) [puppet] - 10https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705) (owner: 10Giuseppe Lavagetto) [09:26:14] (03CR) 10Slyngshede: "I think I got it :-)" [puppet] - 10https://gerrit.wikimedia.org/r/923245 (owner: 10Slyngshede) [09:27:11] RECOVERY - MariaDB Replica Lag: s7 on clouddb1018 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:27:11] RECOVERY - MariaDB Replica Lag: s7 on clouddb1014 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:27:39] RECOVERY - MariaDB Replica Lag: s7 on clouddb1021 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:28:34] (03CR) 10CI reject: [V: 04-1] profile::configmaster: dump a json data structure of the pools [puppet] - 10https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705) (owner: 10Giuseppe Lavagetto) [09:29:15] (03PS1) 10Marostegui: db1161: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/923290 [09:32:22] 10SRE, 10Traffic, 10envoy, 10serviceops: Remove tls_minimum_protocol_version from envoy config - https://phabricator.wikimedia.org/T337453 (10JMeybohm) >>! In T337453#8879233, @Joe wrote: > It would be great if envoy fixed the TLS 1.3 to work well when two envoys talk to each other - we should check if tha... 
[09:32:56] !log running from dumpsdata1004 via ariel login screen session, as root, rsync with bwlimit 100000 to dumpsdata1006, copying all public xml dumps data [09:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:59] (03CR) 10Jbond: gitlab: use sshkey for git-ssh public keys (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto) [09:34:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48518 and previous config saved to /var/cache/conftool/dbconfig/20230525-093426-root.json [09:35:22] (03PS5) 10Jbond: admin: Re-enable hghani [puppet] - 10https://gerrit.wikimedia.org/r/922799 (https://phabricator.wikimedia.org/T322145) [09:37:12] (03CR) 10Ayounsi: [C: 03+1] Add class-of-service parent interface shaper for sub-rated services (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/922603 (https://phabricator.wikimedia.org/T337220) (owner: 10Cathal Mooney) [09:39:09] (03PS3) 10Jelto: service::catalog add miscweb 15 and annual to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/923263 (https://phabricator.wikimedia.org/T300171) [09:39:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 2%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48519 and previous config saved to /var/cache/conftool/dbconfig/20230525-093918-root.json [09:40:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:41:55] (03CR) 10Jbond: [C: 03+2] admin: Re-enable hghani [puppet] - 10https://gerrit.wikimedia.org/r/922799 
(https://phabricator.wikimedia.org/T322145) (owner: 10Jbond) [09:43:02] (03PS1) 10KartikMistry: Update cxserver to 2023-05-25-093623-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/923291 (https://phabricator.wikimedia.org/T331201) [09:44:26] Is it OK to deploy fix for cxserver ^ marostegui [09:44:37] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: DHCP traffic to install server is missing - https://phabricator.wikimedia.org/T337345 (10cmooney) >>! In T337345#8878207, @Jclark-ctr wrote: > @ayounsi the provisioning script is still failing in row e/f. dbproxy1026 dbproxy1027 I tested there... [09:44:44] kart_: yep [09:44:52] (03PS1) 10AikoChou: ml-services: update docker images for outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/923292 (https://phabricator.wikimedia.org/T328899) [09:45:02] Thanks! [09:45:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:45:35] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-05-25-093623-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/923291 (https://phabricator.wikimedia.org/T331201) (owner: 10KartikMistry) [09:46:27] (03Merged) 10jenkins-bot: Update cxserver to 2023-05-25-093623-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/923291 (https://phabricator.wikimedia.org/T331201) (owner: 10KartikMistry) [09:47:55] (03CR) 10Clément Goubert: api-gateway: Switch to mw-api-int-async on k8s (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) (owner: 10Clément Goubert) [09:48:11] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [09:48:21] (03PS1) 10Jbond: admin: add email 
for hghani [puppet] - 10https://gerrit.wikimedia.org/r/923293 [09:48:31] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [09:48:39] (03CR) 10Jbond: [V: 03+2 C: 03+2] admin: add email for hghani [puppet] - 10https://gerrit.wikimedia.org/r/923293 (owner: 10Jbond) [09:49:02] (03PS8) 10Clément Goubert: api-gateway: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) [09:49:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48520 and previous config saved to /var/cache/conftool/dbconfig/20230525-094931-root.json [09:50:15] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10jbond) 05Open→03Resolved a:05CDanis→03jbond Access has now been configured and you should have received an email regarding K... 
[09:51:25] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [09:51:59] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [09:52:47] (03CR) 10Marostegui: [C: 03+2] db1161: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/923290 (owner: 10Marostegui) [09:53:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48521 and previous config saved to /var/cache/conftool/dbconfig/20230525-095341-root.json [09:54:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48522 and previous config saved to /var/cache/conftool/dbconfig/20230525-095423-root.json [09:56:58] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [09:57:25] (03CR) 10Jbond: [C: 03+1] "lgtm some minor nits" [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey) [09:57:35] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [09:58:14] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/923245 (owner: 10Slyngshede) [10:00:06] mvolz: Dear deployers, time to do the Services – Citoid / Zotero deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T1000). 
[10:00:06] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T1000) [10:00:09] !log Updated cxserver to 2023-05-25-093623-production (config: language pairs transform fix + T331201) [10:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:16] T331201: Extract cxserver configuration and export to CSV - https://phabricator.wikimedia.org/T331201 [10:00:19] (03PS4) 10Elukey: profile::kafka::{broker,mirror}: simplify dependencies [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248) [10:00:44] (03CR) 10Elukey: "Thanks for the review John! Fixed the nits!" [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey) [10:01:37] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41333/console" [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey) [10:01:49] (03PS2) 10EoghanGaffney: Changes from hard-coded list of hosts in doc module [puppet] - 10https://gerrit.wikimedia.org/r/921244 [10:03:16] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41334/console" [puppet] - 10https://gerrit.wikimedia.org/r/921244 (owner: 10EoghanGaffney) [10:04:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48523 and previous config saved to /var/cache/conftool/dbconfig/20230525-100436-root.json [10:08:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48524 and previous config saved to /var/cache/conftool/dbconfig/20230525-100846-root.json [10:09:30] !log marostegui@cumin1001 
dbctl commit (dc=all): 'db2179 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48525 and previous config saved to /var/cache/conftool/dbconfig/20230525-100927-root.json [10:16:41] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:16:41] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:19:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48526 and previous config saved to /var/cache/conftool/dbconfig/20230525-101940-root.json [10:20:56] 10SRE, 10serviceops, 10Datacenter-Switchover: Investigate failed maintenance jobs discovered during DC switchback - https://phabricator.wikimedia.org/T335409 (10Clement_Goubert) p:05Triage→03Medium [10:23:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48527 and previous config saved to /var/cache/conftool/dbconfig/20230525-102351-root.json [10:24:25] (03CR) 10Abijeet Patro: [C: 04-1] ttm: use new config option to separate readable and writable services (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922481 (https://phabricator.wikimedia.org/T322284) (owner: 10DCausse) [10:24:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48528 and previous config saved to /var/cache/conftool/dbconfig/20230525-102434-root.json [10:24:47] !log aborrero@cumin2002 START - Cookbook sre.hosts.decommission for hosts 
cloudcontrol2005-dev.wikimedia.org [10:28:06] (03PS3) 10EoghanGaffney: Send nginx and docker-registry logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/919351 (https://phabricator.wikimedia.org/T322579) [10:28:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:32:00] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey) [10:32:48] !log aborrero@cumin2002 START - Cookbook sre.dns.netbox [10:33:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:34:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48529 and previous config saved to /var/cache/conftool/dbconfig/20230525-103445-root.json [10:35:54] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::kafka::{broker,mirror}: simplify dependencies [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey) [10:36:21] (03CR) 10Klausman: ml-services: update docker images for outlink (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/923292 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [10:38:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48530 and previous config saved to /var/cache/conftool/dbconfig/20230525-103855-root.json 
[10:39:01] !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol2005-dev.wikimedia.org decommissioned, removing all IPs except the asset tag one - aborrero@cumin2002" [10:39:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48531 and previous config saved to /var/cache/conftool/dbconfig/20230525-103939-root.json [10:39:41] (03CR) 10Hnowlan: [C: 03+1] helmfile.d: Fix regex in api-gateway's config for revertrisk [deployment-charts] - 10https://gerrit.wikimedia.org/r/922805 (https://phabricator.wikimedia.org/T337378) (owner: 10Klausman) [10:41:44] (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol2005-dev: move to the new network setup [puppet] - 10https://gerrit.wikimedia.org/r/923301 (https://phabricator.wikimedia.org/T336564) [10:41:53] !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol2005-dev.wikimedia.org decommissioned, removing all IPs except the asset tag one - aborrero@cumin2002" [10:41:53] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:41:54] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol2005-dev.wikimedia.org [10:44:37] (03CR) 10Elukey: [V: 03+1 C: 03+2] "The only side effect that I observed was:" [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey) [10:45:32] 10SRE, 10Data-Engineering, 10Shared-Data-Infrastructure, 10serviceops, 10Patch-For-Review: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 (10elukey) 05Open→03Resolved a:03elukey [10:46:16] (03CR) 10Klausman: [C: 03+2] helmfile.d: Fix regex in api-gateway's config for revertrisk 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/922805 (https://phabricator.wikimedia.org/T337378) (owner: 10Klausman) [10:48:08] 10ops-codfw, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2005-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336564 (10aborrero) a:05aborrero→03Jhancock.wm Please @Jhancock.wm update the physical network connection of this server from `asw-b1-codfw (WMF59... [10:48:39] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Thanks for working on this <3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/922480 (https://phabricator.wikimedia.org/T325071) (owner: 10Clément Goubert) [10:48:46] !log klausman@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply [10:49:16] !log klausman@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply [10:49:27] !log klausman@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [10:49:30] (03CR) 10EoghanGaffney: [C: 03+2] Send nginx and docker-registry logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/919351 (https://phabricator.wikimedia.org/T322579) (owner: 10EoghanGaffney) [10:49:47] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Change naming scheme for resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/922480 (https://phabricator.wikimedia.org/T325071) (owner: 10Clément Goubert) [10:50:55] (03Merged) 10jenkins-bot: mediawiki: Change naming scheme for resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/922480 (https://phabricator.wikimedia.org/T325071) (owner: 10Clément Goubert) [10:51:23] (03PS2) 10Arturo Borrero Gonzalez: cloudcontrol2005-dev: move to the new network setup [puppet] - 10https://gerrit.wikimedia.org/r/923301 (https://phabricator.wikimedia.org/T336564) [10:52:35] (03CR) 10Arturo Borrero Gonzalez: "We need this patch to reimage cloudcontrol2005-dev into the new network setup." 
[puppet] - 10https://gerrit.wikimedia.org/r/923301 (https://phabricator.wikimedia.org/T336564) (owner: 10Arturo Borrero Gonzalez) [10:52:50] !log klausman@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [10:53:19] !log klausman@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [10:54:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48532 and previous config saved to /var/cache/conftool/dbconfig/20230525-105400-root.json [10:54:11] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: sync [10:54:18] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: sync [10:54:32] (03PS1) 10JMeybohm: modules.mesh.configuration: Copy 1.2.1 to 1.2.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/923303 [10:54:32] !log klausman@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [10:54:34] (03PS1) 10JMeybohm: mesh.configuration: Add type URL to http and listener filters [deployment-charts] - 10https://gerrit.wikimedia.org/r/923304 (https://phabricator.wikimedia.org/T337405) [10:54:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48533 and previous config saved to /var/cache/conftool/dbconfig/20230525-105443-root.json [10:54:51] !log klausman@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [10:56:57] (03Abandoned) 10Jcrespo: bacula: Reschedule run of es backups codfw -> eqiad [puppet] - 10https://gerrit.wikimedia.org/r/886837 (owner: 10Jcrespo) [10:59:00] (03PS1) 10Daimona Eaytoy: Set $wgCampaignEventsUseNewTrackingToolsSchema to true in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923305 (https://phabricator.wikimedia.org/T336364) [10:59:13] (03PS1) 10Clément Goubert: mediawiki: Bump version to 0.4.14 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/923306 (https://phabricator.wikimedia.org/T325071) [10:59:17] (03CR) 10Daimona Eaytoy: [C: 04-1] "Blocked on T336365" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923305 (https://phabricator.wikimedia.org/T336364) (owner: 10Daimona Eaytoy) [11:00:39] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Bump version to 0.4.14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/923306 (https://phabricator.wikimedia.org/T325071) (owner: 10Clément Goubert) [11:01:33] (03Merged) 10jenkins-bot: mediawiki: Bump version to 0.4.14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/923306 (https://phabricator.wikimedia.org/T325071) (owner: 10Clément Goubert) [11:03:25] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:03:43] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:04:42] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:05:06] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:05:23] (03CR) 10Btullis: [C: 03+2] Add an extra property 'CollectMode' to each user's jupyter service [puppet] - 10https://gerrit.wikimedia.org/r/921382 (https://phabricator.wikimedia.org/T336951) (owner: 10Btullis) [11:09:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48534 and previous config saved to /var/cache/conftool/dbconfig/20230525-110905-root.json [11:09:43] !log upload udplog_1.10_amd64.deb [11:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48535 and previous config saved to /var/cache/conftool/dbconfig/20230525-110948-root.json [11:11:03] (03CR) 10Jbond: [V: 03+1 
C: 03+2] udp2log: update to take account of systemd updates [puppet] - 10https://gerrit.wikimedia.org/r/922867 (https://phabricator.wikimedia.org/T276623) (owner: 10Jbond) [11:15:28] !log update udplog on mwlog server [11:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:49] (03PS1) 10Btullis: Revert "Add an extra property 'CollectMode' to each user's jupyter service" [puppet] - 10https://gerrit.wikimedia.org/r/923271 [11:16:27] (03CR) 10Btullis: [C: 03+2] Revert "Add an extra property 'CollectMode' to each user's jupyter service" [puppet] - 10https://gerrit.wikimedia.org/r/923271 (owner: 10Btullis) [11:20:56] !log cgoubert@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [11:21:13] !log cgoubert@deploy1002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync [11:22:21] !log cgoubert@deploy1002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [11:22:37] !log cgoubert@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [11:24:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48536 and previous config saved to /var/cache/conftool/dbconfig/20230525-112409-root.json [11:25:05] !log cgoubert@deploy1002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [11:25:18] !log cgoubert@deploy1002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [11:25:36] !log cgoubert@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [11:25:53] !log cgoubert@deploy1002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [11:26:06] (03PS1) 10Jbond: puppetmaster::common: fix lint errors and docs [puppet] - 10https://gerrit.wikimedia.org/r/923322 (https://phabricator.wikimedia.org/T330490) [11:26:44] !log cgoubert@deploy1002 helmfile [codfw] START 
helmfile.d/services/mw-api-ext: apply [11:26:58] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [11:27:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:27:56] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [11:28:08] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-api-ext_4447: Servers kubernetes2007.codfw.wmnet, kubernetes2020.codfw.wmnet, kubernetes2018.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:28:19] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [11:30:14] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [11:30:30] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [11:31:41] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [11:31:57] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [11:32:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:32:50] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 33): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41335/console" [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [11:32:59] 
Checking why PyBal is seeing them down [11:34:26] curls working for me [11:36:45] It's not logging a recovery because there's a lingering warning for schema_443 [11:38:09] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [11:38:26] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [11:38:54] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [11:39:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48537 and previous config saved to /var/cache/conftool/dbconfig/20230525-113914-root.json [11:39:26] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [11:39:30] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqsin.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [11:39:30] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqsin.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [11:39:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 23): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41336/console" [puppet] - 10https://gerrit.wikimedia.org/r/923322 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [11:40:10] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [11:40:24] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [11:40:38] got the page and I am out at lunch [11:40:44] looking [11:40:51] can grab the laptop if needed tho [11:40:57] thank you jayme [11:41:05] (03CR) 10Hoo man: [C: 04-1] install_console: restrict options used [puppet] - 
10https://gerrit.wikimedia.org/r/922559 (https://phabricator.wikimedia.org/T117348) (owner: 10Jbond) [11:41:05] here if you need me jayme [11:41:46] !incidents [11:41:46] 3678 (ACKED) Host db2110 (paged) - PING - Packet loss = 100% [11:41:46] 3679 (UNACKED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqsin.wikimedia.org) [11:41:56] !ack 3679 [11:41:56] 3679 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqsin.wikimedia.org) [11:43:10] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [11:43:20] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-web_4450: Servers kubernetes1022.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1021.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1015.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:43:33] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [11:44:30] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqsin.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [11:44:30] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqsin.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [11:45:35] mhh we are back? 
I will go back to lunch and page me when assistance is needed [11:46:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:48:28] godog: librenms says traffic is dropping again, yes [11:48:46] (03PS1) 10Arturo Borrero Gonzalez: cloud_private: route the whole cloud public IPv4 space to cloudsw [puppet] - 10https://gerrit.wikimedia.org/r/923324 (https://phabricator.wikimedia.org/T336963) [11:49:14] ack! thanks jayme [11:49:25] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [11:49:41] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [11:51:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:51:59] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [11:52:20] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [11:52:27] (03PS1) 10Jbond: puppetdb: Add support for submit_only_server_urls [puppet] - 10https://gerrit.wikimedia.org/r/923325 (https://phabricator.wikimedia.org/T330490) [11:53:38] (03PS2) 10Arturo Borrero Gonzalez: cloud_private: route the whole cloud public IPv4 space to cloudsw [puppet] - 10https://gerrit.wikimedia.org/r/923324 (https://phabricator.wikimedia.org/T336963) [11:54:27] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [11:54:55] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [11:56:31] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply 
[11:56:38] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 27): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41337/console" [puppet] - 10https://gerrit.wikimedia.org/r/923325 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [11:56:44] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-api-int_4446: Servers kubernetes1022.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1018.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1011.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:56:49] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [11:57:29] there was a spike in requests from AS55839 (Jio) to upload [11:57:52] mainly from android UAs, no referer [11:58:03] 10ops-codfw, 10DBA: db2110 crashed - https://phabricator.wikimedia.org/T337445 (10Marostegui) @Papaul @wiki_willy this server is out of warranty right? I don't know if there's much we can do about ` 2023-05-25 05:16:13 SYS1003 System CPU Resetting. ` [11:59:06] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 4 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " (due to missing physical file for old image e... 
- https://phabricator.wikimedia.org/T244567 [12:02:18] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10jnuche) [12:02:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:04:28] (03PS2) 10AikoChou: ml-services: update docker images for outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/923292 (https://phabricator.wikimedia.org/T328899) [12:06:08] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 4 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " (due to missing physical file for old image e... - https://phabricator.wikimedia.org/T244567 [12:06:41] (03CR) 10Cathal Mooney: "Looks good to me overall, but we should refactor to make 'supernetpub' an array and include 185.15.56.0/24." 
[puppet] - 10https://gerrit.wikimedia.org/r/923324 (https://phabricator.wikimedia.org/T336963) (owner: 10Arturo Borrero Gonzalez) [12:06:49] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:10:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool sanitarium masters for s1, s5, s2, s7', diff saved to https://phabricator.wikimedia.org/P48538 and previous config saved to /var/cache/conftool/dbconfig/20230525-121012-root.json [12:10:31] (03CR) 10Arturo Borrero Gonzalez: cloud_private: route the whole cloud public IPv4 space to cloudsw (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/923324 (https://phabricator.wikimedia.org/T336963) (owner: 10Arturo Borrero Gonzalez) [12:11:26] (03PS2) 10Jbond: install_console: restrict options used [puppet] - 10https://gerrit.wikimedia.org/r/922559 (https://phabricator.wikimedia.org/T117348) [12:11:59] (03CR) 10Jbond: "updated to add hostname validation" [puppet] - 10https://gerrit.wikimedia.org/r/922559 (https://phabricator.wikimedia.org/T117348) (owner: 10Jbond) [12:12:39] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10jnuche) In a project's `.gitlab-ci.yml`, it is now possible to publish documentation and test coverage results to doc.wikimedia.org using [[ https://... [12:16:44] (03CR) 10Cathal Mooney: "Some comments back inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/923324 (https://phabricator.wikimedia.org/T336963) (owner: 10Arturo Borrero Gonzalez) [12:18:22] (03CR) 10Ottomata: mw-page-content-change-enrich: enable checkpointing (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [12:19:12] PROBLEM - Host releases1003 is DOWN: PING CRITICAL - Packet loss = 100% [12:19:32] (03PS11) 10Jelto: gitlab: use sshkey for git-ssh public keys [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) [12:19:45] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 4 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " (due to missing physical file for old image e... - https://phabricator.wikimedia.org/T244567 [12:20:04] (03CR) 10Ottomata: [C: 03+1] profile::kafka::{broker,mirror}: simplify dependencies [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey) [12:21:19] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41338/console" [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto) [12:23:40] (03CR) 10Jelto: [V: 03+1] gitlab: use sshkey for git-ssh public keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto) [12:24:30] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqsin.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [12:24:30] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqsin.wikimedia.org - Primary outbound port 
utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [12:24:38] !incidents [12:24:39] 3680 (UNACKED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqsin.wikimedia.org) [12:24:39] 3678 (RESOLVED) Host db2110 (paged) - PING - Packet loss = 100% [12:24:40] 3679 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqsin.wikimedia.org) [12:24:48] RECOVERY - Host releases1003 is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [12:24:52] !ack 3680 [12:24:52] 3680 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqsin.wikimedia.org) [12:25:40] !incidents [12:25:40] (03CR) 10David Caro: maintain-dbusers: ensure get_global_wiki_user is only called when needed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe) [12:25:41] 3680 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqsin.wikimedia.org) [12:25:41] 3678 (RESOLVED) Host db2110 (paged) - PING - Packet loss = 100% [12:25:41] 3679 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqsin.wikimedia.org) [12:26:39] godog: same, same [12:26:53] I'm going to craft a requestctl rule to throttle them [12:28:48] (03CR) 10Jelto: "This change is mostly to get probes for the new kubernetes services annual.wikimedia.org and 15.wikipedia.org. 
But I'm not sure if we need" [puppet] - 10https://gerrit.wikimedia.org/r/923263 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [12:33:52] (03CR) 10AikoChou: ml-services: update docker images for outlink (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/923292 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [12:35:41] (03CR) 10Elukey: [C: 03+1] Update apiVersion to be compatible with k8s 1.27.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/922828 (owner: 10JMeybohm) [12:37:12] (03CR) 10Elukey: [C: 03+1] Stop validating against k8s 1.16, add validation against 1.27.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/922829 (owner: 10JMeybohm) [12:39:30] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqsin.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [12:39:30] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqsin.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [12:39:39] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto) [12:41:19] (03CR) 10Volans: [C: 04-1] "I don't think the current implementation would work" [puppet] - 10https://gerrit.wikimedia.org/r/922559 (https://phabricator.wikimedia.org/T117348) (owner: 10Jbond) [12:43:02] (03PS1) 10Bartosz Dziewoński: Handle 'prefix' when 'action=edit', even if another extension overrides action [extensions/InputBox] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/923273 (https://phabricator.wikimedia.org/T337436) [12:43:15] (03PS1) 10Bartosz Dziewoński: Handle 'prefix' when 'action=edit', even if another extension overrides action [extensions/InputBox] 
(wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923274 (https://phabricator.wikimedia.org/T337436) [12:51:36] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetdb: Add support for submit_only_server_urls [puppet] - 10https://gerrit.wikimedia.org/r/923325 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [12:51:39] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetmaster::common: fix lint errors and docs [puppet] - 10https://gerrit.wikimedia.org/r/923322 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [12:57:31] (03CR) 10Ottomata: mw-page-content-change-enrich: enable checkpointing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [12:58:32] jayme: ok! happy to review [13:00:06] (03PS1) 10Jbond: puppetmnaster::frontend: configure puppetmaster2004 as a canary [puppet] - 10https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490) [13:00:06] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T1300) [13:00:07] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T1300). [13:00:07] matthiasmullie, kart_, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:12] o/ [13:00:18] 0/ [13:00:34] o/ I'm only available for the first ~45 minutes [13:00:39] hi [13:00:40] I'm about to go into a meeting with my manager, sorry! D: [13:00:44] enjoy! [13:01:02] i suggest i start with MatmaRex's patches and then hand it over to kart_ / matthiasmullie for self-deployment of their patches if that's fine? [13:01:10] I can deploy my patches [13:01:12] sure [13:01:17] okay, starting! 
[13:01:30] (03CR) 10Urbanecm: [C: 03+2] Handle 'prefix' when 'action=edit', even if another extension overrides action [extensions/InputBox] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/923273 (https://phabricator.wikimedia.org/T337436) (owner: 10Bartosz Dziewoński) [13:01:32] (03CR) 10Urbanecm: [C: 03+2] Handle 'prefix' when 'action=edit', even if another extension overrides action [extensions/InputBox] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923274 (https://phabricator.wikimedia.org/T337436) (owner: 10Bartosz Dziewoński) [13:04:31] matthiasmullie: fyi, squashing patches is not actually needed to make patches go out together/save space (multiple patches can be deployed simultaneously; on the deployment server, you can do that by `scap backport change1 change2 change3 ...`). no issues with that of course, just letting you know! [13:05:42] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/InputBox] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/923273 (https://phabricator.wikimedia.org/T337436) (owner: 10Bartosz Dziewoński) [13:05:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/InputBox] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923274 (https://phabricator.wikimedia.org/T337436) (owner: 10Bartosz Dziewoński) [13:05:52] PROBLEM - Docker registry HTTPS interface on registry1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [13:06:48] urbanecm: good to know; but I suppose they'd still all be queued in CI? [13:07:16] RECOVERY - Docker registry HTTPS interface on registry1003 is OK: HTTP OK: HTTP/1.1 200 OK - 3754 bytes in 0.276 second response time https://wikitech.wikimedia.org/wiki/Docker [13:07:22] well, yes, but AFAICS CI is usually able to process a few patches at once.
[13:09:13] (03CR) 10Ottomata: mw-page-content-change-enrich: enable checkpointing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [13:10:16] yeah 2 parallel should not be an issue :p thanks for the headsup [13:10:46] 10SRE, 10ops-codfw, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2005-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336564 (10cmooney) >>! In T336564#8879530, @aborrero wrote: > Please @Jhancock.wm update the physical network connection of this server from... [13:11:58] (03CR) 10Klausman: [C: 03+1] ml-services: update docker images for outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/923292 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [13:12:08] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41339/console" [puppet] - 10https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [13:12:22] (03PS4) 10Ottomata: flink-operator - deploy in wikikube eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/922874 (https://phabricator.wikimedia.org/T333464) [13:13:03] (03CR) 10Ladsgroup: [C: 03+2] Add add_user_is_temp_T336886.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/920771 (https://phabricator.wikimedia.org/T336886) (owner: 10Ladsgroup) [13:13:33] (03Merged) 10jenkins-bot: Add add_user_is_temp_T336886.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/920771 (https://phabricator.wikimedia.org/T336886) (owner: 10Ladsgroup) [13:14:31] urbanecm: so, I can also deploy two patches together. This is cool! [13:14:48] yup, you can! even if they're in multiple release branches. [13:15:06] Super. Noted. 
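Note on the multi-patch backport discussed above: a minimal sketch, assuming only what the chat states, namely that `scap backport` accepts several changes in one invocation. The helper name `build_backport_command` is hypothetical, and the change numbers are the two InputBox backports from this log. It only assembles the argument list and does not run anything.

```python
from typing import Iterable, List


def build_backport_command(changes: Iterable[int]) -> List[str]:
    """Assemble one `scap backport` invocation covering several Gerrit changes.

    Hypothetical helper: it only builds the argument list, mirroring the
    `scap backport change1 change2 change3 ...` form quoted in the chat.
    """
    # Normalise every change number to its decimal string form.
    change_ids = [str(int(change)) for change in changes]
    if not change_ids:
        raise ValueError("at least one change number is required")
    return ["scap", "backport", *change_ids]


# The two InputBox backports from this log, one per release branch.
command = build_backport_command([923273, 923274])
print(" ".join(command))
```

The printed line has the same shape as the command quoted in the chat, with both changes passed to a single invocation rather than squashed into one patch.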
[13:18:32] (03Merged) 10jenkins-bot: Handle 'prefix' when 'action=edit', even if another extension overrides action [extensions/InputBox] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/923273 (https://phabricator.wikimedia.org/T337436) (owner: 10Bartosz Dziewoński) [13:18:35] (03Merged) 10jenkins-bot: Handle 'prefix' when 'action=edit', even if another extension overrides action [extensions/InputBox] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923274 (https://phabricator.wikimedia.org/T337436) (owner: 10Bartosz Dziewoński) [13:19:06] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:923273|Handle 'prefix' when 'action=edit', even if another extension overrides action (T337436)]], [[gerrit:923274|Handle 'prefix' when 'action=edit', even if another extension overrides action (T337436)]] [13:19:11] T337436: InputBox 'prefix' is ignored when ArticleCreationWorkflow takes over the page - https://phabricator.wikimedia.org/T337436 [13:19:20] (03CR) 10Urbanecm: [C: 03+2] Change maint script to do work via jobs [extensions/ImageSuggestions] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923252 (owner: 10Matthias Mullie) [13:20:38] !log urbanecm@deploy1002 urbanecm and matmarex: Backport for [[gerrit:923273|Handle 'prefix' when 'action=edit', even if another extension overrides action (T337436)]], [[gerrit:923274|Handle 'prefix' when 'action=edit', even if another extension overrides action (T337436)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [13:20:51] MatmaRex: your patch is at mwdebug1002, can you have a look please? [13:21:33] yup. works fine now at https://en.wikipedia.org/w/index.php?title=Wikipedia:Article_wizard/CreateDraft&oldid=1116388402 [13:22:03] awesome, proceeding. 
[13:24:12] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbproxy1026.mgmt.eqiad.wmnet with reboot policy FORCED [13:24:33] mine can skip mwdebug & go right ahead (only affects currently inactive maint script, only on wikis where wmf.10 is not yet live); shall we also merge kart_ patches already? [13:26:03] acknowledged [13:26:05] (03CR) 10Urbanecm: [C: 03+2] Show Contribute menu item in main menu when Special:Contribute is enabled [skins/MinervaNeue] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/923268 (https://phabricator.wikimedia.org/T336838) (owner: 10KartikMistry) [13:26:09] (03CR) 10Urbanecm: [C: 03+2] Show Contribute menu item in main menu when Special:Contribute is enabled [skins/MinervaNeue] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923269 (https://phabricator.wikimedia.org/T336838) (owner: 10KartikMistry) [13:28:13] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:923273|Handle 'prefix' when 'action=edit', even if another extension overrides action (T337436)]], [[gerrit:923274|Handle 'prefix' when 'action=edit', even if another extension overrides action (T337436)]] (duration: 09m 06s) [13:28:18] T337436: InputBox 'prefix' is ignored when ArticleCreationWorkflow takes over the page - https://phabricator.wikimedia.org/T337436 [13:28:20] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/ImageSuggestions] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923252 (owner: 10Matthias Mullie) [13:28:25] MatmaRex: your patch is live now. 
[13:28:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:28:40] thanks [13:29:10] kart_: I've +2'ed your patches, but I'll likely not have the time to deploy them, as I'll have to leave soon. i'll let you know once i'm done with matthias.mullie's patch! [13:29:12] np [13:29:24] (03CR) 10Jelto: [C: 03+2] "one addition: I think we could also explicitly set the ip to connect to the new kubernetes ingress. Setting the ip is also done here for e" [puppet] - 10https://gerrit.wikimedia.org/r/922918 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [13:30:21] urbanecm: sure [13:33:23] (03CR) 10JHathaway: puppet-merge: implement Lock out, tag out (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/922915 (https://phabricator.wikimedia.org/T248872) (owner: 10Jbond) [13:33:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:34:53] (03PS1) 10Jbond: puppetmaster: frontend [puppet] - 10https://gerrit.wikimedia.org/r/923341 (https://phabricator.wikimedia.org/T337107) [13:35:15] (03CR) 10JHathaway: puppetmaster: add new function to check for local files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922877 (https://phabricator.wikimedia.org/T268344) (owner: 10Jbond) [13:35:40] (03CR) 10CI reject: [V: 04-1] puppetmaster: frontend [puppet] - 10https://gerrit.wikimedia.org/r/923341 (https://phabricator.wikimedia.org/T337107) (owner: 10Jbond) [13:38:16] (03PS2) 10Jbond: puppetmnaster::frontend: configure puppetmaster2004 as a canary [puppet] - 10https://gerrit.wikimedia.org/r/923333 
(https://phabricator.wikimedia.org/T330490) [13:38:19] (03Merged) 10jenkins-bot: Change maint script to do work via jobs [extensions/ImageSuggestions] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923252 (owner: 10Matthias Mullie) [13:38:35] (03PS1) 10Jelto: miscweb: set ipv4 and ipv6 for 15 and annual blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/923342 (https://phabricator.wikimedia.org/T300171) [13:38:48] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:923252|Change maint script to do work via jobs]] [13:39:43] (03CR) 10Jelto: "possibly a intermediate solution beside removing the checks completely in I7c533a4308a84088a54911dd1ddfb913395766b0" [puppet] - 10https://gerrit.wikimedia.org/r/923342 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [13:40:49] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41340/console" [puppet] - 10https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [13:41:46] (03PS4) 10Giuseppe Lavagetto: profile::configmaster: dump a json data structure of the pools [puppet] - 10https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705) [13:44:03] (03CR) 10CI reject: [V: 04-1] profile::configmaster: dump a json data structure of the pools [puppet] - 10https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705) (owner: 10Giuseppe Lavagetto) [13:44:34] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host dbproxy1026.mgmt.eqiad.wmnet with reboot policy FORCED [13:44:48] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host dbproxy1027.mgmt.eqiad.wmnet with reboot policy FORCED [13:45:40] (03PS2) 10Jelto: miscweb: set ipv4 and ipv6 for 15 and annual blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/923342 (https://phabricator.wikimedia.org/T300171) [13:46:19] (03CR) 10Elukey: [C: 03+2] ml-services: 
update docker images for outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/923292 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [13:46:30] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:923252|Change maint script to do work via jobs]] (duration: 07m 42s) [13:46:33] matthiasmullie: your patch is live. kart_, please deploy your patches once they merge! [13:46:48] urbanecm: thanks man! [13:46:51] (03Merged) 10jenkins-bot: Show Contribute menu item in main menu when Special:Contribute is enabled [skins/MinervaNeue] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/923268 (https://phabricator.wikimedia.org/T336838) (owner: 10KartikMistry) [13:46:53] (03Merged) 10jenkins-bot: Show Contribute menu item in main menu when Special:Contribute is enabled [skins/MinervaNeue] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923269 (https://phabricator.wikimedia.org/T336838) (owner: 10KartikMistry) [13:47:02] any time [13:50:20] (03PS5) 10Giuseppe Lavagetto: profile::configmaster: dump a json data structure of the pools [puppet] - 10https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705) [13:50:48] urbanecm: Thanks! [13:50:52] Deploying.. 
[13:52:07] !log kartik@deploy1002 Started scap: Backport for [[gerrit:923268|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]], [[gerrit:923269|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]] [13:52:11] T336838: Avoid the Contributions menu to disappear on mobile web - https://phabricator.wikimedia.org/T336838 [13:53:01] (03PS2) 10Jbond: puppetmaster: frontend [puppet] - 10https://gerrit.wikimedia.org/r/923341 (https://phabricator.wikimedia.org/T337107) [13:53:03] (03PS3) 10Jbond: puppetmnaster::frontend: configure puppetmaster2004 as a canary [puppet] - 10https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490) [13:53:11] (03CR) 10JHathaway: install_console: restrict options used (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922559 (https://phabricator.wikimedia.org/T117348) (owner: 10Jbond) [13:53:32] (03CR) 10Giuseppe Lavagetto: profile::configmaster: dump a json data structure of the pools (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705) (owner: 10Giuseppe Lavagetto) [13:53:35] (03CR) 10CI reject: [V: 04-1] puppetmaster: frontend [puppet] - 10https://gerrit.wikimedia.org/r/923341 (https://phabricator.wikimedia.org/T337107) (owner: 10Jbond) [13:53:37] !log kartik@deploy1002 kartik: Backport for [[gerrit:923268|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]], [[gerrit:923269|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [13:53:38] (03PS6) 10Giuseppe Lavagetto: profile::configmaster: dump a json data structure of the pools [puppet] - 10https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705) [13:55:31] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41341/console" [puppet] - 10https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [13:56:39] (03CR) 10MVernon: [C: 03+1] cassandra: add support for version 4.1.1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [13:56:45] (03PS3) 10Jbond: puppetmaster: frontend [puppet] - 10https://gerrit.wikimedia.org/r/923341 (https://phabricator.wikimedia.org/T337107) [13:56:47] (03PS4) 10Jbond: puppetmnaster::frontend: configure puppetmaster2004 as a canary [puppet] - 10https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490) [13:57:15] (03Abandoned) 10Jbond: puppetmaster: frontend [puppet] - 10https://gerrit.wikimedia.org/r/923341 (https://phabricator.wikimedia.org/T337107) (owner: 10Jbond) [13:58:06] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::configmaster: dump a json data structure of the pools [puppet] - 10https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705) (owner: 10Giuseppe Lavagetto) [13:59:12] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41342/console" [puppet] - 10https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [13:59:34] (03PS1) 10Herron: mwlog: remove redis instance [puppet] - 10https://gerrit.wikimedia.org/r/923348 (https://phabricator.wikimedia.org/T327277) [13:59:50] (03PS1) 10BBlack: Ratelimit a hotlink saturation case [puppet] - 10https://gerrit.wikimedia.org/r/923349 [14:00:38] (03PS12) 10Gmodena: mw-page-content-change-enrich: enable checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) [14:00:54] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/923325 
(https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:00:57] (03CR) 10Volans: "post-merge thing to fix" [puppet] - 10https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705) (owner: 10Giuseppe Lavagetto) [14:03:12] (My patches are still being deployed, had issue with cache it seems so wasn't able to test it until trying hard) [14:03:37] kart_: no probs, let me know when they're done [14:03:59] and anyone else with deploys still running, as I want to lock scap after to do some LVS maintenance (T322937) [14:04:00] T322937: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 [14:04:34] topranks: sure. Few minutes probably.. [14:04:49] no rush [14:05:42] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [14:05:45] (03CR) 10Ottomata: mw-page-content-change-enrich: enable checkpointing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [14:06:12] (03PS2) 10BBlack: Ratelimit a hotlink saturation case [puppet] - 10https://gerrit.wikimedia.org/r/923349 [14:06:18] topranks: looks like it is failing due to some error. 
[14:06:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:06:39] `2023-05-25 14:06:26,265 [WARNING] Issues connecting to lvs1019:9090: HTTPConnectionPool(host='lvs1019', port=9090): Max retries exceeded with url: /pools/parsoid-php_443/parse1008.eqiad.wmnet (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 111] Connection refused'))` [14:06:41] (03CR) 10Ladsgroup: profile::configmaster: dump a json data structure of the pools (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705) (owner: 10Giuseppe Lavagetto) [14:06:46] (03PS4) 10Volans: sre.ganeti.makevm: refactor to simplify expansion [cookbooks] - 10https://gerrit.wikimedia.org/r/860080 (https://phabricator.wikimedia.org/T306661) [14:06:54] eh perhaps that was me sorry [14:07:15] (03CR) 10JMeybohm: [C: 03+1] Ratelimit a hotlink saturation case [puppet] - 10https://gerrit.wikimedia.org/r/923349 (owner: 10BBlack) [14:07:38] PROBLEM - pybal on lvs1019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:07:45] kart_: I'd disabled puppet and pybal in preparation on lvs1019, didn't think it would affect you though [14:07:51] re-started now so you can try again [14:07:52] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM (to my vcl-untrained eye anyways)" [puppet] - 10https://gerrit.wikimedia.org/r/923349 (owner: 10BBlack) [14:08:02] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1418.eqiad.wmnet, mw1417.eqiad.wmnet, mw1416.eqiad.wmnet, mw1415.eqiad.wmnet, mw1414.eqiad.wmnet are marked down but pooled: parsoid-php_443: Servers parse1017.eqiad.wmnet, 
parse1011.eqiad.wmnet are marked down but pooled: api-https_443: Servers mw1447.eqiad.wmnet, mw1448.eqiad.wmnet, mw1449.eqiad.wmnet, mw1450.eqiad.w [14:08:02] marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:08:03] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:923268|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]], [[gerrit:923269|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]] (duration: 15m 56s) [14:08:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [14:08:08] T336838: Avoid the Contributions menu to disappear on mobile web - https://phabricator.wikimedia.org/T336838 [14:08:15] (03CR) 10BBlack: [C: 03+2] Ratelimit a hotlink saturation case [puppet] - 10https://gerrit.wikimedia.org/r/923349 (owner: 10BBlack) [14:08:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [14:08:18] PROBLEM - Host releases2003 is DOWN: PING CRITICAL - Packet loss = 100% [14:08:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T336886)', diff saved to https://phabricator.wikimedia.org/P48544 and previous config saved to /var/cache/conftool/dbconfig/20230525-140822-ladsgroup.json [14:08:28] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [14:08:34] topranks: yeah, restarting it. [14:08:38] !log volans@cumin1001 START - Cookbook sre.puppetboard.restart-reboot rolling restart_daemons on P{puppetboard2002.codfw.wmnet} and (A:puppetboard) [14:08:59] !log volans@cumin1001 START - Cookbook sre.dns.wipe-cache puppetboard.discovery.wmnet. on all recursors [14:09:02] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetboard.discovery.wmnet. 
on all recursors [14:09:06] `14:08:03 66 hosts had failures restarting php-fpm` [14:09:06] `14:08:03 58 hosts had failures restarting php-fpm` [14:09:06] `14:08:03 24 hosts had failures restarting php-fpm` [14:09:07] (ProbeDown) firing: Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#parsoid-php:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:09:12] RECOVERY - pybal on lvs1019 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:09:34] I'm re-running scap backport [14:09:44] paged for parsoid, I'm assuming that's related to the ongoing deployment ? [14:09:51] what's going on with that and the pybal parsiod-php thing? [14:09:53] !log volans@cumin1001 END (PASS) - Cookbook sre.puppetboard.restart-reboot (exit_code=0) rolling restart_daemons on P{puppetboard2002.codfw.wmnet} and (A:puppetboard) [14:10:04] !log kartik@deploy1002 Started scap: Backport for [[gerrit:923268|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]], [[gerrit:923269|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]] [14:10:06] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim [14:10:06] /wiki/Services/Monitoring/restbase [14:10:06] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by 
title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim [14:10:06] /wiki/Services/Monitoring/restbase [14:10:06] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [14:10:07] (ProbeDown) firing: Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#parsoid-php:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:10:08] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim [14:10:08] /wiki/Services/Monitoring/restbase [14:10:12] godog: no. Something else. 
[14:10:16] ok caught up a bit, I get the pybal part now, ignore that [14:10:17] bblack: looks like some coordination for work on lvs1019 [14:10:22] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1418.eqiad.wmnet, mw1417.eqiad.wmnet, mw1416.eqiad.wmnet, mw1415.eqiad.wmnet, mw1414.eqiad.wmnet are marked down but pooled: api-https_443: Servers mw1447.eqiad.wmnet, mw1448.eqiad.wmnet, mw1449.eqiad.wmnet, mw1450.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:10:24] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.or [14:10:24] ESTBase [14:10:25] kart_: ack, thank you [14:10:32] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim [14:10:32] /wiki/Services/Monitoring/restbase [14:10:32] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 
200) https://wikitech.wikim [14:10:32] /wiki/Services/Monitoring/restbase [14:10:34] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim [14:10:34] /wiki/Services/Monitoring/restbase [14:10:36] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim [14:10:36] /wiki/Services/Monitoring/restbase [14:10:36] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:10:36] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:10:38] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): 
/en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim [14:10:38] /wiki/Services/Monitoring/restbase [14:10:40] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim [14:10:40] /wiki/Services/Monitoring/restbase [14:10:40] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim [14:10:40] /wiki/Services/Monitoring/restbase [14:10:40] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech. 
[14:10:40] a.org/wiki/RESTBase [14:10:43] bblack: I may have messed up, had tried to disable PyBal on lvs1019, expecting failover to lvs1020 [14:10:44] (03PS5) 10Jbond: puppetmnaster::frontend: configure puppetmaster2004 as a canary [puppet] - 10https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490) [14:10:46] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) is CRITICAL: Test Get site-specific CSS returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/description/{title} (Get description for test page) timed out before a response was received: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was rece [14:10:46] domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) timed [14:10:46] re a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:10:52] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/description/{title} (Get description for test page) is CRITICAL: Test Get description for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/media-list/{title} (Get media list from test page) is CRITICAL: Test Get media list from test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-ht [14:10:52] 
e} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test [14:10:52] urned the unexpected status 503 (expecting: 200): /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:11:02] PROBLEM - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.or [14:11:02] ESTBase [14:11:04] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim [14:11:04] /wiki/Services/Monitoring/restbase [14:11:07] topranks: yeah there's an outstanding problem on the scap side that requires us to not do LVS maintenance during deploys :/ [14:11:10] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by 
title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim [14:11:10] /wiki/Services/Monitoring/restbase [14:11:10] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.or [14:11:10] ESTBase [14:11:10] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim [14:11:10] /wiki/Services/Monitoring/restbase [14:11:10] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.or [14:11:10] !incidents [14:11:11] 3681 (UNACKED) ProbeDown sre (10.2.2.28 ip4 parsoid-php:443 probes/service http_parsoid-php_ip4 eqiad) [14:11:11] ESTBase [14:11:11] 3680 (RESOLVED) Primary outbound 
port utilisation over 80% (paged) global noc (cr2-eqsin.wikimedia.org) [14:11:11] 3678 (RESOLVED) Host db2110 (paged) - PING - Packet loss = 100% [14:11:11] 3679 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqsin.wikimedia.org) [14:11:12] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim [14:11:12] /wiki/Services/Monitoring/restbase [14:11:12] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:11:14] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim [14:11:14] /wiki/Services/Monitoring/restbase [14:11:14] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 
(expecting: 200) https://wikitech.wikim [14:11:14] /wiki/Services/Monitoring/restbase [14:11:15] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim [14:11:15] /wiki/Services/Monitoring/restbase [14:11:16] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim [14:11:16] /wiki/Services/Monitoring/restbase [14:11:17] PROBLEM - restbase endpoints health on restbase2027 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:11:17] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [14:11:17] bblack: reversed that anyway PyBal back 2m30s [14:11:18] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected 
status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim [14:11:18] /wiki/Services/Monitoring/restbase [14:11:19] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim [14:11:19] /wiki/Services/Monitoring/restbase [14:11:20] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim [14:11:20] /wiki/Services/Monitoring/restbase [14:11:20] !ack 3681 [14:11:21] 3681 (ACKED) ProbeDown sre (10.2.2.28 ip4 parsoid-php:443 probes/service http_parsoid-php_ip4 eqiad) [14:11:21] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim [14:11:21] /wiki/Services/Monitoring/restbase [14:11:22] PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox 
(get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [14:11:22] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featu [14:11:23] e data for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [14:11:23] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:11:24] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{dom [14:11:24] media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image 
data for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title [14:11:25] TICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [14:11:25] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim [14:11:26] /wiki/Services/Monitoring/restbase [14:11:26] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim [14:11:27] /wiki/Services/Monitoring/restbase [14:11:27] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:11:28] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned 
the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:11:28] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:11:29] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim [14:11:29] /wiki/Services/Monitoring/restbase [14:11:30] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.or [14:11:30] ESTBase [14:11:39] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbproxy1026.mgmt.eqiad.wmnet with reboot policy FORCED [14:11:41] !log kartik@deploy1002 kartik: Backport for [[gerrit:923268|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]], [[gerrit:923269|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [14:11:42] PROBLEM - restbase endpoints health on 
restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim [14:11:42] /wiki/Services/Monitoring/restbase [14:12:00] bblack: my bad yes, sukhe explained the order of ops properly to me and I messed up and shut pybal ahead of locking deploys [14:12:02] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [14:12:03] I suspect they're all depooled [14:12:05] (VarnishUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [14:12:07] I'll wait for a while to go ahead then.. [14:12:10] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [14:12:19] kart_: yes, sorry! [14:12:32] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.28:443]) https://wikitech.wikimedia.org/wiki/PyBal [14:12:40] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbproxy1027.mgmt.eqiad.wmnet with reboot policy FORCED [14:12:51] time to update https://www.wikimediastatus.net/ ? 
[14:13:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T336886)', diff saved to https://phabricator.wikimedia.org/P48545 and previous config saved to /var/cache/conftool/dbconfig/20230525-141318-ladsgroup.json [14:13:20] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.28:443]) https://wikitech.wikimedia.org/wiki/PyBal [14:13:27] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41343/console" [puppet] - 10https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:13:39] I'll pull all of parsoid in eqiad [14:13:39] (MediaWikiLatencyExceeded) firing: (3) Average latency high: eqiad appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExc [14:13:42] (MediaWikiLatencyExceeded) firing: Average latency high: ... 
[14:13:45] it's depooled per https://config-master.wikimedia.org/pybal/eqiad/parsoid-php [14:13:48] eqiad api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:14:07] (ProbeDown) firing: (15) Service api-https:443 has failed probes (http_api-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:14:20] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: service=parsoid-php,dc=eqiad [14:14:26] RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [14:14:30] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:14:37] ^ my conftool above should mitigate [14:14:46] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:14:50] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:14:52] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:14:58] (03CR) 10JHathaway: [C: 03+1] puppetmnaster::frontend: configure puppetmaster2004 as a canary [puppet] - 10https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:15:07] (ProbeDown) firing: (15) Service api-https:443 has failed probes 
(http_api-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:15:16] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:15:22] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:15:50] RECOVERY - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:15:56] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:15:56] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:15:58] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:16:00] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:16:26] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:16:31] we also seem to have all mw servers in eqiad depooled [14:16:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:44] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:16:52] RECOVERY - restbase 
endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:16:52] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:16:52] urbanecm: right now? which service keys? [14:16:56] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:05] here [14:17:14] bblack: https://config-master.wikimedia.org/pybal/eqiad/appservers-https claims enabled:false [14:17:32] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:34] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:36] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:41] are you performing maintenance or are you experiencing technical problem? [14:17:44] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [14:17:46] RECOVERY - Host releases2003 is UP: PING OK - Packet loss = 0%, RTA = 31.97 ms [14:18:00] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:00] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:00] AzaTht: we're experiencing a technical problem, but we're on it. please be patient. 
[14:18:04] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:18:16] (MediaWikiLatencyExceeded) firing: (2) Average latency high: eqiad api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:18:21] 10SRE, 10SRE-Unowned, 10Discovery-Search, 10Datacenter-Switchover: Warn when CirrusSearch is not configured to use local DC for an extended time - https://phabricator.wikimedia.org/T204135 (10jbond) [14:18:29] urbanecm: (thumbs up) [14:18:34] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:34] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:18:38] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:18:52] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:19:02] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:19:04] 
PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [14:19:04] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:19:06] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:19:06] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:19:06] RECOVERY - restbase endpoints health on restbase2027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:19:07] (ProbeDown) firing: (15) Service api-https:443 has failed probes (http_api-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:19:13] hi [14:19:14] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [14:19:16] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:19:22] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [14:19:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [14:19:49] urbanecm: I was in the middle of deployment and then it failed with lvs errors and then restarted scap - now I'm seating at patch deployed on mwdebug to test and waiting till above issue is fixed :/ [14:20:02] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:20:07] (ProbeDown) firing: (15) Service api-https:443 has failed probes (http_api-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:20:08] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:20:10] this is probably a result of https://phabricator.wikimedia.org/T334703 [14:20:34] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [14:20:36] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:20:40] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:20:40] RECOVERY - restbase endpoints 
health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:20:54] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:20:57] kart_: it's not your fault at all. but yes, let's wait for the sites to be up and then deployment can be finished [14:21:06] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [14:21:12] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:21:19] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=appserver,dc=eqiad [14:21:24] (03PS1) 10Jbond: puppetmaster2004: enable subimt_only [puppet] - 10https://gerrit.wikimedia.org/r/923353 (https://phabricator.wikimedia.org/T330490) [14:21:29] urbanecm: yes. Half deployed thing - I'll wait. 
[14:21:32] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:21:36] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:21:38] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=api_appserver,dc=eqiad [14:21:38] repooling appservers in eqiad [14:21:42] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:21:42] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:21:59] I suggest setting https://www.wikimediastatus.net/ more than "editing issues", I'm just getting a generic Error message atm [14:22:02] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:22:03] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=appserver [14:22:04] (03PS6) 10Jbond: puppetmnaster::frontend: configure puppetmaster2004 as a canary [puppet] - 10https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490) [14:22:07] (03PS2) 10Jbond: puppetmaster2004: enable subimt_only [puppet] - 10https://gerrit.wikimedia.org/r/923353 (https://phabricator.wikimedia.org/T330490) [14:22:10] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:22:10] RECOVERY - restbase endpoints health on 
restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:22:10] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:22:10] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:22:10] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [14:22:10] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:22:12] RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [14:22:14] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:22:14] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:22:14] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072'] [14:22:14] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [14:22:18] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:22:22] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:22:26] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=jobrunner [14:22:28] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All 
endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:22:36] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=videoscaler [14:22:38] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:22:42] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:23:00] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [14:23:04] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:23:10] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:23:10] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:23:14] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:23:21] jobrunners/videoscalers repooled [14:23:22] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:23:42] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41344/console" [puppet] - 10https://gerrit.wikimedia.org/r/923353 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:24:07] (ProbeDown) resolved: (15) Service api-https:443 has failed probes (http_api-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:24:30] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service,httpbb_kubernetes_mw-api-int_hourly.service,httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: (2) Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [14:25:07] (ProbeDown) firing: (15) Service api-https:443 has failed probes (http_api-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:25:39] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbproxy1026'] [14:25:53] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbproxy1022'] [14:26:09] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dbproxy1022'] [14:26:15] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbproxy1022'] [14:26:22] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:26:27] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dbproxy1022'] [14:26:43] (VarnishUnavailable) resolved: varnish-text has reduced HTTP availability 
#page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [14:26:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [14:26:53] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbproxy1023'] [14:27:07] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbproxy1022'] [14:27:11] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dbproxy1022'] [14:27:26] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dbproxy1023'] [14:27:37] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbproxy1023'] [14:27:40] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:27:44] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbproxy1022'] [14:28:01] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dbproxy1022'] [14:28:04] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dbproxy1023'] [14:28:12] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbproxy1024'] [14:28:16] (MediaWikiLatencyExceeded) firing: 
(2) Average latency high: eqiad api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:28:18] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbproxy1025'] [14:28:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P48546 and previous config saved to /var/cache/conftool/dbconfig/20230525-142824-ladsgroup.json [14:28:30] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dbproxy1024'] [14:28:31] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dbproxy1025'] [14:28:40] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbproxy1026'] [14:28:45] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbproxy1027'] [14:28:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:29:14] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dbproxy1027'] [14:29:32] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:29:37] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['dbproxy1026'] [14:30:07] (ProbeDown) 
resolved: (9) Service api-https:443 has failed probes (http_api-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:30:31] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10Jclark-ctr) [14:31:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:52] (03CR) 10JHathaway: [C: 03+1] "looks good, typo in commit msg 😊" [puppet] - 10https://gerrit.wikimedia.org/r/923353 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:32:18] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:23] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1022.eqiad.wmnet with OS bullseye [14:32:29] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. 
- https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye [14:33:16] (MediaWikiLatencyExceeded) resolved: (3) Average latency high: eqiad appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyE [14:33:27] (03CR) 10Gmodena: mw-page-content-change-enrich: enable checkpointing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [14:33:38] (03PS7) 10Jbond: puppetmnaster::frontend: configure puppetmaster2004 as a canary [puppet] - 10https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490) [14:33:40] (03PS3) 10Jbond: puppetmaster2004: enable subimt_only [puppet] - 10https://gerrit.wikimedia.org/r/923353 (https://phabricator.wikimedia.org/T330490) [14:33:42] (03PS1) 10Jbond: puppetmaster: fix puppetdb_submit_only_hosts [puppet] - 10https://gerrit.wikimedia.org/r/923356 [14:34:41] (03PS13) 10Gmodena: mw-page-content-change-enrich: enable checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) [14:34:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (2) Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [14:36:42] (03CR) 10Ottomata: mw-page-content-change-enrich: enable checkpointing 
(032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [14:36:44] (03CR) 10Ottomata: [C: 03+2] mw-page-content-change-enrich: enable checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [14:36:54] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:37:29] (03Merged) 10jenkins-bot: mw-page-content-change-enrich: enable checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [14:38:12] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:38:30] (03CR) 10Volans: profile::configmaster: dump a json data structure of the pools (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705) (owner: 10Giuseppe Lavagetto) [14:40:02] 10SRE, 10serviceops-radar, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q4), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi) [14:40:04] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:40:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41345/console" [puppet] - 10https://gerrit.wikimedia.org/r/923353 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:40:36] Are we good to continue 
deployment in-progress? [14:41:36] bblack jayme ^ what do you think ? [14:42:26] I think we're okay to resume, I'd like a second opinion though [14:42:37] Sure. I'll wait. [14:43:16] we still have elevated latencies on appservers, let's see if that settles first [14:43:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P48547 and previous config saved to /var/cache/conftool/dbconfig/20230525-144330-ladsgroup.json [14:44:04] good call yeah [14:44:28] for reference, the alert's dashboard for the appserver latency in this case is https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?from=now-3h&orgId=1&to=now&var-cluster=api_appserver&var-datasource=eqiad+prometheus%2Fops&var-method=GET&viewPanel=9 [14:45:08] Yeah it's specifically api_appservers [14:45:31] that's right api_appservers, my bad [14:51:22] 10SRE, 10serviceops, 10API Platform (RESTbase Deprecation Roadmap), 10Patch-For-Review: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Jdforrester-WMF) [14:54:01] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [14:54:52] !log Wikireplicas are lagging behind for the following sections: s1, s2, s5, s7 T337446 [14:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:57] T337446: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 [14:57:04] (03CR) 10Dzahn: "Oh, interesting option I had not even considered. 
Seems like a good idea until everything is in service catalog, which yes, has a lot of o" [puppet] - 10https://gerrit.wikimedia.org/r/923342 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [14:58:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T336886)', diff saved to https://phabricator.wikimedia.org/P48548 and previous config saved to /var/cache/conftool/dbconfig/20230525-145836-ladsgroup.json [14:58:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [14:58:41] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [14:58:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [14:58:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T336886)', diff saved to https://phabricator.wikimedia.org/P48549 and previous config saved to /var/cache/conftool/dbconfig/20230525-145857-ladsgroup.json [15:03:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T336886)', diff saved to https://phabricator.wikimedia.org/P48550 and previous config saved to /var/cache/conftool/dbconfig/20230525-150347-ladsgroup.json [15:03:52] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [15:04:38] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on cr1-eqiad,lsw1-e1-eqiad.mgmt with reason: Migrate lsw1-e1-eqiad to cr1-eqiad link to ssw1-e1-eqiad [15:04:53] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cr1-eqiad,lsw1-e1-eqiad.mgmt with reason: Migrate lsw1-e1-eqiad to cr1-eqiad link to ssw1-e1-eqiad [15:05:00] 10SRE, 10Infrastructure-Foundations, 10netops: Migrate row E/F network aggregation to dedicated Spine switches - 
https://phabricator.wikimedia.org/T322937 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=03f7b2ab-bdea-4c56-ac41-3ec30004db4a) set by cmooney@cumin1001 for 0:30:00 on 2 host(s... [15:05:55] jayme: just ping me when it is OK to deploy. Or should I abandon it? It is deployed in wmf.9/mwdebug servers. [15:06:13] Patch: https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/923268 (wmf.9) [15:07:12] kart_: latency is trending down and almost back to normal. I'd say we will be clear in a couple of minutes [15:07:54] +1 [15:08:16] (MediaWikiLatencyExceeded) resolved: Average latency high: ... [15:08:16] eqiad api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:08:53] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gerrit2002.wikimedia.org with reason: maintenance [15:08:55] eheh [15:09:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gerrit2002.wikimedia.org with reason: maintenance [15:10:41] !log gerrit-replica.wikimedia.org - gerrit2002 - reimaging - scheduled maintenance [15:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:48] !log Migrating cr1-eqiad downlink to row E/F from lsw1-e1-eqiad et-0/0/48 to ssw1-e1-eqiad et-0/0/31 [15:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:30] kart_ godog: I would say go ahead (cc bblack | urandom) [15:11:41] jayme: thanks. [15:12:14] Going ahead.. 
[15:14:31] !log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host gerrit2002.wikimedia.org with OS bullseye [15:15:04] agreed [15:15:38] (03PS1) 10Aklapper: Automate yearly Phabricator metrics for wikitech-l [puppet] - 10https://gerrit.wikimedia.org/r/923367 (https://phabricator.wikimedia.org/T337388) [15:17:25] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 (10Clement_Goubert) [15:17:55] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 (10Clement_Goubert) p:05Triage→03High [15:18:07] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 (10Clement_Goubert) [15:18:12] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:923268|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]], [[gerrit:923269|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]] (duration: 68m 07s) [15:18:16] T336838: Avoid the Contributions menu to disappear on mobile web - https://phabricator.wikimedia.org/T336838 [15:18:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P48551 and previous config saved to /var/cache/conftool/dbconfig/20230525-151853-ladsgroup.json [15:20:13] jayme: looks good on wmf.9 [15:20:23] cool, thanks! 
[15:20:29] !log kartik@deploy1002 Started scap: Backport for [[gerrit:923269|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]] [15:20:34] jayme: finishing pending on wmf.10 [15:20:46] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on cr2-eqiad,lsw1-f1-eqiad.mgmt with reason: Migrate lsw1-e1-eqiad to cr2-eqiad link to ssw1-e1-eqiad [15:21:01] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cr2-eqiad,lsw1-f1-eqiad.mgmt with reason: Migrate lsw1-e1-eqiad to cr2-eqiad link to ssw1-e1-eqiad [15:21:08] 10SRE, 10Infrastructure-Foundations, 10netops: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cf76e0ba-8648-48a0-beed-fe7b60f79656) set by cmooney@cumin1001 for 0:30:00 on 2 host(s... [15:21:58] !log kartik@deploy1002 kartik: Backport for [[gerrit:923269|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [15:24:54] (03PS1) 10Hnowlan: svg: attempt to build valid locales from hyphenated languages [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/923368 (https://phabricator.wikimedia.org/T337139) [15:27:30] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:923269|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]] (duration: 07m 01s) [15:27:35] T336838: Avoid the Contributions menu to disappear on mobile web - https://phabricator.wikimedia.org/T336838 [15:28:00] 10SRE, 10Traffic, 10serviceops, 10Platform Team Initiatives (API Gateway): Handle edge cache invalidation for the api gateway - https://phabricator.wikimedia.org/T324200 (10elukey) The ML team is serving its Lift Wing model servers via the API gateway, so we'd benefit as well 
to have edge caching :) [15:28:19] jayme: I'm all done. Thanks a lot, SREs for taking care of issues. [15:28:35] ack, thanks! [15:28:40] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy1022.eqiad.wmnet with OS bullseye [15:28:41] (and all those who were helping, of course!) [15:28:55] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors: - db... [15:30:09] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gerrit2002.wikimedia.org with reason: host reimage [15:30:56] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:56] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:33:19] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gerrit2002.wikimedia.org with reason: host reimage [15:33:43] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on lsw1-e[1-2]-eqiad.mgmt with reason: Migrate lsw1-e1-eqiad to cr1-eqiad link to ssw1-e1-eqiad [15:33:57] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lsw1-e[1-2]-eqiad.mgmt with reason: Migrate lsw1-e1-eqiad to cr1-eqiad link to ssw1-e1-eqiad [15:34:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P48552 and previous config saved to /var/cache/conftool/dbconfig/20230525-153359-ladsgroup.json [15:34:03] 10SRE, 10Infrastructure-Foundations, 10netops: 
Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8f44dd48-0cac-4bfd-907a-512dfa686d40) set by cmooney@cumin1001 for 0:30:00 on 2 host(s... [15:34:33] (03CR) 10Jbond: puppetmaster: add new function to check for local files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922877 (https://phabricator.wikimedia.org/T268344) (owner: 10Jbond) [15:37:51] (03CR) 10Ottomata: [C: 03+1] "Assuming this is old, can we abandon?" [puppet] - 10https://gerrit.wikimedia.org/r/827526 (https://phabricator.wikimedia.org/T315580) (owner: 10Snwachukwu) [15:38:27] (03CR) 10Ottomata: "Should we merge this?" [puppet] - 10https://gerrit.wikimedia.org/r/908529 (https://phabricator.wikimedia.org/T327073) (owner: 10Aqu) [15:43:36] !log dancy@deploy1002 Started deploy [integration/docroot@78e6f40]: (no justification provided) [15:43:46] !log dancy@deploy1002 Finished deploy [integration/docroot@78e6f40]: (no justification provided) (duration: 00m 10s) [15:44:05] !log dancy@deploy1002 Updated scap URLs on doc.wikimedia.org [15:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T336886)', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20230525-154906-ladsgroup.json [15:49:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2180.codfw.wmnet with reason: Maintenance [15:49:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2180.codfw.wmnet with reason: Maintenance [15:49:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T336886)', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20230525-154927-ladsgroup.json [15:49:32] T336886: Add user_is_temp field to the user table in MediaWiki core - 
https://phabricator.wikimedia.org/T336886 [15:49:48] !log dancy@deploy1002 Started deploy [integration/docroot@dac2b70]: Updated Scap URLs [15:49:56] !log dancy@deploy1002 Finished deploy [integration/docroot@dac2b70]: Updated Scap URLs (duration: 00m 07s) [15:50:01] (ProbeDown) firing: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:50:25] do we have some known reason for the phab thing above? [15:50:37] (03CR) 10Ottomata: "Okay, so. A reason why this will make metrics and dashboarding weird:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/896362 (owner: 10DCausse) [15:50:59] not that I'm aware of yet, looks like v6 only tho [15:51:26] maybe the row E/F -related work? [15:51:34] (03CR) 10Ottomata: rdf-streaming-updater: add a "wcqs" release (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/896362 (owner: 10DCausse) [15:51:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T336886)', diff saved to https://phabricator.wikimedia.org/P48553 and previous config saved to /var/cache/conftool/dbconfig/20230525-155139-ladsgroup.json [15:51:46] hmmm that's in row B though [15:52:02] phabricator itself seems ok? [15:52:21] I wonder why we're paging on an individual server and not the overall service? 
[15:52:22] yeah I see the probe recovered, the alert will recover shortly I'm assuming [15:53:26] bblack: good question, I'm making a note to dig deeper tomorrow on why is that [15:54:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:54:38] I guess the alert is for the real public IP, it's just associated to the currently-active individual server [15:54:48] so, while it's confusing, it does make sense to page [15:55:01] (ProbeDown) resolved: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:55:45] mmhh yeah confusing alright, the probed name should be there not the hostname [15:56:48] FWIW related task is https://phabricator.wikimedia.org/T312840 and obviously I haven't got around to it yet [15:56:57] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on lsw1-e2-eqiad.mgmt,lsw1-f1-eqiad.mgmt with reason: Migrate lsw1-e2-eqiad uplink from lsw1-f1 to ssw1-f1 [15:56:57] (03PS1) 10Ladsgroup: BannerRenderer: Make sure the language variant is valid [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427) [15:57:11] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lsw1-e2-eqiad.mgmt,lsw1-f1-eqiad.mgmt with reason: Migrate lsw1-e2-eqiad uplink from lsw1-f1 to ssw1-f1 [15:57:16] 10SRE, 10Infrastructure-Foundations, 10netops: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 
(10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c43be552-7ced-4f58-99c1-a10b5984bf3a) set by cmooney@cumin1001 for 0:30:00 on 2 host(s... [15:57:25] (03PS1) 10BBlack: Bugfix for hotlink URL patch earlier [puppet] - 10https://gerrit.wikimedia.org/r/923375 [15:57:51] (03CR) 10CI reject: [V: 04-1] Bugfix for hotlink URL patch earlier [puppet] - 10https://gerrit.wikimedia.org/r/923375 (owner: 10BBlack) [15:58:50] (03PS1) 10Elukey: helmfile.d: attempt to fix changeprop's staging config for Lift Wing [deployment-charts] - 10https://gerrit.wikimedia.org/r/923376 [15:59:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:59:40] (03CR) 10CI reject: [V: 04-1] BannerRenderer: Make sure the language variant is valid [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427) (owner: 10Ladsgroup) [15:59:45] (03PS2) 10BBlack: fix the UA matching in the earlier hotlink patch [puppet] - 10https://gerrit.wikimedia.org/r/923375 [16:00:05] jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:52] (03CR) 10BBlack: [C: 03+2] fix the UA matching in the earlier hotlink patch [puppet] - 10https://gerrit.wikimedia.org/r/923375 (owner: 10BBlack) [16:01:19] (03CR) 10Ladsgroup: "recheck" [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427) (owner: 10Ladsgroup) [16:02:20] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gerrit2002.wikimedia.org with OS bullseye [16:04:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:10] 10SRE, 10LDAP-Access-Requests: Grant Access to developer account/wmf for Amal Ramadan - https://phabricator.wikimedia.org/T337492 (10Aklapper) Hi @ARamadan-WMF, thanks for taking the time to report this! I assume this is about https://wikitech.wikimedia.org/wiki/Special:CreateAccount ? Which exact wiki usernam... 
[16:06:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P48555 and previous config saved to /var/cache/conftool/dbconfig/20230525-160645-ladsgroup.json [16:07:09] PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [16:07:31] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on gerrit2002.wikimedia.org with reason: maintenance [16:07:44] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on gerrit2002.wikimedia.org with reason: maintenance [16:07:57] ACKNOWLEDGEMENT - Check systemd state on gerrit2002 is CRITICAL: CRITICAL - degraded: The following units failed: gerrit.service daniel_zahn https://phabricator.wikimedia.org/T334521 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:07:57] ACKNOWLEDGEMENT - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site daniel_zahn https://phabricator.wikimedia.org/T334521 https://wikitech.wikimedia.org/wiki/Gerrit [16:11:15] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on lsw1-e[1,3]-eqiad.mgmt,lsw1-f1-eqiad.mgmt with reason: Migrate lsw1-e3-eqiad uplinks to spine [16:11:30] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lsw1-e[1,3]-eqiad.mgmt,lsw1-f1-eqiad.mgmt with reason: Migrate lsw1-e3-eqiad uplinks to spine [16:11:36] 10SRE, 10Infrastructure-Foundations, 10netops: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10ops-monitoring-bot) Icinga 
downtime and Alertmanager silence (ID=37545969-c51e-450d-9ef0-5fadfd151520) set by cmooney@cumin1001 for 0:30:00 on 3 host(s... [16:14:20] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [16:14:24] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [16:16:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:12] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [16:18:16] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [16:18:37] (03PS1) 10Dzahn: gerrit: update SSH host key for reimaged gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/923378 (https://phabricator.wikimedia.org/T334521) [16:19:31] RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [16:20:44] (03CR) 10Hashar: [C: 03+1] gerrit: update SSH host key for reimaged gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/923378 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [16:20:53] (03CR) 10Dzahn: [C: 03+2] gerrit: update SSH host key for reimaged gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/923378 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [16:21:07] (03PS2) 10Dzahn: gerrit: update SSH host key for reimaged gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/923378 (https://phabricator.wikimedia.org/T334521) [16:21:21] (03CR) 10Dzahn: [V: 03+2] gerrit: update SSH host key for reimaged gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/923378 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [16:21:38] 
(03CR) 10Func: "It seems the `wmf_deploy` branch should be used instead. Not sure how that works." [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427) (owner: 10Ladsgroup) [16:21:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P48556 and previous config saved to /var/cache/conftool/dbconfig/20230525-162151-ladsgroup.json [16:22:18] (03CR) 10Func: BannerRenderer: Make sure the language variant is valid (031 comment) [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427) (owner: 10Ladsgroup) [16:28:39] (03Abandoned) 10Ladsgroup: BannerRenderer: Make sure the language variant is valid [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427) (owner: 10Ladsgroup) [16:29:04] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10Hghani) Hi, I've setup the Kerberos authentication but I am having trouble signing into Jupyterhub and Wikimedia Dev single sign on: {F37034637} {F370346... [16:29:19] 10SRE, 10PyBal, 10Release-Engineering-Team, 10Scap, and 4 others: High rate of errors and increased latency on uncached MediaWiki requests due to infrastructure outage - https://phabricator.wikimedia.org/T337497 (10jcrespo) [16:34:17] (03CR) 10Clément Goubert: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/923384 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert) [16:34:33] (03CR) 10Clément Goubert: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/923385 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert) [16:34:48] (03CR) 10Clément Goubert: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert) [16:34:57] (03CR) 10Cathal Mooney: [C: 03+2] Add class-of-service parent interface shaper for sub-rated services (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/922603 (https://phabricator.wikimedia.org/T337220) (owner: 10Cathal Mooney) [16:35:40] (03Merged) 10jenkins-bot: Add class-of-service parent interface shaper for sub-rated services [homer/public] - 10https://gerrit.wikimedia.org/r/922603 (https://phabricator.wikimedia.org/T337220) (owner: 10Cathal Mooney) [16:36:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T336886)', diff saved to https://phabricator.wikimedia.org/P48557 and previous config saved to /var/cache/conftool/dbconfig/20230525-163657-ladsgroup.json [16:37:03] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [16:37:29] (03PS2) 10Clément Goubert: mw-on-k8s: Redirect closed wikis to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490) [16:37:48] (03PS1) 10Cathal Mooney: Move row E/F core router uplinks to Spine switches [homer/public] - 10https://gerrit.wikimedia.org/r/923387 (https://phabricator.wikimedia.org/T322937) [16:39:03] jouncebot: nowandnext [16:39:03] For the next 0 hour(s) and 20 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T1600) [16:39:03] In 0 hour(s) and 20 minute(s): Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T1700) [16:39:03] In 0 hour(s) and 20 minute(s): MediaWiki infrastucture (UTC late) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T1700) [16:39:25] (03PS2) 10Clément Goubert: testwikidatawiki: Fix missing mobile redir to k8s [puppet] - 10https://gerrit.wikimedia.org/r/923384 (https://phabricator.wikimedia.org/T337490) [16:39:27] (03PS2) 10Clément Goubert: mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923385 (https://phabricator.wikimedia.org/T337490) [16:39:29] (03PS3) 10Clément Goubert: mw-on-k8s: Redirect closed wikis to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490) [16:39:48] !log adding outbound shaper config on eqsin to codfw transport cct (T328313) [16:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:52] T328313: Expose sub-rated circuit speeds to Homer templates - https://phabricator.wikimedia.org/T328313 [16:41:58] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41346/console" [puppet] - 10https://gerrit.wikimedia.org/r/923384 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert) [16:42:29] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41347/console" [puppet] - 10https://gerrit.wikimedia.org/r/923385 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert) [16:42:58] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41348/console" [puppet] - 10https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert) [16:46:56] (03CR) 10Cathal Mooney: [C: 03+2] Move row E/F core router uplinks to Spine switches [homer/public] - 10https://gerrit.wikimedia.org/r/923387 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [16:47:16] (03PS1) 10BBlack: 
pybal: add support for advertised instrumentation [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/923389 (https://phabricator.wikimedia.org/T334703) [16:47:35] (03Merged) 10jenkins-bot: Move row E/F core router uplinks to Spine switches [homer/public] - 10https://gerrit.wikimedia.org/r/923387 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [16:48:33] (03PS3) 10Clément Goubert: testwikidatawiki: Fix missing mobile redir to k8s [puppet] - 10https://gerrit.wikimedia.org/r/923384 (https://phabricator.wikimedia.org/T337490) [16:48:35] (03PS3) 10Clément Goubert: mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923385 (https://phabricator.wikimedia.org/T337490) [16:48:37] (03PS4) 10Clément Goubert: mw-on-k8s: Redirect closed wikis to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490) [16:48:51] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Create Quality of Service design for WMF internal networks - https://phabricator.wikimedia.org/T316358 (10cmooney) [16:49:44] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Expose sub-rated circuit speeds to Homer templates - https://phabricator.wikimedia.org/T328313 (10cmooney) 05Open→03Resolved Merged and shapers set on codfw to eqsin link. [16:50:33] (03PS1) 10David Caro: wmcs-backup: a couple fixes [puppet] - 10https://gerrit.wikimedia.org/r/923390 [16:51:39] 10SRE, 10ops-codfw, 10DBA: db2110 crashed - https://phabricator.wikimedia.org/T337445 (10wiki_willy) a:03Jhancock.wm Hi @Marostegui - Papaul is on paternity leave for another week, so I'm going to pass this over to @Jhancock.wm to check out. The server is about 4yrs old, so it's out of warranty, but there... 
[16:52:22] (03PS4) 10Aqu: analytics: Remove extra check on webrequest _SUCCESS files on HDFS [puppet] - 10https://gerrit.wikimedia.org/r/908529 (https://phabricator.wikimedia.org/T327073) [16:53:20] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41349/console" [puppet] - 10https://gerrit.wikimedia.org/r/923384 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert) [16:53:26] (03CR) 10Aqu: "Rebased and ready for merge." [puppet] - 10https://gerrit.wikimedia.org/r/908529 (https://phabricator.wikimedia.org/T327073) (owner: 10Aqu) [16:54:23] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41350/console" [puppet] - 10https://gerrit.wikimedia.org/r/923385 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert) [16:54:40] (03PS1) 10BryanDavis: toolhub: Bump container version to 2023-05-25-111820-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/923391 [16:55:29] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41351/console" [puppet] - 10https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert) [16:55:45] (03PS4) 10Robertsky: Change project logo for Wikimania to Wikimania 2023 version T337044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921610 [16:56:14] (03PS1) 10BryanDavis: developer-portal: Bump container version to 2023-05-22-111728-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/923393 [16:57:21] (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container version to 2023-05-25-111820-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/923391 (owner: 10BryanDavis) [16:57:58] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container version to 2023-05-22-111728-production 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/923393 (owner: 10BryanDavis) [16:58:14] (03Merged) 10jenkins-bot: toolhub: Bump container version to 2023-05-25-111820-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/923391 (owner: 10BryanDavis) [16:59:04] (03Merged) 10jenkins-bot: developer-portal: Bump container version to 2023-05-22-111728-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/923393 (owner: 10BryanDavis) [17:00:06] bd808: gettimeofday() says it's time for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T1700) [17:00:06] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T1700) [17:00:59] o/ I have deploys for toolhub and developer-portal today. I'll start on them fairly soon. [17:01:42] (03PS1) 10Cathal Mooney: Adjust Eqiad row E/F switch parents in hierdata after cable moves [puppet] - 10https://gerrit.wikimedia.org/r/923395 (https://phabricator.wikimedia.org/T322937) [17:02:14] 10SRE, 10ops-codfw, 10DBA: db2110 crashed - https://phabricator.wikimedia.org/T337445 (10Marostegui) Yeah, I wonder if there's anything we can do to troubleshoot this from a hardware point of view. 
[17:03:20] (03PS1) 10Jbond: ferm: Add types and docs to ferm::client [puppet] - 10https://gerrit.wikimedia.org/r/923396 [17:03:53] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply [17:05:08] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [17:05:22] (03CR) 10CI reject: [V: 04-1] ferm: Add types and docs to ferm::client [puppet] - 10https://gerrit.wikimedia.org/r/923396 (owner: 10Jbond) [17:06:13] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply [17:06:13] (03PS2) 10Jbond: ferm: Add types and docs to ferm::client [puppet] - 10https://gerrit.wikimedia.org/r/923396 [17:06:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) We migrated a bunch of network <-> network links today without issue (crossed them out in above table). Didn't touch the LVS's aft... [17:07:33] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [17:08:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41353/console" [puppet] - 10https://gerrit.wikimedia.org/r/923396 (owner: 10Jbond) [17:08:26] (03CR) 10CI reject: [V: 04-1] ferm: Add types and docs to ferm::client [puppet] - 10https://gerrit.wikimedia.org/r/923396 (owner: 10Jbond) [17:08:43] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [17:09:19] (03PS3) 10Jbond: ferm: Add types and docs to ferm::client [puppet] - 10https://gerrit.wikimedia.org/r/923396 [17:09:49] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [17:12:39] (03PS4) 10Jbond: ferm: Add types and docs to ferm::client [puppet] - 10https://gerrit.wikimedia.org/r/923396 [17:12:56] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:13:18] !log bd808@deploy1002 
helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:14:09] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:14:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41354/console" [puppet] - 10https://gerrit.wikimedia.org/r/923396 (owner: 10Jbond) [17:14:44] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:14:59] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:14:59] (03CR) 10Ssingh: [C: 03+2] pybal: add support for advertised instrumentation [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/923389 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [17:15:30] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:15:39] (03CR) 10Ssingh: [V: 03+2 C: 03+2] pybal: add support for advertised instrumentation [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/923389 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [17:17:19] (03PS5) 10Jbond: ferm: Add types and docs to ferm::client [puppet] - 10https://gerrit.wikimedia.org/r/923396 [17:17:38] * bd808 is done deploying things [17:17:47] forever? :) [17:18:23] (03CR) 10Ottomata: [C: 03+2] analytics: Remove extra check on webrequest _SUCCESS files on HDFS [puppet] - 10https://gerrit.wikimedia.org/r/908529 (https://phabricator.wikimedia.org/T327073) (owner: 10Aqu) [17:18:24] eh. 
probably just for May 2023 :) [17:19:43] (03PS6) 10Jbond: ferm: Add types and docs to ferm::client [puppet] - 10https://gerrit.wikimedia.org/r/923396 [17:21:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 9): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41356/console" [puppet] - 10https://gerrit.wikimedia.org/r/923396 (owner: 10Jbond) [17:22:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 14): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41357/console" [puppet] - 10https://gerrit.wikimedia.org/r/923396 (owner: 10Jbond) [17:22:49] (03PS1) 10Ssingh: Release 1.15.12 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/923399 (https://phabricator.wikimedia.org/T334703) [17:23:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48558 and previous config saved to /var/cache/conftool/dbconfig/20230525-172326-root.json [17:23:33] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [17:24:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48559 and previous config saved to /var/cache/conftool/dbconfig/20230525-172413-root.json [17:25:10] (03CR) 10Ssingh: [C: 03+2] Release 1.15.12 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/923399 (https://phabricator.wikimedia.org/T334703) (owner: 10Ssingh) [17:26:20] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS entires for migration IPs eqiad row E F switches. - cmooney@cumin1001" [17:27:25] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS entires for migration IPs eqiad row E F switches. 
- cmooney@cumin1001" [17:27:25] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:38:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 2%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48561 and previous config saved to /var/cache/conftool/dbconfig/20230525-173831-root.json [17:39:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48562 and previous config saved to /var/cache/conftool/dbconfig/20230525-173918-root.json [17:41:40] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10cmooney) Step 2 - Move CR Uplinks has now been completed. We are also 50% of the way through steps 3 and 4. Will continue with... [17:53:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48563 and previous config saved to /var/cache/conftool/dbconfig/20230525-175335-root.json [17:54:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48564 and previous config saved to /var/cache/conftool/dbconfig/20230525-175423-root.json [17:56:43] (03PS1) 10BBlack: pybal: quick bugfix for advertised instrumentation [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/923404 (https://phabricator.wikimedia.org/T334703) [18:00:06] ^demon and dancy: (Dis)respected human, time to deploy MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T1800). Please do the needful. 
[18:02:43] (03CR) 10Ssingh: [C: 03+2] pybal: quick bugfix for advertised instrumentation [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/923404 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [18:05:52] (03PS1) 10Ssingh: Release 1.15.13 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/923405 (https://phabricator.wikimedia.org/T334703) [18:08:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48565 and previous config saved to /var/cache/conftool/dbconfig/20230525-180840-root.json [18:08:51] (03CR) 10Ssingh: [C: 03+2] Release 1.15.13 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/923405 (https://phabricator.wikimedia.org/T334703) (owner: 10Ssingh) [18:09:12] (03CR) 10Ssingh: [V: 03+2 C: 03+2] Release 1.15.13 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/923405 (https://phabricator.wikimedia.org/T334703) (owner: 10Ssingh) [18:09:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48566 and previous config saved to /var/cache/conftool/dbconfig/20230525-180927-root.json [18:15:59] (03PS1) 10Cathal Mooney: Fix cable validator to allow editing of existing cable [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/923406 (https://phabricator.wikimedia.org/T310590) [18:20:17] (03PS2) 10Cathal Mooney: Fix cable validator to allow editing of existing cable [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/923406 (https://phabricator.wikimedia.org/T310590) [18:20:44] (03PS1) 10Kimberly Sarabia: Reapply new fix to en beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923407 (https://phabricator.wikimedia.org/T336969) [18:23:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 25%: Repooling after maintenance', diff saved to 
https://phabricator.wikimedia.org/P48567 and previous config saved to /var/cache/conftool/dbconfig/20230525-182345-root.json [18:24:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48568 and previous config saved to /var/cache/conftool/dbconfig/20230525-182432-root.json [18:30:41] jouncebot: now [18:30:41] For the next 1 hour(s) and 29 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T1800) [18:38:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48570 and previous config saved to /var/cache/conftool/dbconfig/20230525-183849-root.json [18:39:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48571 and previous config saved to /var/cache/conftool/dbconfig/20230525-183937-root.json [18:39:41] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-backup: a couple fixes [puppet] - 10https://gerrit.wikimedia.org/r/923390 (owner: 10David Caro) [18:42:31] dancy: no train deployment today, correct? 
sorry just checking since we will do a scap lock to test some LVS changes to prevent future scap locks when doing LVS change :) [18:43:12] !log htriedman@deploy1002 Started deploy [airflow-dags/platform_eng@6b27584]: (no justification provided) [18:43:31] !log htriedman@deploy1002 Finished deploy [airflow-dags/platform_eng@6b27584]: (no justification provided) (duration: 00m 19s) [18:43:35] (03PS14) 10Andrew Bogott: backy2: Prepare for switch to postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 (https://phabricator.wikimedia.org/T332734) [18:43:37] (03PS1) 10Andrew Bogott: backy2: switch from sqlite to postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/923410 (https://phabricator.wikimedia.org/T332734) [18:46:27] (03CR) 10Andrew Bogott: [C: 03+2] backy2: Prepare for switch to postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 (https://phabricator.wikimedia.org/T332734) (owner: 10Andrew Bogott) [18:53:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48572 and previous config saved to /var/cache/conftool/dbconfig/20230525-185354-root.json [18:54:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48573 and previous config saved to /var/cache/conftool/dbconfig/20230525-185441-root.json [18:57:02] (03PS2) 10Andrew Bogott: backy2: switch from sqlite to postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/923410 (https://phabricator.wikimedia.org/T332734) [18:57:04] (03PS1) 10Andrew Bogott: backy2: include python3-psycopg2 [puppet] - 10https://gerrit.wikimedia.org/r/923412 (https://phabricator.wikimedia.org/T332734) [18:59:23] (03CR) 10Andrew Bogott: [C: 03+2] backy2: include python3-psycopg2 [puppet] - 10https://gerrit.wikimedia.org/r/923412 (https://phabricator.wikimedia.org/T332734) (owner: 10Andrew 
Bogott) [19:00:52] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 4 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " (due to missing physical file for old image e... - https://phabricator.wikimedia.org/T244567 [19:02:28] (03CR) 10Jdlrobson: Enable the new Special:Contribute page entry point for desktop on selected wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921049 (https://phabricator.wikimedia.org/T327868) (owner: 10KartikMistry) [19:02:35] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T337451 (10phaultfinder) [19:08:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48574 and previous config saved to /var/cache/conftool/dbconfig/20230525-190859-root.json [19:09:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48575 and previous config saved to /var/cache/conftool/dbconfig/20230525-190946-root.json [19:17:30] (03CR) 10Stevemunene: [V: 03+1] Update Puppet files for Airflow Upgrade to 2.3.2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/827526 (https://phabricator.wikimedia.org/T315580) (owner: 10Snwachukwu) [19:18:25] (03PS1) 10Jdrewniak: Use document feature classes to extract A/B test state [skins/Vector] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923281 (https://phabricator.wikimedia.org/T335972) [19:19:45] (03PS2) 10DCausse: ttm: use new config option to separate readable and writable services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922481 (https://phabricator.wikimedia.org/T322284) [19:20:24] (03CR) 10DCausse: ttm: use new config option to separate readable and writable services (031 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/922481 (https://phabricator.wikimedia.org/T322284) (owner: 10DCausse) [19:22:19] (03CR) 10Jdrewniak: [C: 03+1] Reapply new fix to en beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923407 (https://phabricator.wikimedia.org/T336969) (owner: 10Kimberly Sarabia) [19:24:59] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/923406 (https://phabricator.wikimedia.org/T310590) (owner: 10Cathal Mooney) [19:27:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:29:24] sukhe: It looks like the train should be unblocked now. Demon do you plan to roll forward today? [19:29:28] !log bblack@cumin1001 START - Cookbook sre.dns.netbox [19:29:57] sukhe: Feel free to hold the scap lock as needed. 
[19:31:25] !log bblack@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add pybal-low-traffic.svc.codfw.wmnet - bblack@cumin1001" [19:32:06] dancy: thanks, we will let you know [19:32:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:32:15] feel free to proceed for now, thanks [19:32:30] !log bblack@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add pybal-low-traffic.svc.codfw.wmnet - bblack@cumin1001" [19:32:30] !log bblack@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:33:22] we will let you know here if we block scap but assume no for now. [19:38:09] (03PS1) 10BBlack: Add pybal-low-traffic.svc.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/923414 (https://phabricator.wikimedia.org/T334703) [19:38:41] Ok [19:39:59] (03CR) 10BBlack: [C: 03+2] Add pybal-low-traffic.svc.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/923414 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [19:52:58] (03PS2) 10Jdrewniak: Enable Vector "Zebra" AB test to enwiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923407 (https://phabricator.wikimedia.org/T336969) (owner: 10Kimberly Sarabia) [19:55:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:56:28] (03PS1) 10Bartosz Dziewoński: Manual backport of OOUI change I63293edd62 (tab dialog fix) [core] 
(wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923282 (https://phabricator.wikimedia.org/T337515) [20:00:06] brennen and TheresNoTime: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T2000). [20:00:06] kimberly_sarabia, Daimona, and MatmaRex: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:00:14] hello [20:00:38] (03PS2) 10Bartosz Dziewoński: Manual backport of OOUI change I63293edd62 (tab dialog fix) [core] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923282 (https://phabricator.wikimedia.org/T337515) [20:00:43] hi [20:00:57] hi, I can deploy :) [20:01:09] whew [20:01:12] :p [20:01:24] kimberly_sarabia: going to start with your beta config patch, 923407, get that out the way [20:01:27] o/ [20:01:44] (03CR) 10Samtar: [C: 03+2] "prep for deploy" [skins/Vector] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923281 (https://phabricator.wikimedia.org/T335972) (owner: 10Jdrewniak) [20:01:49] TheresNoTime: Thanks [20:01:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923407 (https://phabricator.wikimedia.org/T336969) (owner: 10Kimberly Sarabia) [20:02:19] (03PS5) 10Samtar: [prod] Configure logging for the CampaignEvents channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919838 (https://phabricator.wikimedia.org/T337365) (owner: 10Daimona 
Eaytoy) [20:02:35] i am struggling a bit with my backport, but i should have it sorted out in a few minutes [20:02:52] Daimona: I'll then do your config patch, 919838, while the vector one merges if that's okay? [20:03:01] the UBNs always come in minutes before the last deployment slot of the week [20:03:02] Sure, ty! [20:03:05] (03Merged) 10jenkins-bot: Enable Vector "Zebra" AB test to enwiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923407 (https://phabricator.wikimedia.org/T336969) (owner: 10Kimberly Sarabia) [20:03:48] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919838 (https://phabricator.wikimedia.org/T337365) (owner: 10Daimona Eaytoy) [20:04:37] (03Merged) 10jenkins-bot: [prod] Configure logging for the CampaignEvents channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919838 (https://phabricator.wikimedia.org/T337365) (owner: 10Daimona Eaytoy) [20:05:06] !log samtar@deploy1002 Started scap: Backport for [[gerrit:919838|[prod] Configure logging for the CampaignEvents channel (T337365)]] [20:05:10] T337365: Enable CampaignEvents logging in beta and production - https://phabricator.wikimedia.org/T337365 [20:05:21] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T337247 (10wiki_willy) a:03Jhancock.wm [20:05:45] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T337276 (10wiki_willy) a:03Jhancock.wm [20:06:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:06:40] !log samtar@deploy1002 samtar and daimona: Backport for [[gerrit:919838|[prod] Configure logging for the CampaignEvents channel (T337365)]] synced to the testservers: 
mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [20:06:57] Daimona: that's live on mwdebug, can you test? [20:07:29] I don't think it's testable, because no logs can be generated for that channel yet [20:07:46] will sync :) [20:07:46] The best I could do is use shell.php to log something manually, but I'm not sure if it's desirable [20:07:54] Or if there's a smarter way to do that [20:08:16] I've started to sync now [20:08:27] Ok, thanks :) [20:08:32] kimberly_sarabia: your config change should be on beta now-ish [20:08:48] TheresNoTime: Thanks [20:11:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:12:27] (03PS1) 10Hashar: wm-patch-demo: use WARNING to prevent chipset collapsing [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/923418 (https://phabricator.wikimedia.org/T332474) [20:12:35] jouncebot: next [20:12:35] In 9 hour(s) and 47 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230526T0600) [20:12:50] any deploys from above still ongoing? 
[20:13:01] bblack: yes [20:13:06] ok [20:13:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:13:37] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:919838|[prod] Configure logging for the CampaignEvents channel (T337365)]] (duration: 08m 31s) [20:13:40] there's two wmf.10 backports left to do [20:13:42] T337365: Enable CampaignEvents logging in beta and production - https://phabricator.wikimedia.org/T337365 [20:13:51] Daimona: live on prod :) [20:14:41] Amazing, thank you! [20:14:58] kimberly_sarabia: moving on to 923281, needs a few more minutes to merge though [20:15:05] ok [20:15:49] bblack: did you want me to hold off of starting the merge for the next .10 backport? [20:15:55] no go ahead [20:15:58] ack :) [20:16:07] I'm just waiting for an idle time to lock up scap and do some SRE-level things later [20:18:27] MatmaRex: did you sort your patch, can I start it merging? [20:18:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:19:03] TheresNoTime: if it passes tests, yes. i'm waiting to confirm that :D [20:19:24] looks like it has? 
[20:19:41] yep [20:20:05] (03CR) 10Samtar: [C: 03+2] "prep for deploy" [core] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923282 (https://phabricator.wikimedia.org/T337515) (owner: 10Bartosz Dziewoński) [20:20:31] (03Merged) 10jenkins-bot: Use document feature classes to extract A/B test state [skins/Vector] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923281 (https://phabricator.wikimedia.org/T335972) (owner: 10Jdrewniak) [20:21:06] !log samtar@deploy1002 Started scap: Backport for [[gerrit:923281|Use document feature classes to extract A/B test state (T335972)]] [20:21:10] T335972: Launch content separation (Zebra #9) A/B test - https://phabricator.wikimedia.org/T335972 [20:22:35] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T337451 (10phaultfinder) [20:22:35] !log samtar@deploy1002 jdrewniak and samtar: Backport for [[gerrit:923281|Use document feature classes to extract A/B test state (T335972)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:22:59] kimberly_sarabia - live on mwdebug, can you test? [20:23:06] sure [20:26:11] TheresNoTime: LGTM! [20:26:16] (03Abandoned) 10Ottomata: Update Puppet files for Airflow Upgrade to 2.3.2 [puppet] - 10https://gerrit.wikimedia.org/r/827526 (https://phabricator.wikimedia.org/T315580) (owner: 10Snwachukwu) [20:26:17] syncing :) [20:32:04] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:923281|Use document feature classes to extract A/B test state (T335972)]] (duration: 10m 58s) [20:32:08] and live :) [20:32:09] T335972: Launch content separation (Zebra #9) A/B test - https://phabricator.wikimedia.org/T335972 [20:34:26] TheresNoTime: TYSM! [20:34:33] you're welcome! 
[20:37:36] (03Merged) 10jenkins-bot: Manual backport of OOUI change I63293edd62 (tab dialog fix) [core] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923282 (https://phabricator.wikimedia.org/T337515) (owner: 10Bartosz Dziewoński) [20:38:23] !log samtar@deploy1002 Started scap: Backport for [[gerrit:923282|Manual backport of OOUI change I63293edd62 (tab dialog fix) (T337515)]] [20:38:28] T337515: OOUI dialogs with tabs can't be interacted with (except the last tab), e.g. VE image dialog - https://phabricator.wikimedia.org/T337515 [20:40:04] !log samtar@deploy1002 samtar and matmarex: Backport for [[gerrit:923282|Manual backport of OOUI change I63293edd62 (tab dialog fix) (T337515)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:40:07] MatmaRex: live on mwdebug [20:40:33] looking [20:41:20] TheresNoTime: looks good, thank you [20:41:26] syncing [20:45:01] (03Restored) 10Ladsgroup: BannerRenderer: Make sure the language variant is valid [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427) (owner: 10Ladsgroup) [20:45:20] (03PS2) 10Thcipriani: BannerRenderer: Make sure the language variant is valid [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427) (owner: 10Ladsgroup) [20:45:51] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:45:55] heh, 20 seconds slower [20:46:58] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:923282|Manual backport of OOUI change I63293edd62 (tab dialog fix) (T337515)]] (duration: 08m 34s) [20:47:02] MatmaRex: and live :) [20:47:03] T337515: OOUI dialogs with tabs can't be interacted with (except the last tab), e.g. 
VE image dialog - https://phabricator.wikimedia.org/T337515 [20:47:05] thanks! [20:47:27] bblack: done with the backports afaik [20:47:41] !log close UTC late backport [20:47:43] (03CR) 10CI reject: [V: 04-1] BannerRenderer: Make sure the language variant is valid [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427) (owner: 10Ladsgroup) [20:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:51] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:51:05] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:56:05] (03PS7) 10BCornwall: Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) [20:56:19] (03CR) 10BCornwall: "Thanks for all the review, everyone." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [20:58:59] (03CR) 10CI reject: [V: 04-1] Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [21:02:36] (03CR) 10Ayounsi: [C: 03+1] Fix cable validator to allow editing of existing cable [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/923406 (https://phabricator.wikimedia.org/T310590) (owner: 10Cathal Mooney) [21:14:07] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:25:53] !log htriedman@deploy1002 Started deploy [airflow-dags/platform_eng@77cf676]: (no justification provided) [21:26:02] !log htriedman@deploy1002 Finished deploy [airflow-dags/platform_eng@77cf676]: (no justification provided) (duration: 00m 08s) [21:42:49] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:42:53] (03PS1) 10Effie Mouzeli: conftool: Add more servers to the jobrunner problem [puppet] - 10https://gerrit.wikimedia.org/r/923426 (https://phabricator.wikimedia.org/T329366) [21:43:05] (03CR) 10CI reject: [V: 04-1] conftool: Add more servers to the jobrunner problem [puppet] - 10https://gerrit.wikimedia.org/r/923426 (https://phabricator.wikimedia.org/T329366) (owner: 10Effie Mouzeli) [21:43:53] (03PS2) 10Effie Mouzeli: conftool: Add more servers to the jobrunner problem [puppet] - 10https://gerrit.wikimedia.org/r/923426 (https://phabricator.wikimedia.org/T329366) [21:51:59] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:55:16] (03PS3) 
10Zabe: BannerRenderer: Make sure the language variant is valid [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427) (owner: 10Ladsgroup) [21:56:39] (03PS1) 10Zabe: Replace deprecated Hooks::runWithoutAbort [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923283 (https://phabricator.wikimedia.org/T335536) [21:57:02] (03PS4) 10Zabe: BannerRenderer: Make sure the language variant is valid [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427) (owner: 10Ladsgroup) [22:01:53] (03CR) 10Zabe: [C: 03+2] Replace deprecated Hooks::runWithoutAbort [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923283 (https://phabricator.wikimedia.org/T335536) (owner: 10Zabe) [22:02:07] (03CR) 10Zabe: [C: 03+2] BannerRenderer: Make sure the language variant is valid [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427) (owner: 10Ladsgroup) [22:04:22] (03Merged) 10jenkins-bot: Replace deprecated Hooks::runWithoutAbort [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923283 (https://phabricator.wikimedia.org/T335536) (owner: 10Zabe) [22:04:27] (03Merged) 10jenkins-bot: BannerRenderer: Make sure the language variant is valid [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427) (owner: 10Ladsgroup) [22:05:28] !log zabe@deploy1002 Started scap: Backport for [[gerrit:923283|Replace deprecated Hooks::runWithoutAbort (T335536)]], [[gerrit:923276|BannerRenderer: Make sure the language variant is valid (T337427)]] [22:05:34] T337427: LanguageConverter: Call to member function replace() on null - https://phabricator.wikimedia.org/T337427 [22:05:34] T335536: Hard deprecate class Hooks with all 
deprecated functions (and remove in 1.42) - https://phabricator.wikimedia.org/T335536 [22:06:59] !log zabe@deploy1002 zabe and ladsgroup: Backport for [[gerrit:923283|Replace deprecated Hooks::runWithoutAbort (T335536)]], [[gerrit:923276|BannerRenderer: Make sure the language variant is valid (T337427)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [22:10:07] (03PS8) 10BCornwall: Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) [22:13:03] (03CR) 10CI reject: [V: 04-1] Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [22:14:42] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:923283|Replace deprecated Hooks::runWithoutAbort (T335536)]], [[gerrit:923276|BannerRenderer: Make sure the language variant is valid (T337427)]] (duration: 09m 14s) [22:14:48] T337427: LanguageConverter: Call to member function replace() on null - https://phabricator.wikimedia.org/T337427 [22:14:49] T335536: Hard deprecate class Hooks with all deprecated functions (and remove in 1.42) - https://phabricator.wikimedia.org/T335536 [22:19:03] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hdfs_rsync_analytics_hadoop_published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:27:42] (03PS1) 10EoghanGaffney: Apply puppet role to new releases hosts [puppet] - 10https://gerrit.wikimedia.org/r/923429 [22:30:21] PROBLEM - DPKG on stat1007 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [22:31:33] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:31:40] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41359/console" [puppet] - 10https://gerrit.wikimedia.org/r/923429 (owner: 10EoghanGaffney) [22:34:34] (03CR) 10Dzahn: [C: 03+1] Apply puppet role to new releases hosts [puppet] - 10https://gerrit.wikimedia.org/r/923429 (owner: 10EoghanGaffney) [22:38:23] (03CR) 10Dzahn: [C: 03+1] "looks good to me in compiler: https://puppet-compiler.wmflabs.org/output/921244/41360/doc2002.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/921244 (owner: 10EoghanGaffney) [22:44:50] (03CR) 10Jbond: Create cookbook to upgrade Apache Traffic Server (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [22:45:27] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41362/console" [puppet] - 10https://gerrit.wikimedia.org/r/923429 (owner: 10EoghanGaffney) [22:53:14] (03CR) 10Cwhite: [C: 03+1] mwlog: remove redis instance [puppet] - 10https://gerrit.wikimedia.org/r/923348 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron) [22:55:53] (03CR) 10Jbond: Create cookbook to upgrade Apache Traffic Server (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [22:58:07] (03CR) 10Dzahn: [C: 03+2] miscweb: set ipv4 and ipv6 for 15 and annual blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/923342 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [23:00:07] (03CR) 10Dzahn: [C: 03+1] "maybe disable puppet on all 4 releases* hosts, stop rsyncd on all 4 hosts, then merge this, double check the timers it creates.. 
then enab" [puppet] - 10https://gerrit.wikimedia.org/r/923429 (owner: 10EoghanGaffney) [23:00:55] RECOVERY - DPKG on stat1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [23:02:07] (03PS1) 10Dzahn: Revert "miscweb: set ipv4 and ipv6 for 15 and annual blackbox check" [puppet] - 10https://gerrit.wikimedia.org/r/923284 [23:02:49] (03CR) 10Dzahn: "Info: Retrieving locales" [puppet] - 10https://gerrit.wikimedia.org/r/923284 (owner: 10Dzahn) [23:03:31] (03CR) 10Dzahn: [C: 03+2] "unfortunately this fails because there is no IPv6 AAAA record for discovery names" [puppet] - 10https://gerrit.wikimedia.org/r/923342 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [23:06:28] (03CR) 10Dzahn: [C: 03+2] Revert "miscweb: set ipv4 and ipv6 for 15 and annual blackbox check" [puppet] - 10https://gerrit.wikimedia.org/r/923284 (owner: 10Dzahn) [23:32:33] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T337451 (10phaultfinder) [23:46:10] (03PS1) 10Andrew Bogott: Openstack trove hacks: update a patch to match the upstream patch in progress at [puppet] - 10https://gerrit.wikimedia.org/r/923436 [23:46:36] (03CR) 10CI reject: [V: 04-1] Openstack trove hacks: update a patch to match the upstream patch in progress at [puppet] - 10https://gerrit.wikimedia.org/r/923436 (owner: 10Andrew Bogott) [23:47:35] (03PS2) 10Andrew Bogott: Openstack trove hacks: update a patch [puppet] - 10https://gerrit.wikimedia.org/r/923436 [23:48:01] (03CR) 10CI reject: [V: 04-1] Openstack trove hacks: update a patch [puppet] - 10https://gerrit.wikimedia.org/r/923436 (owner: 10Andrew Bogott) [23:57:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale