[00:18:10] <icinga-wm>	 PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:28] <icinga-wm>	 RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:39:40] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/922536
[00:39:46] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/922536 (owner: TrainBranchBot)
[00:41:48] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:41:50] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:43:00] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:47:50] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:58:14] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:58:30] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/922536 (owner: TrainBranchBot)
[01:03:10] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:03:30] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:05:30] <icinga-wm>	 PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:05:46] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:06:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:19:36] <icinga-wm>	 PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hdfs_rsync_analytics_hadoop_published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:22:26] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[04:23:52] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[04:30:28] <icinga-wm>	 RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:44:15] <wikibugs>	 (PS2) KartikMistry: Update cxserver to 2023-05-24-115506-production [deployment-charts] - https://gerrit.wikimedia.org/r/922826 (https://phabricator.wikimedia.org/T337290)
[05:11:16] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s1 on db1154 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table enwiki.user_properties: Cant find record in user_properties, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1196-bin.001099, end_log_pos 654625806 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:11:44] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s2 on db1155 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table plwiki.user_properties: Cant find record in user_properties, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1156-bin.003729, end_log_pos 633898246 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:16:56] <icinga-wm>	 PROBLEM - Host an-worker1125 is DOWN: PING CRITICAL - Packet loss = 100%
[05:17:50] <icinga-wm>	 PROBLEM - Host db2110 #page is DOWN: PING CRITICAL - Packet loss = 100%
[05:18:18] <marostegui>	 checking 
[05:19:02] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s7 on db1155 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table arwiki.pagelinks: Cant find record in pagelinks, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1158-bin.004706, end_log_pos 225760117 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:19:18] <icinga-wm>	 RECOVERY - Host db2110 #page is UP: PING WARNING - Packet loss = 66%, RTA = 31.64 ms
[05:19:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2110', diff saved to https://phabricator.wikimedia.org/P48503 and previous config saved to /var/cache/conftool/dbconfig/20230525-051923-root.json
[05:20:12] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s5 on db1154 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table dewiki.flaggedpage_pending: Cant find record in flaggedpage_pending, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1161-bin.001646, end_log_pos 385492288 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:20:14] <kart_>	 marostegui: Can I deploy cxserver?
[05:20:19] <marostegui>	 kart_: yes
[05:20:35] <kart_>	 marostegui: Thanks
[05:20:42] <marostegui>	 There's something wrong also with sanitarium db1154
[05:21:07] <Amir1>	 I'll deal with db1154
[05:21:14] <Amir1>	 I think I know what's going on
[05:21:44] <wikibugs>	 (PS1) Marostegui: db2110: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/922958
[05:21:48] <marostegui>	 Amir1: what is going on?
[05:22:00] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db1154 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 920.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:22:11] <wikibugs>	 (CR) Marostegui: [C: +2] db2110: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/922958 (owner: Marostegui)
[05:22:17] <Amir1>	 I think it's flaggedrevs schema drift in sanitarium 
[05:22:19] <wikibugs>	 (CR) KartikMistry: [C: +2] Update cxserver to 2023-05-24-115506-production [deployment-charts] - https://gerrit.wikimedia.org/r/922826 (https://phabricator.wikimedia.org/T337290) (owner: KartikMistry)
[05:22:28] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s2 on clouddb1018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 909.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:22:32] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s4 on db2110 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:22:40] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 959.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:22:42] <marostegui>	 Amir1: what I briefly saw on db1154:3311 was related to enwiki.user_properties
[05:22:44] <icinga-wm>	 PROBLEM - mysqld processes on db2110 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[05:22:44] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on clouddb1013 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 963.98 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:22:48] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 968.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:22:58] <icinga-wm>	 PROBLEM - MariaDB read only s4 on db2110 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[05:23:00] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2023-05-24-115506-production [deployment-charts] - https://gerrit.wikimedia.org/r/922826 (https://phabricator.wikimedia.org/T337290) (owner: KartikMistry)
[05:23:06] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s2 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 945.57 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:23:08] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s2 on clouddb1014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 947.51 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:23:27] <Amir1>	 I got this PROBLEM - MariaDB Replica SQL: s5 on db1154 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table dewiki.flaggedpage_pending: Cant find record in flaggedpage_pending, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1161-bin.001646, end_log_pos 385492288
[05:23:30] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s2 on db1155 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 970.80 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:24:01] <Amir1>	 PROBLEM - MariaDB read only s4 on db2110 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[05:24:29] <marostegui>	 I am with db2110
[05:24:32] <marostegui>	 https://phabricator.wikimedia.org/T337445
[05:24:47] <Amir1>	 okay, it's quite noisy sigh
[05:24:52] <marostegui>	 I just downtimed
[05:25:06] <wikibugs>	 (PS1) KartikMistry: Revert "Update cxserver to 2023-05-24-115506-production" [deployment-charts] - https://gerrit.wikimedia.org/r/922857
[05:25:36] <marostegui>	 !incidents
[05:25:36] <sirenbot>	 3678 (ACKED)  Host db2110 (paged) - PING  - Packet loss = 100%
[05:25:53] <wikibugs>	 (CR) KartikMistry: [C: +2] Revert "Update cxserver to 2023-05-24-115506-production" [deployment-charts] - https://gerrit.wikimedia.org/r/922857 (owner: KartikMistry)
[05:26:33] <wikibugs>	 (Merged) jenkins-bot: Revert "Update cxserver to 2023-05-24-115506-production" [deployment-charts] - https://gerrit.wikimedia.org/r/922857 (owner: KartikMistry)
[05:26:33] <Amir1>	 so back to db1154/db1155 all of them seem to be trying to delete a row and not finding it
[05:27:07] <Amir1>	 not one row, a different row in each section, different tables too, s7 is pagelinks
[05:27:32] <Amir1>	 my guess is that somehow it got corrupted 
[05:28:06] <marostegui>	 yes
[05:28:18] <marostegui>	 I rebooted them yesterday for the kernel thing, which doesn't explain any of this
[05:28:27] <marostegui>	 But it is the most probable cause (still doesn't make sense)
[05:28:43] <marostegui>	 They probably need to be entirely rebuilt
[05:29:54] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2023-05-24-115506-production [deployment-charts] - https://gerrit.wikimedia.org/r/922959 (https://phabricator.wikimedia.org/T337290)
[05:30:02] <marostegui>	 I can try to fix those missing rows, but I am sure there will be more
[05:30:21] <marostegui>	 I think it is better to rebuild them and hope that clouddb* hosts are ok
[05:30:37] <marostegui>	 Can you create a task?
[05:31:08] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s7 on clouddb1018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 924.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:31:08] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s7 on db1155 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 924.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:31:20] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 917.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:31:28] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s7 on clouddb1014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 944.52 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:31:36] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s7 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 951.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:31:37] <Amir1>	 sure
[05:32:02] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 958.74 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:32:12] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on db1154 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 968.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:32:24] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on clouddb1020 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 980.83 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:32:34] <Amir1>	 on it
[05:33:06] <icinga-wm>	 RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:33:50] <wikibugs>	 (CR) KartikMistry: [C: +2] Update cxserver to 2023-05-24-115506-production [deployment-charts] - https://gerrit.wikimedia.org/r/922959 (https://phabricator.wikimedia.org/T337290) (owner: KartikMistry)
[05:34:31] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2023-05-24-115506-production [deployment-charts] - https://gerrit.wikimedia.org/r/922959 (https://phabricator.wikimedia.org/T337290) (owner: KartikMistry)
[05:35:08] <icinga-wm>	 RECOVERY - mysqld processes on db2110 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[05:35:24] <icinga-wm>	 RECOVERY - MariaDB read only s4 on db2110 is OK: Version 10.4.26-MariaDB-log, Uptime 46s, read_only: True, event_scheduler: True, 2635.16 QPS, connection latency: 0.004890s, query latency: 0.000560s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[05:36:00] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:36:20] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:36:28] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s4 on db2110 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:36:59] <Amir1>	 marostegui: T337446. Also, might it be the case that replication was somehow replayed twice?
[05:36:59] <stashbot>	 T337446: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446
[05:37:03] <Amir1>	 I'll go get coffee
[05:37:48] <icinga-wm>	 PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:41:09] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:41:39] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:46:41] <_joe_>	 jouncebot: nowandnext
[05:46:41] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 13 minute(s)
[05:46:41] <jouncebot>	 In 0 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T0600)
[05:46:41] <jouncebot>	 In 0 hour(s) and 13 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T0600)
[05:48:22] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:48:39] <_joe_>	 Amir1, marostegui: can I steal your window if you're not doing switchovers?
[05:48:47] <_joe_>	 I have something structural for mw on k8s
[05:48:51] <marostegui>	 yep
[05:48:55] <_joe_>	 thanks
[05:48:59] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:49:24] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mediawiki: update modules, enable named ports [deployment-charts] - https://gerrit.wikimedia.org/r/919058 (https://phabricator.wikimedia.org/T271822) (owner: Giuseppe Lavagetto)
[05:49:34] <wikibugs>	 (CR) CI reject: [V: -1] mediawiki: update modules, enable named ports [deployment-charts] - https://gerrit.wikimedia.org/r/919058 (https://phabricator.wikimedia.org/T271822) (owner: Giuseppe Lavagetto)
[05:49:42] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s5 on db1154 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table dewiki.flaggedpage_pending: Cant find record in flaggedpage_pending, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1161-bin.001646, end_log_pos 385621020 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:50:43] <wikibugs>	 (CR) Ladsgroup: [C: +1] "LGTM, we can merge it as is" [puppet] - https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705) (owner: Giuseppe Lavagetto)
[05:51:06] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s5 on clouddb1016 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Write_rows_v1 event on table dewiki.flaggedpage_pending: Duplicate entry 1225932-0 for key PRIMARY, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db1154-bin.001684, end_log_pos 803 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:51:06] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s5 on clouddb1020 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Write_rows_v1 event on table dewiki.flaggedpage_pending: Duplicate entry 1225932-0 for key PRIMARY, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db1154-bin.001684, end_log_pos 803 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:51:09] <marostegui>	 Amir1: can you downtime all wikireplicas?
[05:52:18] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s5 on clouddb1021 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Write_rows_v1 event on table dewiki.flaggedpage_pending: Duplicate entry 1225932-0 for key PRIMARY, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db1154-bin.001684, end_log_pos 803 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:52:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1161', diff saved to https://phabricator.wikimedia.org/P48504 and previous config saved to /var/cache/conftool/dbconfig/20230525-055236-root.json
[05:53:21] <wikibugs>	 (PS1) Marostegui: mariadb: Decommission db1154, db1161 [puppet] - https://gerrit.wikimedia.org/r/923154
[05:54:25] <Amir1>	 marostegui: sure on it
[05:54:58] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Decommission db1154, db1161 [puppet] - https://gerrit.wikimedia.org/r/923154 (owner: Marostegui)
[05:55:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 9 hosts with reason: T337446
[05:55:39] <stashbot>	 T337446: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446
[05:55:52] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 9 hosts with reason: T337446
[05:55:59] <Amir1>	 done
[05:57:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1156', diff saved to https://phabricator.wikimedia.org/P48506 and previous config saved to /var/cache/conftool/dbconfig/20230525-055734-root.json
[06:00:06] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T0600)
[06:00:06] <jouncebot>	 kormat, marostegui, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T0600).
[06:01:12] <wikibugs>	 (PS3) Giuseppe Lavagetto: mediawiki: enable named ports [deployment-charts] - https://gerrit.wikimedia.org/r/919058 (https://phabricator.wikimedia.org/T271822)
[06:05:09] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mediawiki: enable named ports [deployment-charts] - https://gerrit.wikimedia.org/r/919058 (https://phabricator.wikimedia.org/T271822) (owner: Giuseppe Lavagetto)
[06:06:05] <wikibugs>	 (Merged) jenkins-bot: mediawiki: enable named ports [deployment-charts] - https://gerrit.wikimedia.org/r/919058 (https://phabricator.wikimedia.org/T271822) (owner: Giuseppe Lavagetto)
[06:09:01] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[06:19:07] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s1 on db1154 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:41:56] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on clouddb1021 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:42:26] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on clouddb1020 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:42:32] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on clouddb1016 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:42:42] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s5 on clouddb1020 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:42:42] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s5 on clouddb1016 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:42:42] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s5 on clouddb1021 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:44:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1196', diff saved to https://phabricator.wikimedia.org/P48509 and previous config saved to /var/cache/conftool/dbconfig/20230525-064418-root.json
[06:44:39] <wikibugs>	 (PS11) Gmodena: mw-page-content-change-enrich: enable checkpointing [deployment-charts] - https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656)
[06:46:47] <wikibugs>	 (PS1) Marostegui: db1196: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/923159
[06:50:04] <wikibugs>	 (CR) Marostegui: [C: +2] db1196: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/923159 (owner: Marostegui)
[07:00:06] <jouncebot>	 Amir1, apergos, and jnuche: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T0700).
[07:00:06] <jouncebot>	 matthiasmullie: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:11] <apergos>	 morning! there are no trainees signed up today, and one developer with two patchsets in the window for deployment. matthiasmullie: do you usually self-deploy or should we deploy for you? Sorry that I ask this every time...
[07:01:00] <matthiasmullie>	 o/
[07:01:07] <matthiasmullie>	 I can self-deploy
[07:01:11] <apergos>	 ok
[07:01:18] <apergos>	 the first patch seems straightforward enough
[07:01:30] <apergos>	 I have a couple questions about the second one since it touches a bunch of files
[07:01:38] <matthiasmullie>	 Sure!
[07:02:08] <apergos>	 does the maintenance script that is being changed run periodically, and will the deploy be during a time when it's not liable to start running?
[07:02:18] <wikibugs>	 (PS2) Matthias Mullie: [WikibaseMediaInfo] Add 'main subject of' property [mediawiki-config] - https://gerrit.wikimedia.org/r/921561
[07:02:23] <RhinosF1>	 matthiasmullie: re the 2nd script, does it need to be on wmf.10 too?
[07:02:50] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by mlitn@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/921561 (owner: Matthias Mullie)
[07:03:28] <wikibugs>	 (Merged) jenkins-bot: [WikibaseMediaInfo] Add 'main subject of' property [mediawiki-config] - https://gerrit.wikimedia.org/r/921561 (owner: Matthias Mullie)
[07:04:20] <logmsgbot>	 !log mlitn@deploy1002 Started scap: Backport for [[gerrit:921561|[WikibaseMediaInfo] Add 'main subject of' property]]
[07:04:36] <matthiasmullie>	 apergos: it runs weekly, on Wed morning; it is not running and will not until next Wed (see https://gerrit.wikimedia.org/g/operations/puppet/+/production/modules/profile/manifests/mediawiki/maintenance/image_suggestions.pp)
[07:05:02] <apergos>	 gotcha
[07:05:22] <apergos>	 and do you have a good method to test it on the mwdebug hosts as well as after the scap completes on the production cluster?
[07:06:01] <logmsgbot>	 !log mlitn@deploy1002 mlitn: Backport for [[gerrit:921561|[WikibaseMediaInfo] Add 'main subject of' property]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[07:06:15] <_joe_>	 matthiasmullie: can you stop there for a sec?
[07:06:25] <matthiasmullie>	 _joe_: yes
[07:06:26] <_joe_>	 I think I need to unlock k8s deployments for you
[07:06:31] <_joe_>	 give me 3-4 minutes
[07:06:39] <matthiasmullie>	 sure!
[07:07:03] <matthiasmullie>	 RhinosF1: wmf.10 is not urgent; we simply need to "test" it on prod data
[07:07:38] <_joe_>	 matthiasmullie: you can run your script manually on the mwdebug servers using mwscript IIRC
[07:07:54] <RhinosF1>	 matthiasmullie: but by next Wednesday, wmf.10 is already going to be everywhere when it actually runs
[07:08:39] <matthiasmullie>	 apergos: I was planning to run the script manually; there are params that will output all relevant data (--verbose) while being a no-op (--quiet)
[07:08:57] <apergos>	 great! that does it for me
[07:10:06] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[07:10:46] <matthiasmullie>	 RhinosF1: yes, that is fine; the code currently on wmf.9 and wmf.10 is fine, and the new patch only changes how it works internally (process things via job queue) - I want to test the new patch (on wmf.9). If/once it appears it all is working well, I can either submit another backport for wmf.10, or skip that backport altogether (because the current code will still be fine)
[07:10:56] <_joe_>	 matthiasmullie: please proceed
[07:11:00] <matthiasmullie>	 _joe_: rgr, thanks
[07:12:15] <RhinosF1>	 matthiasmullie: makes sense
[07:15:51] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s2 on clouddb1021 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:15:57] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s2 on clouddb1014 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:16:17] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s2 on clouddb1018 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:16:41] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s2 on db1155 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:16:45] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s2 on db1155 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:16:52] <wikibugs>	 SRE, MW-on-K8s, observability, serviceops: Add support for scraping php applications to the kubernetes prometheus scraper - https://phabricator.wikimedia.org/T271822 (Joe) Open→Resolved
[07:16:55] <wikibugs>	 SRE, Patch-For-Review, Service-deployment-requests: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423 (Joe)
[07:17:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1158', diff saved to https://phabricator.wikimedia.org/P48511 and previous config saved to /var/cache/conftool/dbconfig/20230525-071719-root.json
[07:17:54] <wikibugs>	 (PS1) Marostegui: db1158: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/923243
[07:18:21] <wikibugs>	 (CR) Marostegui: [C: +2] db1158: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/923243 (owner: Marostegui)
[07:18:23] <logmsgbot>	 !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:921561|[WikibaseMediaInfo] Add 'main subject of' property]] (duration: 14m 02s)
[07:19:51] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by mlitn@deploy1002 using scap backport" [extensions/ImageSuggestions] (wmf/1.41.0-wmf.9) - https://gerrit.wikimedia.org/r/922853 (https://phabricator.wikimedia.org/T322872) (owner: Matthias Mullie)
[07:25:29] <wikibugs>	 (PS1) Marostegui: db1155: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/923244
[07:26:05] <wikibugs>	 (CR) Marostegui: [C: +2] db1155: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/923244 (owner: Marostegui)
[07:34:03] <wikibugs>	 (PS1) Slyngshede: C:IDM Ensure service restart on git update [puppet] - https://gerrit.wikimedia.org/r/923245
[07:34:21] <wikibugs>	 (CR) Volans: [C: -1] "There's a typo. I've left also a suggestion and a question inline." [puppet] - https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705) (owner: Giuseppe Lavagetto)
[07:35:15] <wikibugs>	 (Merged) jenkins-bot: Change maint script to do work via jobs [extensions/ImageSuggestions] (wmf/1.41.0-wmf.9) - https://gerrit.wikimedia.org/r/922853 (https://phabricator.wikimedia.org/T322872) (owner: Matthias Mullie)
[07:35:45] <logmsgbot>	 !log mlitn@deploy1002 Started scap: Backport for [[gerrit:922853|Change maint script to do work via jobs (T322872)]]
[07:35:50] <stashbot>	 T322872: [L] Change how we send image-suggestions notifications to experienced users - https://phabricator.wikimedia.org/T322872
[07:37:16] <logmsgbot>	 !log mlitn@deploy1002 mlitn: Backport for [[gerrit:922853|Change maint script to do work via jobs (T322872)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[07:45:09] <wikibugs>	 (03PS2) 10Slyngshede: C:IDM Ensure service restart on git update [puppet] - 10https://gerrit.wikimedia.org/r/923245
[07:51:56] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki: fix deployment annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/923247
[07:51:57] <logmsgbot>	 !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:922853|Change maint script to do work via jobs (T322872)]] (duration: 16m 12s)
[07:52:02] <stashbot>	 T322872: [L] Change how we send image-suggestions notifications to experienced users - https://phabricator.wikimedia.org/T322872
[07:52:41] <matthiasmullie>	 !log UTC morning backports done
[07:52:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:26] <apergos>	 ah, good on production too? great!
[07:55:59] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1014 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:57:34] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T337451 (10phaultfinder)
[08:02:14] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "Makes sense to remove the blackbox check from the legacy puppet code for now. According to the prometheus logs blackbox monitor still conn" [puppet] - 10https://gerrit.wikimedia.org/r/922918 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn)
[08:03:10] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] microsites: remove http blackbox monitor for 15.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/922918 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn)
[08:03:24] <wikibugs>	 (03PS1) 10Ayounsi: Add local config files to .gitignore [software/spicerack] - 10https://gerrit.wikimedia.org/r/923249
[08:04:27] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1014 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:06:51] <wikibugs>	 (03CR) 10Volans: Add local config files to .gitignore (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/923249 (owner: 10Ayounsi)
[08:08:35] <wikibugs>	 (03PS4) 10Fabfur: Add a new cookbook that allows to run puppet configuration while restarting Varnish [cookbooks] - 10https://gerrit.wikimedia.org/r/922844 (https://phabricator.wikimedia.org/T323557)
[08:11:21] <wikibugs>	 10SRE, 10Traffic, 10envoy, 10serviceops: Remove tls_minimum_protocol_version from envoy config - https://phabricator.wikimedia.org/T337453 (10JMeybohm)
[08:11:37] <wikibugs>	 10SRE, 10Traffic, 10envoy, 10serviceops: Remove tls_minimum_protocol_version from envoy config - https://phabricator.wikimedia.org/T337453 (10JMeybohm) p:05Triage→03Low
[08:13:37] <wikibugs>	 (03PS2) 10Ayounsi: Add local config files to .gitignore [software/spicerack] - 10https://gerrit.wikimedia.org/r/923249
[08:14:04] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/923249 (owner: 10Ayounsi)
[08:14:25] <wikibugs>	 (03CR) 10Fabfur: Add a new cookbook that allows to run puppet configuration while restarting Varnish (0311 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/922844 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur)
[08:15:29] <wikibugs>	 (03CR) 10Fabfur: Add a new cookbook that allows to run puppet configuration while restarting Varnish (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/922844 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur)
[08:16:31] <wikibugs>	 (03PS1) 10Ayounsi: Add the plugins directory to .gitignore [software/homer] - 10https://gerrit.wikimedia.org/r/923251
[08:17:09] <wikibugs>	 (03PS3) 10Slyngshede: C:IDM Ensure service restart on git update [puppet] - 10https://gerrit.wikimedia.org/r/923245
[08:17:38] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "It's already there as homer_plugins ;)" [software/homer] - 10https://gerrit.wikimedia.org/r/923251 (owner: 10Ayounsi)
[08:18:01] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Add local config files to .gitignore (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/923249 (owner: 10Ayounsi)
[08:18:18] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41331/console" [puppet] - 10https://gerrit.wikimedia.org/r/923245 (owner: 10Slyngshede)
[08:18:39] <wikibugs>	 (03Abandoned) 10Ayounsi: Add the plugins directory to .gitignore [software/homer] - 10https://gerrit.wikimedia.org/r/923251 (owner: 10Ayounsi)
[08:19:46] <wikibugs>	 (03PS1) 10Matthias Mullie: Change maint script to do work via jobs [extensions/ImageSuggestions] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923252
[08:22:17] <wikibugs>	 (03Merged) 10jenkins-bot: Add local config files to .gitignore [software/spicerack] - 10https://gerrit.wikimedia.org/r/923249 (owner: 10Ayounsi)
[08:27:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] profile: ensure varnish-aggregate-client-status-codes absent [puppet] - 10https://gerrit.wikimedia.org/r/922534 (https://phabricator.wikimedia.org/T288196) (owner: 10Cwhite)
[08:27:18] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/922534 (https://phabricator.wikimedia.org/T288196) (owner: 10Cwhite)
[08:32:01] <elukey>	 !log revoke kafka_mirror_maker TLS cert (cergen based), remove old cergen certs from puppet private - T337248
[08:32:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:06] <stashbot>	 T337248: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248
[08:42:53] <wikibugs>	 (03PS3) 10Cathal Mooney: Add class-of-service parent interface shaper for sub-rated services [homer/public] - 10https://gerrit.wikimedia.org/r/922603 (https://phabricator.wikimedia.org/T337220)
[08:43:50] <wikibugs>	 (03PS4) 10Slyngshede: C:IDM Ensure service restart on git update [puppet] - 10https://gerrit.wikimedia.org/r/923245
[08:45:06] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Add network devices fingerprints to known_hosts - https://phabricator.wikimedia.org/T327643 (10jbond) > Netbox would be better. +1 this would also allow us to have them in the netbox-hiera pipeline which in turn makes it easier to add them all to...
[08:47:30] <wikibugs>	 (03CR) 10Slyngshede: "Rename a service to align with how we name similar services in other projects." [puppet] - 10https://gerrit.wikimedia.org/r/923245 (owner: 10Slyngshede)
[08:48:29] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1196: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/923267
[08:48:58] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db1196: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/923267 (owner: 10Marostegui)
[08:49:10] <wikibugs>	 (03PS4) 10Cathal Mooney: Add class-of-service parent interface shaper for sub-rated services [homer/public] - 10https://gerrit.wikimedia.org/r/922603 (https://phabricator.wikimedia.org/T337220)
[08:49:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48513 and previous config saved to /var/cache/conftool/dbconfig/20230525-084912-root.json
[08:49:28] <wikibugs>	 (03PS2) 10Jelto: trafficserver: switch annual.wikimedia.org backend [puppet] - 10https://gerrit.wikimedia.org/r/922791 (https://phabricator.wikimedia.org/T337041)
[08:49:54] <wikibugs>	 10SRE, 10Traffic, 10envoy, 10serviceops: Remove tls_minimum_protocol_version from envoy config - https://phabricator.wikimedia.org/T337453 (10Joe) It would be great if envoy fixed the TLS 1.3 to work well when two envoys talk to each other - we should check if that's been solved in the latest versions.
[08:52:47] <wikibugs>	 (03CR) 10Jbond: puppetmaster: add new function to check for local files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922877 (https://phabricator.wikimedia.org/T268344) (owner: 10Jbond)
[08:53:13] <wikibugs>	 (03PS1) 10Elukey: profile::kafka::{broker,mirror}: simplify dependencies [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248)
[08:53:33] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] trafficserver: switch annual.wikimedia.org backend [puppet] - 10https://gerrit.wikimedia.org/r/922791 (https://phabricator.wikimedia.org/T337041) (owner: 10Jelto)
[08:53:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] profile::kafka::{broker,mirror}: simplify dependencies [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey)
[08:57:31] <wikibugs>	 (03PS1) 10KartikMistry: Show Contribute menu item in main menu when Special:Contribute is enabled [skins/MinervaNeue] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/923268 (https://phabricator.wikimedia.org/T336838)
[08:58:15] <wikibugs>	 (03PS1) 10KartikMistry: Show Contribute menu item in main menu when Special:Contribute is enabled [skins/MinervaNeue] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923269 (https://phabricator.wikimedia.org/T336838)
[08:59:06] <jinxer-wm>	 (ProbeDown) resolved: Service miscweb2003:443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:59:26] <wikibugs>	 (03PS1) 10Btullis: Revert "Re-enable an-test-worker1001 in the analytics_test_cluster" [puppet] - 10https://gerrit.wikimedia.org/r/923270
[09:00:02] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Revert "Re-enable an-test-worker1001 in the analytics_test_cluster" [puppet] - 10https://gerrit.wikimedia.org/r/923270 (owner: 10Btullis)
[09:04:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48514 and previous config saved to /var/cache/conftool/dbconfig/20230525-090417-root.json
[09:06:19] <wikibugs>	 (03PS2) 10Elukey: profile::kafka::{broker,mirror}: simplify dependencies [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248)
[09:06:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] profile::kafka::{broker,mirror}: simplify dependencies [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey)
[09:08:54] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Make db2179 candidate master for s4 [puppet] - 10https://gerrit.wikimedia.org/r/923261 (https://phabricator.wikimedia.org/T337445)
[09:09:20] <wikibugs>	 (03CR) 10Jbond: "see comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/923245 (owner: 10Slyngshede)
[09:10:36] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host dbproxy1026.mgmt.eqiad.wmnet with reboot policy FORCED
[09:11:02] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Make db2179 candidate master for s4 [puppet] - 10https://gerrit.wikimedia.org/r/923261 (https://phabricator.wikimedia.org/T337445) (owner: 10Marostegui)
[09:11:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2179', diff saved to https://phabricator.wikimedia.org/P48515 and previous config saved to /var/cache/conftool/dbconfig/20230525-091132-root.json
[09:14:53] <wikibugs>	 (03CR) 10Jbond: "thanks see inline" [puppet] - 10https://gerrit.wikimedia.org/r/922915 (https://phabricator.wikimedia.org/T248872) (owner: 10Jbond)
[09:17:02] <wikibugs>	 (03PS3) 10Jbond: admin: Re-enable hghani [puppet] - 10https://gerrit.wikimedia.org/r/922799 (https://phabricator.wikimedia.org/T322145)
[09:17:35] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:17:35] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:17:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] admin: Re-enable hghani [puppet] - 10https://gerrit.wikimedia.org/r/922799 (https://phabricator.wikimedia.org/T322145) (owner: 10Jbond)
[09:19:06] <wikibugs>	 (03PS4) 10Jbond: admin: Re-enable hghani [puppet] - 10https://gerrit.wikimedia.org/r/922799 (https://phabricator.wikimedia.org/T322145)
[09:19:14] <wikibugs>	 (03PS1) 10Jelto: service::catalog add miscweb 15 and annual to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/923263 (https://phabricator.wikimedia.org/T300171)
[09:19:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48516 and previous config saved to /var/cache/conftool/dbconfig/20230525-091922-root.json
[09:19:23] <wikibugs>	 (03PS1) 10Marostegui: db2172: Remove candidate master [puppet] - 10https://gerrit.wikimedia.org/r/923265
[09:19:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] service::catalog add miscweb 15 and annual to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/923263 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto)
[09:19:50] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] admin: Re-enable hghani [puppet] - 10https://gerrit.wikimedia.org/r/922799 (https://phabricator.wikimedia.org/T322145) (owner: 10Jbond)
[09:21:05] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] add documentation on commands to run for testing new dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn)
[09:21:14] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db2172: Remove candidate master [puppet] - 10https://gerrit.wikimedia.org/r/923265 (owner: 10Marostegui)
[09:21:28] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] doc: allow gitlab runners to publish docs only through `doc-gitlab` [puppet] - 10https://gerrit.wikimedia.org/r/922834 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche)
[09:21:37] <marostegui>	 apergos: good to merge your changes?
[09:21:41] <apergos>	 yes please
[09:21:48] <marostegui>	 done!
[09:21:51] <apergos>	 ty!
[09:22:00] <wikibugs>	 (03PS3) 10Elukey: profile::kafka::{broker,mirror}: simplify dependencies [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248)
[09:23:28] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41332/console" [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey)
[09:23:30] <wikibugs>	 (03PS5) 10Slyngshede: C:IDM Ensure service restart on git update [puppet] - 10https://gerrit.wikimedia.org/r/923245
[09:23:50] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:IDM Ensure service restart on git update [puppet] - 10https://gerrit.wikimedia.org/r/923245 (owner: 10Slyngshede)
[09:24:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48517 and previous config saved to /var/cache/conftool/dbconfig/20230525-092413-root.json
[09:24:24] <wikibugs>	 (03PS2) 10Jelto: service::catalog add miscweb 15 and annual to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/923263 (https://phabricator.wikimedia.org/T300171)
[09:24:53] <wikibugs>	 (03PS6) 10Slyngshede: C:IDM Ensure service restart on git update [puppet] - 10https://gerrit.wikimedia.org/r/923245
[09:25:35] <wikibugs>	 (03PS1) 10ArielGlenn: Dumps: move the nfs share test conf to the right location [puppet] - 10https://gerrit.wikimedia.org/r/923289 (https://phabricator.wikimedia.org/T325232)
[09:26:02] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: profile::configmaster:  dump a json data structure of the pools [puppet] - 10https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705)
[09:26:12] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: profile::configmaster:  dump a json data structure of the pools (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705) (owner: 10Giuseppe Lavagetto)
[09:26:14] <wikibugs>	 (03CR) 10Slyngshede: "I think I got it :-)" [puppet] - 10https://gerrit.wikimedia.org/r/923245 (owner: 10Slyngshede)
[09:27:11] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s7 on clouddb1018 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:27:11] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s7 on clouddb1014 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:27:39] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s7 on clouddb1021 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:28:34] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] profile::configmaster:  dump a json data structure of the pools [puppet] - 10https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705) (owner: 10Giuseppe Lavagetto)
[09:29:15] <wikibugs>	 (03PS1) 10Marostegui: db1161: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/923290
[09:32:22] <wikibugs>	 10SRE, 10Traffic, 10envoy, 10serviceops: Remove tls_minimum_protocol_version from envoy config - https://phabricator.wikimedia.org/T337453 (10JMeybohm) >>! In T337453#8879233, @Joe wrote: > It would be great if envoy fixed the TLS 1.3 to work well when two envoys talk to each other - we should check if tha...
[09:32:56] <apergos>	 !log running from dumpsdata1004 via ariel login screen session, as root, rsync with bwlimit 100000 to dumpsdata1006, copying all public xml dumps data
[09:32:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:59] <wikibugs>	 (03CR) 10Jbond: gitlab: use sshkey for git-ssh public keys (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto)
[09:34:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48518 and previous config saved to /var/cache/conftool/dbconfig/20230525-093426-root.json
[09:35:22] <wikibugs>	 (03PS5) 10Jbond: admin: Re-enable hghani [puppet] - 10https://gerrit.wikimedia.org/r/922799 (https://phabricator.wikimedia.org/T322145)
[09:37:12] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Add class-of-service parent interface shaper for sub-rated services (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/922603 (https://phabricator.wikimedia.org/T337220) (owner: 10Cathal Mooney)
[09:39:09] <wikibugs>	 (03PS3) 10Jelto: service::catalog add miscweb 15 and annual to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/923263 (https://phabricator.wikimedia.org/T300171)
[09:39:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 2%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48519 and previous config saved to /var/cache/conftool/dbconfig/20230525-093918-root.json
[09:40:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:41:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] admin: Re-enable hghani [puppet] - 10https://gerrit.wikimedia.org/r/922799 (https://phabricator.wikimedia.org/T322145) (owner: 10Jbond)
[09:43:02] <wikibugs>	 (03PS1) 10KartikMistry: Update cxserver to 2023-05-25-093623-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/923291 (https://phabricator.wikimedia.org/T331201)
[09:44:26] <kart_>	 Is it OK to deploy fix for cxserver ^ marostegui 
[09:44:37] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: DHCP traffic to install server is missing - https://phabricator.wikimedia.org/T337345 (10cmooney) >>! In T337345#8878207, @Jclark-ctr wrote: > @ayounsi  the provisioning script is still failing in row e/f.  dbproxy1026 dbproxy1027  I tested there...
[09:44:44] <marostegui>	 kart_: yep
[09:44:52] <wikibugs>	 (03PS1) 10AikoChou: ml-services: update docker images for outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/923292 (https://phabricator.wikimedia.org/T328899)
[09:45:02] <kart_>	 Thanks!
[09:45:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:45:35] <wikibugs>	 (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-05-25-093623-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/923291 (https://phabricator.wikimedia.org/T331201) (owner: 10KartikMistry)
[09:46:27] <wikibugs>	 (03Merged) 10jenkins-bot: Update cxserver to 2023-05-25-093623-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/923291 (https://phabricator.wikimedia.org/T331201) (owner: 10KartikMistry)
[09:47:55] <wikibugs>	 (03CR) 10Clément Goubert: api-gateway: Switch to mw-api-int-async on k8s (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) (owner: 10Clément Goubert)
[09:48:11] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[09:48:21] <wikibugs>	 (03PS1) 10Jbond: admin: add email for hghani [puppet] - 10https://gerrit.wikimedia.org/r/923293
[09:48:31] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[09:48:39] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] admin: add email for hghani [puppet] - 10https://gerrit.wikimedia.org/r/923293 (owner: 10Jbond)
[09:49:02] <wikibugs>	 (03PS8) 10Clément Goubert: api-gateway: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065)
[09:49:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48520 and previous config saved to /var/cache/conftool/dbconfig/20230525-094931-root.json
[09:50:15] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10jbond) 05Open→03Resolved a:05CDanis→03jbond Access has now been configured and you should have received an email regarding K...
[09:51:25] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[09:51:59] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[09:52:47] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1161: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/923290 (owner: 10Marostegui)
[09:53:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48521 and previous config saved to /var/cache/conftool/dbconfig/20230525-095341-root.json
[09:54:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48522 and previous config saved to /var/cache/conftool/dbconfig/20230525-095423-root.json
[09:56:58] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[09:57:25] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm some minor nits" [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey)
[09:57:35] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[09:58:14] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/923245 (owner: 10Slyngshede)
[10:00:06] <jouncebot>	 mvolz: Dear deployers, time to do the Services – Citoid / Zotero deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T1000).
[10:00:06] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T1000)
[10:00:09] <kart_>	 !log Updated cxserver to 2023-05-25-093623-production (config: language pairs transform fix + T331201)
[10:00:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:16] <stashbot>	 T331201: Extract cxserver configuration and export to CSV - https://phabricator.wikimedia.org/T331201
[10:00:19] <wikibugs>	 (03PS4) 10Elukey: profile::kafka::{broker,mirror}: simplify dependencies [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248)
[10:00:44] <wikibugs>	 (03CR) 10Elukey: "Thanks for the review John! Fixed the nits!" [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey)
[10:01:37] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41333/console" [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey)
[10:01:49] <wikibugs>	 (03PS2) 10EoghanGaffney: Changes from hard-coded list of hosts in doc module [puppet] - 10https://gerrit.wikimedia.org/r/921244
[10:03:16] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41334/console" [puppet] - 10https://gerrit.wikimedia.org/r/921244 (owner: 10EoghanGaffney)
[10:04:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48523 and previous config saved to /var/cache/conftool/dbconfig/20230525-100436-root.json
[10:08:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48524 and previous config saved to /var/cache/conftool/dbconfig/20230525-100846-root.json
[10:09:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48525 and previous config saved to /var/cache/conftool/dbconfig/20230525-100927-root.json
[10:16:41] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:16:41] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:19:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48526 and previous config saved to /var/cache/conftool/dbconfig/20230525-101940-root.json
[10:20:56] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover: Investigate failed maintenance jobs discovered during DC switchback - https://phabricator.wikimedia.org/T335409 (10Clement_Goubert) p:05Triage→03Medium
[10:23:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48527 and previous config saved to /var/cache/conftool/dbconfig/20230525-102351-root.json
[10:24:25] <wikibugs>	 (03CR) 10Abijeet Patro: [C: 04-1] ttm: use new config option to separate readable and writable services (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922481 (https://phabricator.wikimedia.org/T322284) (owner: 10DCausse)
[10:24:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48528 and previous config saved to /var/cache/conftool/dbconfig/20230525-102434-root.json
[10:24:47] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudcontrol2005-dev.wikimedia.org
[10:28:06] <wikibugs>	 (03PS3) 10EoghanGaffney: Send nginx and docker-registry logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/919351 (https://phabricator.wikimedia.org/T322579)
[10:28:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:32:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey)
[10:32:48] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.dns.netbox
[10:33:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:34:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48529 and previous config saved to /var/cache/conftool/dbconfig/20230525-103445-root.json
[10:35:54] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::kafka::{broker,mirror}: simplify dependencies [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey)
[10:36:21] <wikibugs>	 (03CR) 10Klausman: ml-services: update docker images for outlink (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/923292 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou)
[10:38:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48530 and previous config saved to /var/cache/conftool/dbconfig/20230525-103855-root.json
[10:39:01] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol2005-dev.wikimedia.org decommissioned, removing all IPs except the asset tag one - aborrero@cumin2002"
[10:39:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48531 and previous config saved to /var/cache/conftool/dbconfig/20230525-103939-root.json
[10:39:41] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] helmfile.d: Fix regex in api-gateway's config for revertrisk [deployment-charts] - 10https://gerrit.wikimedia.org/r/922805 (https://phabricator.wikimedia.org/T337378) (owner: 10Klausman)
[10:41:44] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol2005-dev: move to the new network setup [puppet] - 10https://gerrit.wikimedia.org/r/923301 (https://phabricator.wikimedia.org/T336564)
[10:41:53] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol2005-dev.wikimedia.org decommissioned, removing all IPs except the asset tag one - aborrero@cumin2002"
[10:41:53] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:41:54] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol2005-dev.wikimedia.org
[10:44:37] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] "The only side effect that I observed was:" [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey)
[10:45:32] <wikibugs>	 10SRE, 10Data-Engineering, 10Shared-Data-Infrastructure, 10serviceops, 10Patch-For-Review: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 (10elukey) 05Open→03Resolved a:03elukey
[10:46:16] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] helmfile.d: Fix regex in api-gateway's config for revertrisk [deployment-charts] - 10https://gerrit.wikimedia.org/r/922805 (https://phabricator.wikimedia.org/T337378) (owner: 10Klausman)
[10:48:08] <wikibugs>	 10ops-codfw, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2005-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336564 (10aborrero) a:05aborrero→03Jhancock.wm Please @Jhancock.wm update the physical network connection of this server from  `asw-b1-codfw (WMF59...
[10:48:39] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Thanks for working on this <3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/922480 (https://phabricator.wikimedia.org/T325071) (owner: 10Clément Goubert)
[10:48:46] <logmsgbot>	 !log klausman@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[10:49:16] <logmsgbot>	 !log klausman@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[10:49:27] <logmsgbot>	 !log klausman@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[10:49:30] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] Send nginx and docker-registry logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/919351 (https://phabricator.wikimedia.org/T322579) (owner: 10EoghanGaffney)
[10:49:47] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Change naming scheme for resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/922480 (https://phabricator.wikimedia.org/T325071) (owner: 10Clément Goubert)
[10:50:55] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: Change naming scheme for resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/922480 (https://phabricator.wikimedia.org/T325071) (owner: 10Clément Goubert)
[10:51:23] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloudcontrol2005-dev: move to the new network setup [puppet] - 10https://gerrit.wikimedia.org/r/923301 (https://phabricator.wikimedia.org/T336564)
[10:52:35] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "We need this patch to reimage cloudcontrol2005-dev into the new network setup." [puppet] - 10https://gerrit.wikimedia.org/r/923301 (https://phabricator.wikimedia.org/T336564) (owner: 10Arturo Borrero Gonzalez)
[10:52:50] <logmsgbot>	 !log klausman@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[10:53:19] <logmsgbot>	 !log klausman@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[10:54:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48532 and previous config saved to /var/cache/conftool/dbconfig/20230525-105400-root.json
[10:54:11] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: sync
[10:54:18] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: sync
[10:54:32] <wikibugs>	 (03PS1) 10JMeybohm: modules.mesh.configuration: Copy 1.2.1 to 1.2.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/923303
[10:54:32] <logmsgbot>	 !log klausman@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[10:54:34] <wikibugs>	 (03PS1) 10JMeybohm: mesh.configuration: Add type URL to http and listener filters [deployment-charts] - 10https://gerrit.wikimedia.org/r/923304 (https://phabricator.wikimedia.org/T337405)
[10:54:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48533 and previous config saved to /var/cache/conftool/dbconfig/20230525-105443-root.json
[10:54:51] <logmsgbot>	 !log klausman@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[10:56:57] <wikibugs>	 (03Abandoned) 10Jcrespo: bacula: Reschedule run of es backups codfw -> eqiad [puppet] - 10https://gerrit.wikimedia.org/r/886837 (owner: 10Jcrespo)
[10:59:00] <wikibugs>	 (03PS1) 10Daimona Eaytoy: Set $wgCampaignEventsUseNewTrackingToolsSchema to true in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923305 (https://phabricator.wikimedia.org/T336364)
[10:59:13] <wikibugs>	 (03PS1) 10Clément Goubert: mediawiki: Bump version to 0.4.14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/923306 (https://phabricator.wikimedia.org/T325071)
[10:59:17] <wikibugs>	 (03CR) 10Daimona Eaytoy: [C: 04-1] "Blocked on T336365" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923305 (https://phabricator.wikimedia.org/T336364) (owner: 10Daimona Eaytoy)
[11:00:39] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Bump version to 0.4.14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/923306 (https://phabricator.wikimedia.org/T325071) (owner: 10Clément Goubert)
[11:01:33] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: Bump version to 0.4.14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/923306 (https://phabricator.wikimedia.org/T325071) (owner: 10Clément Goubert)
[11:03:25] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[11:03:43] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[11:04:42] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[11:05:06] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[11:05:23] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add an extra property 'CollectMode' to each user's jupyter service [puppet] - 10https://gerrit.wikimedia.org/r/921382 (https://phabricator.wikimedia.org/T336951) (owner: 10Btullis)
[11:09:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48534 and previous config saved to /var/cache/conftool/dbconfig/20230525-110905-root.json
[11:09:43] <jbond>	 !log upload udplog_1.10_amd64.deb
[11:09:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48535 and previous config saved to /var/cache/conftool/dbconfig/20230525-110948-root.json
[11:11:03] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] udp2log: update to take account of systemd updates [puppet] - 10https://gerrit.wikimedia.org/r/922867 (https://phabricator.wikimedia.org/T276623) (owner: 10Jbond)
[11:15:28] <jbond>	 !log update udplog on mwlog server
[11:15:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:49] <wikibugs>	 (03PS1) 10Btullis: Revert "Add an extra property 'CollectMode' to each user's jupyter service" [puppet] - 10https://gerrit.wikimedia.org/r/923271
[11:16:27] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Revert "Add an extra property 'CollectMode' to each user's jupyter service" [puppet] - 10https://gerrit.wikimedia.org/r/923271 (owner: 10Btullis)
[11:20:56] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync
[11:21:13] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync
[11:22:21] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync
[11:22:37] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync
[11:24:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48536 and previous config saved to /var/cache/conftool/dbconfig/20230525-112409-root.json
[11:25:05] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync
[11:25:18] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync
[11:25:36] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync
[11:25:53] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync
[11:26:06] <wikibugs>	 (03PS1) 10Jbond: puppetmaster::common: fix lint errors and docs [puppet] - 10https://gerrit.wikimedia.org/r/923322 (https://phabricator.wikimedia.org/T330490)
[11:26:44] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[11:26:58] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[11:27:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:27:56] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[11:28:08] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-api-ext_4447: Servers kubernetes2007.codfw.wmnet, kubernetes2020.codfw.wmnet, kubernetes2018.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:28:19] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[11:30:14] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[11:30:30] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[11:31:41] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[11:31:57] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[11:32:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:32:50] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 33): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41335/console" [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm)
[11:32:59] <claime>	 Checking why PyBal is seeing them down
[11:34:26] <claime>	 curls working for me
[11:36:45] <claime>	 It's not logging a recovery because there's a lingering warning for schema_443
[11:38:09] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[11:38:26] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[11:38:54] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[11:39:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48537 and previous config saved to /var/cache/conftool/dbconfig/20230525-113914-root.json
[11:39:26] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[11:39:30] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) firing: Alert for device cr2-eqsin.wikimedia.org - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[11:39:41] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 23): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41336/console" [puppet] - 10https://gerrit.wikimedia.org/r/923322 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond)
[11:40:10] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[11:40:24] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[11:40:38] <godog>	 got the page and I am out at lunch
[11:40:44] <jayme>	 looking
[11:40:51] <godog>	 can grab the laptop if needed tho
[11:40:57] <godog>	 thank you jayme
[11:41:05] <wikibugs>	 (03CR) 10Hoo man: [C: 04-1] install_console: restrict options used [puppet] - 10https://gerrit.wikimedia.org/r/922559 (https://phabricator.wikimedia.org/T117348) (owner: 10Jbond)
[11:41:05] <claime>	 here if you need me jayme
[11:41:46] <jayme>	 !incidents
[11:41:46] <sirenbot>	 3678 (ACKED)  Host db2110 (paged) - PING  - Packet loss = 100%
[11:41:46] <sirenbot>	 3679 (UNACKED)  Primary outbound port utilisation over 80%  (paged) global noc (cr2-eqsin.wikimedia.org)
[11:41:56] <jayme>	 !ack 3679
[11:41:56] <sirenbot>	 3679 (ACKED)  Primary outbound port utilisation over 80%  (paged) global noc (cr2-eqsin.wikimedia.org)
[11:43:10] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[11:43:20] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-web_4450: Servers kubernetes1022.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1021.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1015.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:43:33] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[11:44:30] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) resolved: Device cr2-eqsin.wikimedia.org recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[11:45:35] <godog>	 mhh we are back? I will go back to lunch; page me if assistance is needed
[11:46:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:48:28] <jayme>	 godog: librenms says traffic is dropping again, yes
[11:48:46] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloud_private: route the whole cloud public IPv4 space to cloudsw [puppet] - 10https://gerrit.wikimedia.org/r/923324 (https://phabricator.wikimedia.org/T336963)
[11:49:14] <godog>	 ack! thanks jayme
[11:49:25] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[11:49:41] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[11:51:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:51:59] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[11:52:20] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[11:52:27] <wikibugs>	 (03PS1) 10Jbond: puppetdb: Add support for submit_only_server_urls [puppet] - 10https://gerrit.wikimedia.org/r/923325 (https://phabricator.wikimedia.org/T330490)
[11:53:38] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloud_private: route the whole cloud public IPv4 space to cloudsw [puppet] - 10https://gerrit.wikimedia.org/r/923324 (https://phabricator.wikimedia.org/T336963)
[11:54:27] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[11:54:55] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[11:56:31] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[11:56:38] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 27): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41337/console" [puppet] - 10https://gerrit.wikimedia.org/r/923325 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond)
[11:56:44] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-api-int_4446: Servers kubernetes1022.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1018.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1011.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:56:49] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[11:57:29] <jayme>	 there was a spike in requests from AS55839 (Jio) to upload 
[11:57:52] <jayme>	 mainly from android UAs, no referer
[11:58:03] <wikibugs>	 10ops-codfw, 10DBA: db2110 crashed - https://phabricator.wikimedia.org/T337445 (10Marostegui) @Papaul @wiki_willy this server is out of warranty right? I don't know if there's much we can do about  ` 2023-05-25 05:16:13  SYS1003  System CPU Resetting.  `
[11:59:06] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 4 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " (due to missing physical file for old image e... - https://phabricator.wikimedia.org/T244567
[12:02:18] <wikibugs>	 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10jnuche)
[12:02:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:04:28] <wikibugs>	 (03PS2) 10AikoChou: ml-services: update docker images for outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/923292 (https://phabricator.wikimedia.org/T328899)
[12:06:08] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 4 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " (due to missing physical file for old image e... - https://phabricator.wikimedia.org/T244567
[12:06:41] <wikibugs>	 (03CR) 10Cathal Mooney: "Looks good to me overall, but we should refactor to make 'supernetpub' an array and include 185.15.56.0/24." [puppet] - 10https://gerrit.wikimedia.org/r/923324 (https://phabricator.wikimedia.org/T336963) (owner: 10Arturo Borrero Gonzalez)
[12:06:49] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:10:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool sanitarium masters for s1, s5, s2, s7', diff saved to https://phabricator.wikimedia.org/P48538 and previous config saved to /var/cache/conftool/dbconfig/20230525-121012-root.json
[12:10:31] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: cloud_private: route the whole cloud public IPv4 space to cloudsw (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/923324 (https://phabricator.wikimedia.org/T336963) (owner: 10Arturo Borrero Gonzalez)
[12:11:26] <wikibugs>	 (03PS2) 10Jbond: install_console: restrict options used [puppet] - 10https://gerrit.wikimedia.org/r/922559 (https://phabricator.wikimedia.org/T117348)
[12:11:59] <wikibugs>	 (03CR) 10Jbond: "updated to add hostname validation" [puppet] - 10https://gerrit.wikimedia.org/r/922559 (https://phabricator.wikimedia.org/T117348) (owner: 10Jbond)
[12:12:39] <wikibugs>	 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10jnuche) In a project's `.gitlab-ci.yml`, it is now possible to publish documentation and test coverage results to doc.wikimedia.org using [[ https://...
[12:16:44] <wikibugs>	 (03CR) 10Cathal Mooney: "Some comments back inline." [puppet] - 10https://gerrit.wikimedia.org/r/923324 (https://phabricator.wikimedia.org/T336963) (owner: 10Arturo Borrero Gonzalez)
[12:18:22] <wikibugs>	 (03CR) 10Ottomata: mw-page-content-change-enrich: enable checkpointing (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena)
[12:19:12] <icinga-wm>	 PROBLEM - Host releases1003 is DOWN: PING CRITICAL - Packet loss = 100%
[12:19:32] <wikibugs>	 (03PS11) 10Jelto: gitlab: use sshkey for git-ssh public keys [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107)
[12:19:45] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 4 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " (due to missing physical file for old image e... - https://phabricator.wikimedia.org/T244567
[12:20:04] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] profile::kafka::{broker,mirror}: simplify dependencies [puppet] - 10https://gerrit.wikimedia.org/r/923259 (https://phabricator.wikimedia.org/T337248) (owner: 10Elukey)
[12:21:19] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41338/console" [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto)
[12:23:40] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] gitlab: use sshkey for git-ssh public keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto)
[12:24:30] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) firing: Alert for device cr2-eqsin.wikimedia.org - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[12:24:38] <jayme>	 !incidents
[12:24:39] <sirenbot>	 3680 (UNACKED)  Primary outbound port utilisation over 80%  (paged) global noc (cr2-eqsin.wikimedia.org)
[12:24:39] <sirenbot>	 3678 (RESOLVED)  Host db2110 (paged) - PING  - Packet loss = 100%
[12:24:40] <sirenbot>	 3679 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global noc (cr2-eqsin.wikimedia.org)
[12:24:48] <icinga-wm>	 RECOVERY - Host releases1003 is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms
[12:24:52] <jayme>	 !ack 3680
[12:24:52] <sirenbot>	 3680 (ACKED)  Primary outbound port utilisation over 80%  (paged) global noc (cr2-eqsin.wikimedia.org)
[12:25:40] <jayme>	 !incidents
[12:25:40] <wikibugs>	 (03CR) 10David Caro: maintain-dbusers: ensure get_global_wiki_user is only called when needed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe)
[12:25:41] <sirenbot>	 3680 (ACKED)  Primary outbound port utilisation over 80%  (paged) global noc (cr2-eqsin.wikimedia.org)
[12:25:41] <sirenbot>	 3678 (RESOLVED)  Host db2110 (paged) - PING  - Packet loss = 100%
[12:25:41] <sirenbot>	 3679 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global noc (cr2-eqsin.wikimedia.org)
[12:26:39] <jayme>	 godog: same, same
[12:26:53] <jayme>	 I'm going to craft a requestctl rule to throttle them
[12:28:48] <wikibugs>	 (03CR) 10Jelto: "This change is mostly to get probes for the new kubernetes services annual.wikimedia.org and 15.wikipedia.org. But I'm not sure if we need" [puppet] - 10https://gerrit.wikimedia.org/r/923263 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto)
[12:33:52] <wikibugs>	 (03CR) 10AikoChou: ml-services: update docker images for outlink (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/923292 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou)
[12:35:41] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Update apiVersion to be compatible with k8s 1.27.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/922828 (owner: 10JMeybohm)
[12:37:12] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Stop validating against k8s 1.16, add validation against 1.27.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/922829 (owner: 10JMeybohm)
[12:39:30] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) resolved: Device cr2-eqsin.wikimedia.org recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[12:39:39] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto)
[12:41:19] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "I don't think the current implementation would work" [puppet] - 10https://gerrit.wikimedia.org/r/922559 (https://phabricator.wikimedia.org/T117348) (owner: 10Jbond)
[12:43:02] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Handle 'prefix' when 'action=edit', even if another extension overrides action [extensions/InputBox] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/923273 (https://phabricator.wikimedia.org/T337436)
[12:43:15] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Handle 'prefix' when 'action=edit', even if another extension overrides action [extensions/InputBox] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923274 (https://phabricator.wikimedia.org/T337436)
[12:51:36] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetdb: Add support for submit_only_server_urls [puppet] - 10https://gerrit.wikimedia.org/r/923325 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond)
[12:51:39] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetmaster::common: fix lint errors and docs [puppet] - 10https://gerrit.wikimedia.org/r/923322 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond)
[12:57:31] <wikibugs>	 (03CR) 10Ottomata: mw-page-content-change-enrich: enable checkpointing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena)
[12:58:32] <godog>	 jayme: ok! happy to review
[13:00:06] <wikibugs>	 (03PS1) 10Jbond: puppetmnaster::frontend: configure puppetmaster2004 as a canary [puppet] - 10https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490)
[13:00:06] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T1300)
[13:00:07] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T1300).
[13:00:07] <jouncebot>	 matthiasmullie, kart_, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:12] <matthiasmullie>	 o/
[13:00:18] <kart_>	 0/
[13:00:34] <urbanecm>	 o/ I'm only available for the first ~45 minutes
[13:00:39] <MatmaRex>	 hi
[13:00:40] <TheresNoTime>	 I'm about to go into a meeting with my manager, sorry! D:
[13:00:44] <urbanecm>	 enjoy!
[13:01:02] <urbanecm>	 i suggest i start with MatmaRex's patches and then hand it over to kart_ / matthiasmullie for self-deployment of their patches if that's fine?
[13:01:10] <kart_>	 I can deploy my patches
[13:01:12] <matthiasmullie>	 sure
[13:01:17] <urbanecm>	 okay, starting!
[13:01:30] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Handle 'prefix' when 'action=edit', even if another extension overrides action [extensions/InputBox] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/923273 (https://phabricator.wikimedia.org/T337436) (owner: 10Bartosz Dziewoński)
[13:01:32] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Handle 'prefix' when 'action=edit', even if another extension overrides action [extensions/InputBox] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923274 (https://phabricator.wikimedia.org/T337436) (owner: 10Bartosz Dziewoński)
[13:04:31] <urbanecm>	 matthiasmullie: fyi, squashing patches is not actually needed to make patches go out together/save space (multiple patches can be deployed simultaneously; on the deployment server, you can do that by `scap backport change1 change2 change3 ...`). no issues with that of course, just letting you know!
[13:05:42] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/InputBox] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/923273 (https://phabricator.wikimedia.org/T337436) (owner: 10Bartosz Dziewoński)
[13:05:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/InputBox] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923274 (https://phabricator.wikimedia.org/T337436) (owner: 10Bartosz Dziewoński)
[13:05:52] <icinga-wm>	 PROBLEM - Docker registry HTTPS interface on registry1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[13:06:48] <matthiasmullie>	 urbanecm: good to know; but I suppose they'd still all be queued in CI?
[13:07:16] <icinga-wm>	 RECOVERY - Docker registry HTTPS interface on registry1003 is OK: HTTP OK: HTTP/1.1 200 OK - 3754 bytes in 0.276 second response time https://wikitech.wikimedia.org/wiki/Docker
[13:07:22] <urbanecm>	 well, yes, but AFAICS CI is usually able to process a few patches at once.
[13:09:13] <wikibugs>	 (03CR) 10Ottomata: mw-page-content-change-enrich: enable checkpointing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena)
[13:10:16] <matthiasmullie>	 yeah 2 parallel should not be an issue :p thanks for the headsup
[13:10:46] <wikibugs>	 10SRE, 10ops-codfw, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2005-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336564 (10cmooney) >>! In T336564#8879530, @aborrero wrote: > Please @Jhancock.wm update the physical network connection of this server from...
[13:11:58] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] ml-services: update docker images for outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/923292 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou)
[13:12:08] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41339/console" [puppet] - 10https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond)
[13:12:22] <wikibugs>	 (03PS4) 10Ottomata: flink-operator - deploy in wikikube eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/922874 (https://phabricator.wikimedia.org/T333464)
[13:13:03] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Add add_user_is_temp_T336886.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/920771 (https://phabricator.wikimedia.org/T336886) (owner: 10Ladsgroup)
[13:13:33] <wikibugs>	 (03Merged) 10jenkins-bot: Add add_user_is_temp_T336886.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/920771 (https://phabricator.wikimedia.org/T336886) (owner: 10Ladsgroup)
[13:14:31] <kart_>	 urbanecm: so, I can also deploy two patches together. This is cool!
[13:14:48] <urbanecm>	 yup, you can! even if they're in multiple release branches.
[13:15:06] <kart_>	 Super. Noted.
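The multi-change invocation urbanecm describes above (`scap backport change1 change2 change3 ...`) can be sketched as a small helper that assembles one command line from several Gerrit change numbers. This is only an illustration: `build_backport_command` is a hypothetical helper, not part of scap, and it only builds the argument list without running anything. The change numbers are the two InputBox backports from this window.

```python
import shlex


def build_backport_command(change_numbers):
    """Return the argv list for a single `scap backport` run covering all changes.

    Hypothetical helper for illustration; it does not execute scap.
    """
    if not change_numbers:
        raise ValueError("at least one change number is required")
    # One invocation, many changes: each change number is a positional argument.
    return ["scap", "backport", *[str(n) for n in change_numbers]]


# The two InputBox changes (one per release branch) go out in one run.
command = build_backport_command([923273, 923274])
print(shlex.join(command))
```

Building an argument list rather than a single string keeps each change number a separate argument if the list is later handed to something like `subprocess.run`.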
[13:18:32] <wikibugs>	 (03Merged) 10jenkins-bot: Handle 'prefix' when 'action=edit', even if another extension overrides action [extensions/InputBox] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/923273 (https://phabricator.wikimedia.org/T337436) (owner: 10Bartosz Dziewoński)
[13:18:35] <wikibugs>	 (03Merged) 10jenkins-bot: Handle 'prefix' when 'action=edit', even if another extension overrides action [extensions/InputBox] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923274 (https://phabricator.wikimedia.org/T337436) (owner: 10Bartosz Dziewoński)
[13:19:06] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:923273|Handle 'prefix' when 'action=edit', even if another extension overrides action (T337436)]], [[gerrit:923274|Handle 'prefix' when 'action=edit', even if another extension overrides action (T337436)]]
[13:19:11] <stashbot>	 T337436: InputBox 'prefix' is ignored when ArticleCreationWorkflow takes over the page - https://phabricator.wikimedia.org/T337436
[13:19:20] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Change maint script to do work via jobs [extensions/ImageSuggestions] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923252 (owner: 10Matthias Mullie)
[13:20:38] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and matmarex: Backport for [[gerrit:923273|Handle 'prefix' when 'action=edit', even if another extension overrides action (T337436)]], [[gerrit:923274|Handle 'prefix' when 'action=edit', even if another extension overrides action (T337436)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[13:20:51] <urbanecm>	 MatmaRex: your patch is at mwdebug1002, can you have a look please?
[13:21:33] <MatmaRex>	 yup. works fine now at https://en.wikipedia.org/w/index.php?title=Wikipedia:Article_wizard/CreateDraft&oldid=1116388402
[13:22:03] <urbanecm>	 awesome, proceeding.
[13:24:12] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbproxy1026.mgmt.eqiad.wmnet with reboot policy FORCED
[13:24:33] <matthiasmullie>	 mine can skip mwdebug & go right ahead (only affects currently inactive maint script, only on wikis where wmf.10  is not yet live); shall we also merge kart_ patches already?
[13:26:03] <urbanecm>	 acknowledged
[13:26:05] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Show Contribute menu item in main menu when Special:Contribute is enabled [skins/MinervaNeue] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/923268 (https://phabricator.wikimedia.org/T336838) (owner: 10KartikMistry)
[13:26:09] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Show Contribute menu item in main menu when Special:Contribute is enabled [skins/MinervaNeue] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923269 (https://phabricator.wikimedia.org/T336838) (owner: 10KartikMistry)
[13:28:13] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:923273|Handle 'prefix' when 'action=edit', even if another extension overrides action (T337436)]], [[gerrit:923274|Handle 'prefix' when 'action=edit', even if another extension overrides action (T337436)]] (duration: 09m 06s)
[13:28:18] <stashbot>	 T337436: InputBox 'prefix' is ignored when ArticleCreationWorkflow takes over the page - https://phabricator.wikimedia.org/T337436
[13:28:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/ImageSuggestions] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923252 (owner: 10Matthias Mullie)
[13:28:25] <urbanecm>	 MatmaRex: your patch is live now.
[13:28:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:28:40] <MatmaRex>	 thanks
[13:29:10] <urbanecm>	 kart_: I've +2'ed your patches, but I'll likely not have the time to deploy them, as I'll have to leave soon. i'll let you know once i'm done with matthias.mullie's patch!
[13:29:12] <urbanecm>	 np
[13:29:24] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] "one addition: I think we could also explicitly set the ip to connect to the new kubernetes ingress. Setting the ip is also done here for e" [puppet] - 10https://gerrit.wikimedia.org/r/922918 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn)
[13:30:21] <kart_>	 urbanecm: sure
[13:33:23] <wikibugs>	 (03CR) 10JHathaway: puppet-merge: implement Lock out, tag out (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/922915 (https://phabricator.wikimedia.org/T248872) (owner: 10Jbond)
[13:33:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:34:53] <wikibugs>	 (03PS1) 10Jbond: puppetmaster: frontend [puppet] - 10https://gerrit.wikimedia.org/r/923341 (https://phabricator.wikimedia.org/T337107)
[13:35:15] <wikibugs>	 (03CR) 10JHathaway: puppetmaster: add new function to check for local files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922877 (https://phabricator.wikimedia.org/T268344) (owner: 10Jbond)
[13:35:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetmaster: frontend [puppet] - 10https://gerrit.wikimedia.org/r/923341 (https://phabricator.wikimedia.org/T337107) (owner: 10Jbond)
[13:38:16] <wikibugs>	 (03PS2) 10Jbond: puppetmnaster::frontend: configure puppetmaster2004 as a canary [puppet] - 10https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490)
[13:38:19] <wikibugs>	 (03Merged) 10jenkins-bot: Change maint script to do work via jobs [extensions/ImageSuggestions] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923252 (owner: 10Matthias Mullie)
[13:38:35] <wikibugs>	 (03PS1) 10Jelto: miscweb: set ipv4 and ipv6 for 15 and annual blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/923342 (https://phabricator.wikimedia.org/T300171)
[13:38:48] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:923252|Change maint script to do work via jobs]]
[13:39:43] <wikibugs>	 (03CR) 10Jelto: "possibly a intermediate solution beside removing the checks completely in I7c533a4308a84088a54911dd1ddfb913395766b0" [puppet] - 10https://gerrit.wikimedia.org/r/923342 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto)
[13:40:49] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41340/console" [puppet] - 10https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond)
[13:41:46] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: profile::configmaster:  dump a json data structure of the pools [puppet] - 10https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705)
[13:44:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] profile::configmaster:  dump a json data structure of the pools [puppet] - 10https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705) (owner: 10Giuseppe Lavagetto)
[13:44:34] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host dbproxy1026.mgmt.eqiad.wmnet with reboot policy FORCED
[13:44:48] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host dbproxy1027.mgmt.eqiad.wmnet with reboot policy FORCED
[13:45:40] <wikibugs>	 (03PS2) 10Jelto: miscweb: set ipv4 and ipv6 for 15 and annual blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/923342 (https://phabricator.wikimedia.org/T300171)
[13:46:19] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: update docker images for outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/923292 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou)
[13:46:30] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:923252|Change maint script to do work via jobs]] (duration: 07m 42s)
[13:46:33] <urbanecm>	 matthiasmullie: your patch is live. kart_, please deploy your patches once they merge!
[13:46:48] <matthiasmullie>	 urbanecm: thanks man!
[13:46:51] <wikibugs>	 (03Merged) 10jenkins-bot: Show Contribute menu item in main menu when Special:Contribute is enabled [skins/MinervaNeue] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/923268 (https://phabricator.wikimedia.org/T336838) (owner: 10KartikMistry)
[13:46:53] <wikibugs>	 (03Merged) 10jenkins-bot: Show Contribute menu item in main menu when Special:Contribute is enabled [skins/MinervaNeue] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923269 (https://phabricator.wikimedia.org/T336838) (owner: 10KartikMistry)
[13:47:02] <urbanecm>	 any time
[13:50:20] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: profile::configmaster:  dump a json data structure of the pools [puppet] - 10https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705)
[13:50:48] <kart_>	 urbanecm: Thanks!
[13:50:52] <kart_>	 Deploying..
[13:52:07] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:923268|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]], [[gerrit:923269|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]]
[13:52:11] <stashbot>	 T336838: Avoid the Contributions menu to disappear on mobile web - https://phabricator.wikimedia.org/T336838
[13:53:01] <wikibugs>	 (03PS2) 10Jbond: puppetmaster: frontend [puppet] - 10https://gerrit.wikimedia.org/r/923341 (https://phabricator.wikimedia.org/T337107)
[13:53:03] <wikibugs>	 (03PS3) 10Jbond: puppetmnaster::frontend: configure puppetmaster2004 as a canary [puppet] - 10https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490)
[13:53:11] <wikibugs>	 (03CR) 10JHathaway: install_console: restrict options used (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922559 (https://phabricator.wikimedia.org/T117348) (owner: 10Jbond)
[13:53:32] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: profile::configmaster:  dump a json data structure of the pools (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705) (owner: 10Giuseppe Lavagetto)
[13:53:35] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetmaster: frontend [puppet] - 10https://gerrit.wikimedia.org/r/923341 (https://phabricator.wikimedia.org/T337107) (owner: 10Jbond)
[13:53:37] <logmsgbot>	 !log kartik@deploy1002 kartik: Backport for [[gerrit:923268|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]], [[gerrit:923269|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[13:53:38] <wikibugs>	 (03PS6) 10Giuseppe Lavagetto: profile::configmaster: dump a json data structure of the pools [puppet] - 10https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705)
[13:55:31] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41341/console" [puppet] - 10https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond)
[13:56:39] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] cassandra: add support for version 4.1.1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans)
[13:56:45] <wikibugs>	 (03PS3) 10Jbond: puppetmaster: frontend [puppet] - 10https://gerrit.wikimedia.org/r/923341 (https://phabricator.wikimedia.org/T337107)
[13:56:47] <wikibugs>	 (03PS4) 10Jbond: puppetmnaster::frontend: configure puppetmaster2004 as a canary [puppet] - 10https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490)
[13:57:15] <wikibugs>	 (03Abandoned) 10Jbond: puppetmaster: frontend [puppet] - 10https://gerrit.wikimedia.org/r/923341 (https://phabricator.wikimedia.org/T337107) (owner: 10Jbond)
[13:58:06] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::configmaster: dump a json data structure of the pools [puppet] - 10https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705) (owner: 10Giuseppe Lavagetto)
[13:59:12] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41342/console" [puppet] - 10https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond)
[13:59:34] <wikibugs>	 (03PS1) 10Herron: mwlog: remove redis instance [puppet] - 10https://gerrit.wikimedia.org/r/923348 (https://phabricator.wikimedia.org/T327277)
[13:59:50] <wikibugs>	 (03PS1) 10BBlack: Ratelimit a hotlink saturation case [puppet] - 10https://gerrit.wikimedia.org/r/923349
[14:00:38] <wikibugs>	 (03PS12) 10Gmodena: mw-page-content-change-enrich: enable checkpointing [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656)
[14:00:54] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/923325 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond)
[14:00:57] <wikibugs>	 (03CR) 10Volans: "post-merge thing to fix" [puppet] - 10https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705) (owner: 10Giuseppe Lavagetto)
[14:03:12] <kart_>	 (My patches are still being deployed, had issue with cache it seems so wasn't able to test it until trying hard)
[14:03:37] <topranks>	 kart_: no probs, let me know when they're done 
[14:03:59] <topranks>	 and anyone else with deploys still running, as I want to lock scap after to do some LVS maintenance (T322937)
[14:04:00] <stashbot>	 T322937: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937
[14:04:34] <kart_>	 topranks: sure. Few minutes probably..
[14:04:49] <topranks>	 no rush 
[14:05:42] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[14:05:45] <wikibugs>	 (03CR) 10Ottomata: mw-page-content-change-enrich: enable checkpointing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena)
[14:06:12] <wikibugs>	 (03PS2) 10BBlack: Ratelimit a hotlink saturation case [puppet] - 10https://gerrit.wikimedia.org/r/923349
[14:06:18] <kart_>	 topranks: looks like it is failing due to some error.
[14:06:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:06:39] <kart_>	 `2023-05-25 14:06:26,265 [WARNING] Issues connecting to lvs1019:9090: HTTPConnectionPool(host='lvs1019', port=9090): Max retries exceeded with url: /pools/parsoid-php_443/parse1008.eqiad.wmnet (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f40576abf28>: Failed to establish a new connection: [Errno 111] Connection refused'))`
[14:06:41] <wikibugs>	 (03CR) 10Ladsgroup: profile::configmaster: dump a json data structure of the pools (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705) (owner: 10Giuseppe Lavagetto)
[14:06:46] <wikibugs>	 (03PS4) 10Volans: sre.ganeti.makevm: refactor to simplify expansion [cookbooks] - 10https://gerrit.wikimedia.org/r/860080 (https://phabricator.wikimedia.org/T306661)
[14:06:54] <topranks>	 eh perhaps that was me sorry 
[14:07:15] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Ratelimit a hotlink saturation case [puppet] - 10https://gerrit.wikimedia.org/r/923349 (owner: 10BBlack)
[14:07:38] <icinga-wm>	 PROBLEM - pybal on lvs1019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[14:07:45] <topranks>	 kart_: I'd disabled puppet and pybal in preparation on lvs1019, didn't think it would affect you though 
[14:07:51] <topranks>	 re-started now so you can try again 
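The `Connection refused` warning kart_ pasted is what a client sees when nothing is listening on the target port, which is consistent with pybal having been stopped on lvs1019. A minimal sketch of that kind of reachability check, assuming a plain TCP connect is a good enough probe; `port_accepts_connections` is a hypothetical helper name, not something from scap or pybal, and the host and port in the comment are taken from the log line above.

```python
import socket


def port_accepts_connections(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened, else False.

    Hypothetical helper for illustration. A refused connection (Errno 111),
    a timeout, and a DNS failure all surface as OSError and yield False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call, matching the host and port in the warning above:
#   port_accepts_connections("lvs1019", 9090)
```

This only tells you whether something is listening; it says nothing about whether the service behind the port is healthy.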
[14:07:52] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM (to my vcl-untrained eye anyways)" [puppet] - 10https://gerrit.wikimedia.org/r/923349 (owner: 10BBlack)
[14:08:02] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1418.eqiad.wmnet, mw1417.eqiad.wmnet, mw1416.eqiad.wmnet, mw1415.eqiad.wmnet, mw1414.eqiad.wmnet are marked down but pooled: parsoid-php_443: Servers parse1017.eqiad.wmnet, parse1011.eqiad.wmnet are marked down but pooled: api-https_443: Servers mw1447.eqiad.wmnet, mw1448.eqiad.wmnet, mw1449.eqiad.wmnet, mw1450.eqiad.w
[14:08:02] <icinga-wm>	  marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:08:03] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:923268|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]], [[gerrit:923269|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]] (duration: 15m 56s)
[14:08:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance
[14:08:08] <stashbot>	 T336838: Avoid the Contributions menu to disappear on mobile web - https://phabricator.wikimedia.org/T336838
[14:08:15] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Ratelimit a hotlink saturation case [puppet] - 10https://gerrit.wikimedia.org/r/923349 (owner: 10BBlack)
[14:08:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance
[14:08:18] <icinga-wm>	 PROBLEM - Host releases2003 is DOWN: PING CRITICAL - Packet loss = 100%
[14:08:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T336886)', diff saved to https://phabricator.wikimedia.org/P48544 and previous config saved to /var/cache/conftool/dbconfig/20230525-140822-ladsgroup.json
[14:08:28] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[14:08:34] <kart_>	 topranks: yeah, restarting it.
[14:08:38] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.puppetboard.restart-reboot rolling restart_daemons on P{puppetboard2002.codfw.wmnet} and (A:puppetboard)
[14:08:59] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.wipe-cache puppetboard.discovery.wmnet. on all recursors
[14:09:02] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetboard.discovery.wmnet. on all recursors
[14:09:06] <kart_>	 `14:08:03 66 hosts had failures restarting php-fpm`
[14:09:06] <kart_>	 `14:08:03 58 hosts had failures restarting php-fpm`
[14:09:06] <kart_>	 `14:08:03 24 hosts had failures restarting php-fpm`
[14:09:07] <jinxer-wm>	 (ProbeDown) firing: Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#parsoid-php:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:09:12] <icinga-wm>	 RECOVERY - pybal on lvs1019 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[14:09:34] <kart_>	 I'm re-running scap backport
[14:09:44] <godog>	 paged for parsoid, I'm assuming that's related to the ongoing deployment ?
[14:09:51] <bblack>	 what's going on with that and the pybal parsoid-php thing?
[14:09:53] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.puppetboard.restart-reboot (exit_code=0) rolling restart_daemons on P{puppetboard2002.codfw.wmnet} and (A:puppetboard)
[14:10:04] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:923268|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]], [[gerrit:923269|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]]
[14:10:06] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:10:06] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:10:06] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:10:06] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:10:06] <icinga-wm>	 PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[14:10:07] <jinxer-wm>	 (ProbeDown) firing: Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#parsoid-php:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:10:08] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:10:08] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:10:12] <kart_>	 godog: no. Something else.
[14:10:16] <bblack>	 ok caught up a bit, I get the pybal part now, ignore that
[14:10:17] <godog>	 bblack: looks like some coordination for work on lvs1019
[14:10:22] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1418.eqiad.wmnet, mw1417.eqiad.wmnet, mw1416.eqiad.wmnet, mw1415.eqiad.wmnet, mw1414.eqiad.wmnet are marked down but pooled: api-https_443: Servers mw1447.eqiad.wmnet, mw1448.eqiad.wmnet, mw1449.eqiad.wmnet, mw1450.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:10:24] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.or
[14:10:24] <icinga-wm>	 ESTBase
[14:10:25] <godog>	 kart_: ack, thank you
[14:10:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:10:32] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:10:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:10:32] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:10:34] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:10:34] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:10:36] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:10:36] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:10:36] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:36] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:38] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:10:38] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:10:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:10:40] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:10:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:10:40] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:10:40] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.
[14:10:40] <icinga-wm>	 a.org/wiki/RESTBase
[14:10:43] <topranks>	 bblack: I may have messed up, had tried to disable PyBal on lvs1019, expecting failover to lvs1020
[14:10:44] <wikibugs>	 (03PS5) 10Jbond: puppetmnaster::frontend: configure puppetmaster2004 as a canary [puppet] - 10https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490)
[14:10:46] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) is CRITICAL: Test Get site-specific CSS returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/description/{title} (Get description for test page) timed out before a response was received: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was rece
[14:10:46] <icinga-wm>	 domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) timed 
[14:10:46] <icinga-wm>	 re a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[14:10:52] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/description/{title} (Get description for test page) is CRITICAL: Test Get description for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/media-list/{title} (Get media list from test page) is CRITICAL: Test Get media list from test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-ht
[14:10:52] <icinga-wm>	 e} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test 
[14:10:52] <icinga-wm>	 urned the unexpected status 503 (expecting: 200): /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[14:11:02] <icinga-wm>	 PROBLEM - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.or
[14:11:02] <icinga-wm>	 ESTBase
[14:11:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:11:04] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:11:07] <bblack>	 topranks: yeah there's an outstanding problem on the scap side that requires us to not do LVS maintenance during deploys :/
[14:11:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:11:10] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:11:10] <icinga-wm>	 PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.or
[14:11:10] <icinga-wm>	 ESTBase
[14:11:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:11:10] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:11:10] <icinga-wm>	 PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.or
[14:11:10] <jayme>	 !incidents
[14:11:11] <sirenbot>	 3681 (UNACKED)  ProbeDown sre (10.2.2.28 ip4 parsoid-php:443 probes/service http_parsoid-php_ip4 eqiad)
[14:11:11] <icinga-wm>	 ESTBase
[14:11:11] <sirenbot>	 3680 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global noc (cr2-eqsin.wikimedia.org)
[14:11:11] <sirenbot>	 3678 (RESOLVED)  Host db2110 (paged) - PING  - Packet loss = 100%
[14:11:11] <sirenbot>	 3679 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global noc (cr2-eqsin.wikimedia.org)
[14:11:12] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:11:12] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:11:12] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:11:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:11:14] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:11:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:11:14] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:11:15] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:11:15] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:11:16] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:11:16] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:11:17] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2027 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:11:17] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[14:11:17] <topranks>	 bblack: reversed that anyway PyBal back 2m30s
[14:11:18] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:11:18] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:11:19] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:11:19] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:11:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:11:20] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:11:20] <jayme>	 !ack 3681
[14:11:21] <sirenbot>	 3681 (ACKED)  ProbeDown sre (10.2.2.28 ip4 parsoid-php:443 probes/service http_parsoid-php_ip4 eqiad)
[14:11:21] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:11:21] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:11:22] <icinga-wm>	 PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[14:11:22] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featu
[14:11:23] <icinga-wm>	 e data for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[14:11:23] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:11:24] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{dom
[14:11:24] <icinga-wm>	 media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title
[14:11:25] <icinga-wm>	 TICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[14:11:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:11:26] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:11:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:11:27] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:11:27] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:11:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:11:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:11:29] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:11:29] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:11:30] <icinga-wm>	 PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.or
[14:11:30] <icinga-wm>	 ESTBase
[14:11:39] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbproxy1026.mgmt.eqiad.wmnet with reboot policy FORCED
[14:11:41] <logmsgbot>	 !log kartik@deploy1002 kartik: Backport for [[gerrit:923268|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]], [[gerrit:923269|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[14:11:42] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[14:11:42] <icinga-wm>	 /wiki/Services/Monitoring/restbase
[14:12:00] <topranks>	 bblack: my bad yes, sukhe explained the order of ops properly to me and I messed up and shut pybal ahead of locking deploys
[14:12:02] <icinga-wm>	 PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[14:12:03] <bblack>	 I suspect they're all depooled
[14:12:05] <jinxer-wm>	 (VarnishUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[14:12:07] <kart_>	 I'll wait for a while to go ahead then..
[14:12:10] <jinxer-wm>	 (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[14:12:19] <topranks>	 kart_: yes, sorry!
[14:12:32] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.28:443]) https://wikitech.wikimedia.org/wiki/PyBal
[14:12:40] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbproxy1027.mgmt.eqiad.wmnet with reboot policy FORCED
[14:12:51] <Jhs>	 time to update https://www.wikimediastatus.net/ ?
[14:13:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T336886)', diff saved to https://phabricator.wikimedia.org/P48545 and previous config saved to /var/cache/conftool/dbconfig/20230525-141318-ladsgroup.json
[14:13:20] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.28:443]) https://wikitech.wikimedia.org/wiki/PyBal
[14:13:27] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41343/console" [puppet] - 10https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond)
[14:13:39] <akosiaris>	 I'll pull all of parsoid in eqiad
[14:13:39] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: (3) Average latency high: eqiad appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExc
[14:13:42] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: ...
[14:13:45] <akosiaris>	 it's depooled per https://config-master.wikimedia.org/pybal/eqiad/parsoid-php
[14:13:48] <jinxer-wm>	 eqiad api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[14:14:07] <jinxer-wm>	 (ProbeDown) firing: (15) Service api-https:443 has failed probes (http_api-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:14:20] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/pooled=yes; selector: service=parsoid-php,dc=eqiad
[14:14:26] <icinga-wm>	 RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[14:14:30] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:14:37] <bblack>	 ^ my conftool above should mitigate
[14:14:46] <icinga-wm>	 RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[14:14:50] <icinga-wm>	 RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[14:14:52] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:14:58] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] puppetmnaster::frontend: configure puppetmaster2004 as a canary [puppet] - 10https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond)
[14:15:07] <jinxer-wm>	 (ProbeDown) firing: (15) Service api-https:443 has failed probes (http_api-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:15:16] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:15:22] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:15:50] <icinga-wm>	 RECOVERY - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[14:15:56] <icinga-wm>	 RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[14:15:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:15:58] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:16:00] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:16:26] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:16:31] <urbanecm>	 we also seem to have all mw servers in eqiad depooled
[14:16:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:16:44] <icinga-wm>	 RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[14:16:52] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:16:52] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:16:52] <bblack>	 urbanecm: right now? which service keys?
[14:16:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:17:05] <Amir1>	 here
[14:17:14] <urbanecm>	 bblack: https://config-master.wikimedia.org/pybal/eqiad/appservers-https claims enabled:false
[14:17:32] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:17:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:17:36] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:17:41] <AzaTht>	 are you performing maintenance or are you experiencing a technical problem?
[14:17:44] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[14:17:46] <icinga-wm>	 RECOVERY - Host releases2003 is UP: PING OK - Packet loss = 0%, RTA = 31.97 ms
[14:18:00] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:18:00] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:18:00] <urbanecm>	 AzaTht: we're experiencing a technical problem, but we're on it. please be patient.
[14:18:04] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[14:18:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: (2) Average latency high: eqiad api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[14:18:21] <wikibugs>	 10SRE, 10SRE-Unowned, 10Discovery-Search, 10Datacenter-Switchover: Warn when CirrusSearch is not configured to use local DC for an extended time - https://phabricator.wikimedia.org/T204135 (10jbond)
[14:18:29] <AzaTht>	 urbanecm: (thumbs up)
[14:18:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:18:34] <icinga-wm>	 RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[14:18:38] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:18:46] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:18:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[14:18:52] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[14:19:02] <icinga-wm>	 RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[14:19:04] <icinga-wm>	 PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[14:19:04] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:19:06] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:19:06] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:19:06] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:19:07] <jinxer-wm>	 (ProbeDown) firing: (15) Service api-https:443 has failed probes (http_api-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:19:13] <sukhe>	 hi
[14:19:14] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[14:19:16] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:19:22] <icinga-wm>	 PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[14:19:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[14:19:49] <kart_>	 urbanecm: I was in the middle of deployment and then it failed with lvs errors and then restarted scap - now I'm sitting at patch deployed on mwdebug to test and waiting till above issue is fixed :/ 
[14:20:02] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:07] <jinxer-wm>	 (ProbeDown) firing: (15) Service api-https:443 has failed probes (http_api-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:20:08] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:10] <akosiaris>	 this is probably a result of https://phabricator.wikimedia.org/T334703
[14:20:34] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[14:20:36] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:54] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:20:57] <urbanecm>	 kart_: it's not your fault at all. but yes, let's wait for the sites to be up and then deployment can be finished
[14:21:06] <icinga-wm>	 PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[14:21:12] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:21:19] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=appserver,dc=eqiad
[14:21:24] <wikibugs>	 (PS1) Jbond: puppetmaster2004: enable subimt_only [puppet] - https://gerrit.wikimedia.org/r/923353 (https://phabricator.wikimedia.org/T330490)
[14:21:29] <kart_>	 urbanecm: yes. Half deployed thing - I'll wait.
[14:21:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:21:36] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:21:38] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=api_appserver,dc=eqiad
[14:21:38] <claime>	 repooling appservers in eqiad 
[14:21:42] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:21:42] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:21:59] <AzaTht>	 I suggest setting https://www.wikimediastatus.net/ to more than "editing issues"; I'm just getting a generic error message atm
[14:22:02] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:22:03] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=appserver
[14:22:04] <wikibugs>	 (PS6) Jbond: puppetmnaster::frontend: configure puppetmaster2004 as a canary [puppet] - https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490)
[14:22:07] <wikibugs>	 (PS2) Jbond: puppetmaster2004: enable subimt_only [puppet] - https://gerrit.wikimedia.org/r/923353 (https://phabricator.wikimedia.org/T330490)
[14:22:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:22:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:22:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:22:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:22:10] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[14:22:10] <icinga-wm>	 RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[14:22:12] <icinga-wm>	 RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[14:22:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:22:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:22:14] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072']
[14:22:14] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[14:22:18] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:22:22] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:22:26] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=jobrunner
[14:22:28] <icinga-wm>	 RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[14:22:36] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=videoscaler
[14:22:38] <icinga-wm>	 RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[14:22:42] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:23:00] <icinga-wm>	 RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[14:23:04] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:23:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:23:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:23:14] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[14:23:21] <claime>	 jobrunners/videoscalers repooled
[14:23:22] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[14:23:42] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41344/console" [puppet] - https://gerrit.wikimedia.org/r/923353 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[14:24:07] <jinxer-wm>	 (ProbeDown) resolved: (15) Service api-https:443 has failed probes (http_api-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:24:30] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service,httpbb_kubernetes_mw-api-int_hourly.service,httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:24:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: (2) Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[14:25:07] <jinxer-wm>	 (ProbeDown) firing: (15) Service api-https:443 has failed probes (http_api-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:25:39] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbproxy1026']
[14:25:53] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbproxy1022']
[14:26:09] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dbproxy1022']
[14:26:15] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbproxy1022']
[14:26:22] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:26:27] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dbproxy1022']
[14:26:43] <jinxer-wm>	 (VarnishUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[14:26:44] <jinxer-wm>	 (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[14:26:53] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbproxy1023']
[14:27:07] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbproxy1022']
[14:27:11] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dbproxy1022']
[14:27:26] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dbproxy1023']
[14:27:37] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbproxy1023']
[14:27:40] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:27:44] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbproxy1022']
[14:28:01] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dbproxy1022']
[14:28:04] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dbproxy1023']
[14:28:12] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbproxy1024']
[14:28:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: (2) Average latency high: eqiad api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[14:28:18] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbproxy1025']
[14:28:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P48546 and previous config saved to /var/cache/conftool/dbconfig/20230525-142824-ladsgroup.json
[14:28:30] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dbproxy1024']
[14:28:31] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dbproxy1025']
[14:28:40] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbproxy1026']
[14:28:45] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbproxy1027']
[14:28:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[14:29:14] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dbproxy1027']
[14:29:32] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:29:37] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['dbproxy1026']
[14:30:07] <jinxer-wm>	 (ProbeDown) resolved: (9) Service api-https:443 has failed probes (http_api-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:30:31] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops, Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (Jclark-ctr)
[14:31:12] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:31:52] <wikibugs>	 (CR) JHathaway: [C: +1] "looks good, typo in commit msg 😊" [puppet] - https://gerrit.wikimedia.org/r/923353 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[14:32:18] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:32:23] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1022.eqiad.wmnet with OS bullseye
[14:32:29] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops, Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye
[14:33:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: (3) Average latency high: eqiad appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyE
[14:33:27] <wikibugs>	 (CR) Gmodena: mw-page-content-change-enrich: enable checkpointing (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) (owner: Gmodena)
[14:33:38] <wikibugs>	 (PS7) Jbond: puppetmnaster::frontend: configure puppetmaster2004 as a canary [puppet] - https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490)
[14:33:40] <wikibugs>	 (PS3) Jbond: puppetmaster2004: enable subimt_only [puppet] - https://gerrit.wikimedia.org/r/923353 (https://phabricator.wikimedia.org/T330490)
[14:33:42] <wikibugs>	 (PS1) Jbond: puppetmaster: fix puppetdb_submit_only_hosts [puppet] - https://gerrit.wikimedia.org/r/923356
[14:34:41] <wikibugs>	 (PS13) Gmodena: mw-page-content-change-enrich: enable checkpointing [deployment-charts] - https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656)
[14:34:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (2) Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[14:36:42] <wikibugs>	 (CR) Ottomata: mw-page-content-change-enrich: enable checkpointing (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) (owner: Gmodena)
[14:36:44] <wikibugs>	 (CR) Ottomata: [C: +2] mw-page-content-change-enrich: enable checkpointing [deployment-charts] - https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) (owner: Gmodena)
[14:36:54] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:37:29] <wikibugs>	 (Merged) jenkins-bot: mw-page-content-change-enrich: enable checkpointing [deployment-charts] - https://gerrit.wikimedia.org/r/922831 (https://phabricator.wikimedia.org/T336656) (owner: Gmodena)
[14:38:12] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:38:30] <wikibugs>	 (CR) Volans: profile::configmaster: dump a json data structure of the pools (1 comment) [puppet] - https://gerrit.wikimedia.org/r/921692 (https://phabricator.wikimedia.org/T330705) (owner: Giuseppe Lavagetto)
[14:40:02] <wikibugs>	 SRE, serviceops-radar, Patch-For-Review, SRE Observability (FY2022/2023-Q4), User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (fgiunchedi)
[14:40:04] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:40:26] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41345/console" [puppet] - https://gerrit.wikimedia.org/r/923353 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[14:40:36] <kart_>	 Are we good to continue deployment in-progress?
[14:41:36] <godog>	 bblack jayme ^ what do you think ?
[14:42:26] <godog>	 I think we're okay to resume, I'd like a second opinion though
[14:42:37] <kart_>	 Sure. I'll wait.
[14:43:16] <jayme>	 we still have elevated latencies on appservers, let's see if that settles first
[14:43:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P48547 and previous config saved to /var/cache/conftool/dbconfig/20230525-144330-ladsgroup.json
[14:44:04] <godog>	 good call yeah
[14:44:28] <godog>	 for reference, the alert's dashboard for the appserver latency in this case is https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?from=now-3h&orgId=1&to=now&var-cluster=api_appserver&var-datasource=eqiad+prometheus%2Fops&var-method=GET&viewPanel=9
[14:45:08] <claime>	 Yeah it's specifically api_appservers
[14:45:31] <godog>	 that's right api_appservers, my bad
[14:51:22] <wikibugs>	 SRE, serviceops, API Platform (RESTbase Deprecation Roadmap), Patch-For-Review: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (Jdforrester-WMF)
[14:54:01] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[14:54:52] <marostegui>	 !log Wikireplicas are lagging behind for the following sections: s1, s2, s5, s7 T337446
[14:54:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:57] <stashbot>	 T337446: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446
[14:57:04] <wikibugs>	 (CR) Dzahn: "Oh, interesting option I had not even considered. Seems like a good idea until everything is in service catalog, which yes, has a lot of o" [puppet] - https://gerrit.wikimedia.org/r/923342 (https://phabricator.wikimedia.org/T300171) (owner: Jelto)
[14:58:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T336886)', diff saved to https://phabricator.wikimedia.org/P48548 and previous config saved to /var/cache/conftool/dbconfig/20230525-145836-ladsgroup.json
[14:58:38] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance
[14:58:41] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[14:58:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance
[14:58:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T336886)', diff saved to https://phabricator.wikimedia.org/P48549 and previous config saved to /var/cache/conftool/dbconfig/20230525-145857-ladsgroup.json
[15:03:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T336886)', diff saved to https://phabricator.wikimedia.org/P48550 and previous config saved to /var/cache/conftool/dbconfig/20230525-150347-ladsgroup.json
[15:03:52] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[15:04:38] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on cr1-eqiad,lsw1-e1-eqiad.mgmt with reason: Migrate lsw1-e1-eqiad to cr1-eqiad link to ssw1-e1-eqiad
[15:04:53] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cr1-eqiad,lsw1-e1-eqiad.mgmt with reason: Migrate lsw1-e1-eqiad to cr1-eqiad link to ssw1-e1-eqiad
[15:05:00] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=03f7b2ab-bdea-4c56-ac41-3ec30004db4a) set by cmooney@cumin1001 for 0:30:00 on 2 host(s...
[15:05:55] <kart_>	 jayme: just ping me when it is OK to deploy. Or should I abandon it? It is deployed in wmf.9/mwdebug servers.
[15:06:13] <kart_>	 Patch: https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/923268 (wmf.9)
[15:07:12] <jayme>	 kart_: latency is trending down and almost back to normal. I'd say we will be clear in a couple of minutes
[15:07:54] <godog>	 +1
[15:08:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: ...
[15:08:16] <jinxer-wm>	 eqiad api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[15:08:53] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gerrit2002.wikimedia.org with reason: maintenance
[15:08:55] <jayme>	 eheh
[15:09:17] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gerrit2002.wikimedia.org with reason: maintenance
[15:10:41] <mutante>	 !log gerrit-replica.wikimedia.org - gerrit2002 - reimaging - scheduled maintenance
[15:10:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:48] <topranks>	 !log Migrating cr1-eqiad downlink to row E/F from lsw1-e1-eqiad et-0/0/48 to ssw1-e1-eqiad et-0/0/31
[15:10:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:30] <jayme>	 kart_ godog: I would say go ahead (cc bblack | urandom)
[15:11:41] <kart_>	 jayme: thanks. 
[15:12:14] <kart_>	 Going ahead..
[15:14:31] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host gerrit2002.wikimedia.org with OS bullseye
[15:15:04] <godog>	 agreed
[15:15:38] <wikibugs>	 (PS1) Aklapper: Automate yearly Phabricator metrics for wikitech-l [puppet] - https://gerrit.wikimedia.org/r/923367 (https://phabricator.wikimedia.org/T337388)
[15:17:25] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 (Clement_Goubert)
[15:17:55] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 (Clement_Goubert) p:Triage→High
[15:18:07] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 (Clement_Goubert)
[15:18:12] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:923268|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]], [[gerrit:923269|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]] (duration: 68m 07s)
[15:18:16] <stashbot>	 T336838: Avoid the Contributions menu to disappear on mobile web - https://phabricator.wikimedia.org/T336838
[15:18:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P48551 and previous config saved to /var/cache/conftool/dbconfig/20230525-151853-ladsgroup.json
[15:20:13] <kart_>	 jayme: looks good on wmf.9
[15:20:23] <jayme>	 cool, thanks!
[15:20:29] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:923269|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]]
[15:20:34] <kart_>	 jayme: finishing pending on wmf.10
[15:20:46] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on cr2-eqiad,lsw1-f1-eqiad.mgmt with reason: Migrate lsw1-e1-eqiad to cr2-eqiad link to ssw1-e1-eqiad
[15:21:01] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cr2-eqiad,lsw1-f1-eqiad.mgmt with reason: Migrate lsw1-e1-eqiad to cr2-eqiad link to ssw1-e1-eqiad
[15:21:08] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cf76e0ba-8648-48a0-beed-fe7b60f79656) set by cmooney@cumin1001 for 0:30:00 on 2 host(s...
[15:21:58] <logmsgbot>	 !log kartik@deploy1002 kartik: Backport for [[gerrit:923269|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[15:24:54] <wikibugs>	 (03PS1) 10Hnowlan: svg: attempt to build valid locales from hyphenated languages [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/923368 (https://phabricator.wikimedia.org/T337139)
[15:27:30] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:923269|Show Contribute menu item in main menu when Special:Contribute is enabled (T336838)]] (duration: 07m 01s)
[15:27:35] <stashbot>	 T336838: Avoid the Contributions menu to disappear on mobile web - https://phabricator.wikimedia.org/T336838
[15:28:00] <wikibugs>	 10SRE, 10Traffic, 10serviceops, 10Platform Team Initiatives (API Gateway): Handle edge cache invalidation for the api gateway - https://phabricator.wikimedia.org/T324200 (10elukey) The ML team is serving its Lift Wing model servers via the API gateway, so we'd benefit as well to have edge caching :)
[15:28:19] <kart_>	 jayme: I'm all done. Thanks a lot, SREs for taking care of issues.
[15:28:35] <jayme>	 ack, thanks!
[15:28:40] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy1022.eqiad.wmnet with OS bullseye
[15:28:41] <kart_>	 (and all those who were helping, of course!)
[15:28:55] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors: - db...
[15:30:09] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gerrit2002.wikimedia.org with reason: host reimage
[15:30:56] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:31:56] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:33:19] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gerrit2002.wikimedia.org with reason: host reimage
[15:33:43] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on lsw1-e[1-2]-eqiad.mgmt with reason: Migrate lsw1-e1-eqiad to cr1-eqiad link to ssw1-e1-eqiad
[15:33:57] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lsw1-e[1-2]-eqiad.mgmt with reason: Migrate lsw1-e1-eqiad to cr1-eqiad link to ssw1-e1-eqiad
[15:34:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P48552 and previous config saved to /var/cache/conftool/dbconfig/20230525-153359-ladsgroup.json
[15:34:03] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8f44dd48-0cac-4bfd-907a-512dfa686d40) set by cmooney@cumin1001 for 0:30:00 on 2 host(s...
[15:34:33] <wikibugs>	 (03CR) 10Jbond: puppetmaster: add new function to check for local files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922877 (https://phabricator.wikimedia.org/T268344) (owner: 10Jbond)
[15:37:51] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] "Assuming this is old, can we abandon?" [puppet] - 10https://gerrit.wikimedia.org/r/827526 (https://phabricator.wikimedia.org/T315580) (owner: 10Snwachukwu)
[15:38:27] <wikibugs>	 (03CR) 10Ottomata: "Should we merge this?" [puppet] - 10https://gerrit.wikimedia.org/r/908529 (https://phabricator.wikimedia.org/T327073) (owner: 10Aqu)
[15:43:36] <logmsgbot>	 !log dancy@deploy1002 Started deploy [integration/docroot@78e6f40]: (no justification provided)
[15:43:46] <logmsgbot>	 !log dancy@deploy1002 Finished deploy [integration/docroot@78e6f40]: (no justification provided) (duration: 00m 10s)
[15:44:05] <dancy>	 !log dancy@deploy1002 Updated scap URLs on doc.wikimedia.org
[15:44:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T336886)', diff saved to  and previous config saved to /var/cache/conftool/dbconfig/20230525-154906-ladsgroup.json
[15:49:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2180.codfw.wmnet with reason: Maintenance
[15:49:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2180.codfw.wmnet with reason: Maintenance
[15:49:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T336886)', diff saved to  and previous config saved to /var/cache/conftool/dbconfig/20230525-154927-ladsgroup.json
[15:49:32] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[15:49:48] <logmsgbot>	 !log dancy@deploy1002 Started deploy [integration/docroot@dac2b70]: Updated Scap URLs
[15:49:56] <logmsgbot>	 !log dancy@deploy1002 Finished deploy [integration/docroot@dac2b70]: Updated Scap URLs (duration: 00m 07s)
[15:50:01] <jinxer-wm>	 (ProbeDown) firing: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:50:25] <bblack>	 do we have some known reason for the phab thing above?
[15:50:37] <wikibugs>	 (03CR) 10Ottomata: "Okay, so.  A reason why this will make metrics and dashboarding weird:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/896362 (owner: 10DCausse)
[15:50:59] <godog>	 not that I'm aware of yet, looks like v6 only tho
[15:51:26] <bblack>	 maybe the row E/F -related work?
[15:51:34] <wikibugs>	 (03CR) 10Ottomata: rdf-streaming-updater: add a "wcqs" release (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/896362 (owner: 10DCausse)
[15:51:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T336886)', diff saved to https://phabricator.wikimedia.org/P48553 and previous config saved to /var/cache/conftool/dbconfig/20230525-155139-ladsgroup.json
[15:51:46] <bblack>	 hmmm that's in row B though
[15:52:02] <bblack>	 phabricator itself seems ok?
[15:52:21] <bblack>	 I wonder why we're paging on an individual server and not the overall service?
[15:52:22] <godog>	 yeah I see the probe recovered, the alert will recover shortly I'm assuming
[15:53:26] <godog>	 bblack: good question, I'm making a note to dig deeper tomorrow on why is that
[15:54:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:54:38] <bblack>	 I guess the alert is for the real public IP, it's just associated to the currently-active individual server
[15:54:48] <bblack>	 so, while it's confusing, it does make sense to page
[15:55:01] <jinxer-wm>	 (ProbeDown) resolved: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:55:45] <godog>	 mmhh yeah confusing alright, the probed name should be there not the hostname
[15:56:48] <godog>	 FWIW related task is https://phabricator.wikimedia.org/T312840 and obviously I haven't got around to it yet
[15:56:57] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on lsw1-e2-eqiad.mgmt,lsw1-f1-eqiad.mgmt with reason: Migrate lsw1-e2-eqiad uplink from lsw1-f1 to ssw1-f1
[15:56:57] <wikibugs>	 (03PS1) 10Ladsgroup: BannerRenderer: Make sure the language variant is valid [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427)
[15:57:11] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lsw1-e2-eqiad.mgmt,lsw1-f1-eqiad.mgmt with reason: Migrate lsw1-e2-eqiad uplink from lsw1-f1 to ssw1-f1
[15:57:16] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c43be552-7ced-4f58-99c1-a10b5984bf3a) set by cmooney@cumin1001 for 0:30:00 on 2 host(s...
[15:57:25] <wikibugs>	 (03PS1) 10BBlack: Bugfix for hotlink URL patch earlier [puppet] - 10https://gerrit.wikimedia.org/r/923375
[15:57:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Bugfix for hotlink URL patch earlier [puppet] - 10https://gerrit.wikimedia.org/r/923375 (owner: 10BBlack)
[15:58:50] <wikibugs>	 (03PS1) 10Elukey: helmfile.d: attempt to fix changeprop's staging config for Lift Wing [deployment-charts] - 10https://gerrit.wikimedia.org/r/923376
[15:59:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:59:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] BannerRenderer: Make sure the language variant is valid [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427) (owner: 10Ladsgroup)
[15:59:45] <wikibugs>	 (03PS2) 10BBlack: fix the UA matching in the earlier hotlink patch [puppet] - 10https://gerrit.wikimedia.org/r/923375
[16:00:05] <jouncebot>	 jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:52] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] fix the UA matching in the earlier hotlink patch [puppet] - 10https://gerrit.wikimedia.org/r/923375 (owner: 10BBlack)
[16:01:19] <wikibugs>	 (03CR) 10Ladsgroup: "recheck" [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427) (owner: 10Ladsgroup)
[16:02:20] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gerrit2002.wikimedia.org with OS bullseye
[16:04:01] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:04:10] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to developer account/wmf for Amal Ramadan - https://phabricator.wikimedia.org/T337492 (10Aklapper) Hi @ARamadan-WMF, thanks for taking the time to report this! I assume this is about https://wikitech.wikimedia.org/wiki/Special:CreateAccount ? Which exact wiki usernam...
[16:06:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P48555 and previous config saved to /var/cache/conftool/dbconfig/20230525-160645-ladsgroup.json
[16:07:09] <icinga-wm>	 PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[16:07:31] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on gerrit2002.wikimedia.org with reason: maintenance
[16:07:44] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on gerrit2002.wikimedia.org with reason: maintenance
[16:07:57] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on gerrit2002 is CRITICAL: CRITICAL - degraded: The following units failed: gerrit.service daniel_zahn https://phabricator.wikimedia.org/T334521 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:07:57] <icinga-wm>	 ACKNOWLEDGEMENT - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site daniel_zahn https://phabricator.wikimedia.org/T334521 https://wikitech.wikimedia.org/wiki/Gerrit
[16:11:15] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on lsw1-e[1,3]-eqiad.mgmt,lsw1-f1-eqiad.mgmt with reason: Migrate lsw1-e3-eqiad uplinks to spine
[16:11:30] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lsw1-e[1,3]-eqiad.mgmt,lsw1-f1-eqiad.mgmt with reason: Migrate lsw1-e3-eqiad uplinks to spine
[16:11:36] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=37545969-c51e-450d-9ef0-5fadfd151520) set by cmooney@cumin1001 for 0:30:00 on 3 host(s...
[16:14:20] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[16:14:24] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[16:16:25] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:12] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[16:18:16] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[16:18:37] <wikibugs>	 (03PS1) 10Dzahn: gerrit: update SSH host key for reimaged gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/923378 (https://phabricator.wikimedia.org/T334521)
[16:19:31] <icinga-wm>	 RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[16:20:44] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] gerrit: update SSH host key for reimaged gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/923378 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn)
[16:20:53] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gerrit: update SSH host key for reimaged gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/923378 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn)
[16:21:07] <wikibugs>	 (03PS2) 10Dzahn: gerrit: update SSH host key for reimaged gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/923378 (https://phabricator.wikimedia.org/T334521)
[16:21:21] <wikibugs>	 (03CR) 10Dzahn: [V: 03+2] gerrit: update SSH host key for reimaged gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/923378 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn)
[16:21:38] <wikibugs>	 (03CR) 10Func: "It seems the `wmf_deploy` branch should be used instead. Not sure how that works." [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427) (owner: 10Ladsgroup)
[16:21:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P48556 and previous config saved to /var/cache/conftool/dbconfig/20230525-162151-ladsgroup.json
[16:22:18] <wikibugs>	 (03CR) 10Func: BannerRenderer: Make sure the language variant is valid (031 comment) [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427) (owner: 10Ladsgroup)
[16:28:39] <wikibugs>	 (03Abandoned) 10Ladsgroup: BannerRenderer: Make sure the language variant is valid [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427) (owner: 10Ladsgroup)
[16:29:04] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10Hghani) Hi, I've setup the Kerberos authentication but I am having trouble signing into Jupyterhub and Wikimedia Dev single sign on:  {F37034637} {F370346...
[16:29:19] <wikibugs>	 10SRE, 10PyBal, 10Release-Engineering-Team, 10Scap, and 4 others: High rate of errors and increased latency on uncached MediaWiki requests due to infrastructure outage - https://phabricator.wikimedia.org/T337497 (10jcrespo)
[16:34:17] <wikibugs>	 (03CR) 10Clément Goubert: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/923384 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert)
[16:34:33] <wikibugs>	 (03CR) 10Clément Goubert: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/923385 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert)
[16:34:48] <wikibugs>	 (03CR) 10Clément Goubert: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert)
[16:34:57] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Add class-of-service parent interface shaper for sub-rated services (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/922603 (https://phabricator.wikimedia.org/T337220) (owner: 10Cathal Mooney)
[16:35:40] <wikibugs>	 (03Merged) 10jenkins-bot: Add class-of-service parent interface shaper for sub-rated services [homer/public] - 10https://gerrit.wikimedia.org/r/922603 (https://phabricator.wikimedia.org/T337220) (owner: 10Cathal Mooney)
[16:36:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T336886)', diff saved to https://phabricator.wikimedia.org/P48557 and previous config saved to /var/cache/conftool/dbconfig/20230525-163657-ladsgroup.json
[16:37:03] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[16:37:29] <wikibugs>	 (03PS2) 10Clément Goubert: mw-on-k8s: Redirect closed wikis to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490)
[16:37:48] <wikibugs>	 (03PS1) 10Cathal Mooney: Move row E/F core router uplinks to Spine switches [homer/public] - 10https://gerrit.wikimedia.org/r/923387 (https://phabricator.wikimedia.org/T322937)
[16:39:03] <sukhe>	 jouncebot: nowandnext
[16:39:03] <jouncebot>	 For the next 0 hour(s) and 20 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T1600)
[16:39:03] <jouncebot>	 In 0 hour(s) and 20 minute(s): Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T1700)
[16:39:03] <jouncebot>	 In 0 hour(s) and 20 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T1700)
[16:39:25] <wikibugs>	 (03PS2) 10Clément Goubert: testwikidatawiki: Fix missing mobile redir to k8s [puppet] - 10https://gerrit.wikimedia.org/r/923384 (https://phabricator.wikimedia.org/T337490)
[16:39:27] <wikibugs>	 (03PS2) 10Clément Goubert: mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923385 (https://phabricator.wikimedia.org/T337490)
[16:39:29] <wikibugs>	 (03PS3) 10Clément Goubert: mw-on-k8s: Redirect closed wikis to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490)
[16:39:48] <topranks>	 !log adding outbound shaper config on eqsin to codfw transport cct (T328313)
[16:39:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:52] <stashbot>	 T328313: Expose sub-rated circuit speeds to Homer templates - https://phabricator.wikimedia.org/T328313
[16:41:58] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41346/console" [puppet] - 10https://gerrit.wikimedia.org/r/923384 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert)
[16:42:29] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41347/console" [puppet] - 10https://gerrit.wikimedia.org/r/923385 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert)
[16:42:58] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41348/console" [puppet] - 10https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert)
[16:46:56] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Move row E/F core router uplinks to Spine switches [homer/public] - 10https://gerrit.wikimedia.org/r/923387 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney)
[16:47:16] <wikibugs>	 (03PS1) 10BBlack: pybal: add support for advertised instrumentation [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/923389 (https://phabricator.wikimedia.org/T334703)
[16:47:35] <wikibugs>	 (03Merged) 10jenkins-bot: Move row E/F core router uplinks to Spine switches [homer/public] - 10https://gerrit.wikimedia.org/r/923387 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney)
[16:48:33] <wikibugs>	 (03PS3) 10Clément Goubert: testwikidatawiki: Fix missing mobile redir to k8s [puppet] - 10https://gerrit.wikimedia.org/r/923384 (https://phabricator.wikimedia.org/T337490)
[16:48:35] <wikibugs>	 (03PS3) 10Clément Goubert: mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923385 (https://phabricator.wikimedia.org/T337490)
[16:48:37] <wikibugs>	 (03PS4) 10Clément Goubert: mw-on-k8s: Redirect closed wikis to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490)
[16:48:51] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Create Quality of Service design for WMF internal networks - https://phabricator.wikimedia.org/T316358 (10cmooney)
[16:49:44] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Expose sub-rated circuit speeds to Homer templates - https://phabricator.wikimedia.org/T328313 (10cmooney) 05Open→03Resolved Merged and shapers set on codfw to eqsin link.
[16:50:33] <wikibugs>	 (03PS1) 10David Caro: wmcs-backup: a couple fixes [puppet] - 10https://gerrit.wikimedia.org/r/923390
[16:51:39] <wikibugs>	 10SRE, 10ops-codfw, 10DBA: db2110 crashed - https://phabricator.wikimedia.org/T337445 (10wiki_willy) a:03Jhancock.wm Hi @Marostegui - Papaul is on paternity leave for another week, so I'm going to pass this over to @Jhancock.wm to check out.  The server is about 4yrs old, so it's out of warranty, but there...
[16:52:22] <wikibugs>	 (03PS4) 10Aqu: analytics: Remove extra check on webrequest _SUCCESS files on HDFS [puppet] - 10https://gerrit.wikimedia.org/r/908529 (https://phabricator.wikimedia.org/T327073)
[16:53:20] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41349/console" [puppet] - 10https://gerrit.wikimedia.org/r/923384 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert)
[16:53:26] <wikibugs>	 (03CR) 10Aqu: "Rebased and ready for merge." [puppet] - 10https://gerrit.wikimedia.org/r/908529 (https://phabricator.wikimedia.org/T327073) (owner: 10Aqu)
[16:54:23] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41350/console" [puppet] - 10https://gerrit.wikimedia.org/r/923385 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert)
[16:54:40] <wikibugs>	 (03PS1) 10BryanDavis: toolhub: Bump container version to 2023-05-25-111820-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/923391
[16:55:29] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41351/console" [puppet] - 10https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert)
[16:55:45] <wikibugs>	 (03PS4) 10Robertsky: Change project logo for Wikimania to Wikimania 2023 version T337044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921610
[16:56:14] <wikibugs>	 (03PS1) 10BryanDavis: developer-portal: Bump container version to 2023-05-22-111728-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/923393
[16:57:21] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container version to 2023-05-25-111820-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/923391 (owner: 10BryanDavis)
[16:57:58] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container version to 2023-05-22-111728-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/923393 (owner: 10BryanDavis)
[16:58:14] <wikibugs>	 (03Merged) 10jenkins-bot: toolhub: Bump container version to 2023-05-25-111820-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/923391 (owner: 10BryanDavis)
[16:59:04] <wikibugs>	 (03Merged) 10jenkins-bot: developer-portal: Bump container version to 2023-05-22-111728-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/923393 (owner: 10BryanDavis)
[17:00:06] <jouncebot>	 bd808: gettimeofday() says it's time for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T1700)
[17:00:06] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T1700)
[17:00:59] <bd808>	 o/ I have deploys for toolhub and developer-portal today. I'll start on them fairly soon.
[17:01:42] <wikibugs>	 (03PS1) 10Cathal Mooney: Adjust Eqiad row E/F switch parents in hierdata after cable moves [puppet] - 10https://gerrit.wikimedia.org/r/923395 (https://phabricator.wikimedia.org/T322937)
[17:02:14] <wikibugs>	 10SRE, 10ops-codfw, 10DBA: db2110 crashed - https://phabricator.wikimedia.org/T337445 (10Marostegui) Yeah, I wonder if there's anything we can do to troubleshoot this from a hardware point of view.
[17:03:20] <wikibugs>	 (03PS1) 10Jbond: ferm: Add types and docs to ferm::client [puppet] - 10https://gerrit.wikimedia.org/r/923396
[17:03:53] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply
[17:05:08] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply
[17:05:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ferm: Add types and docs to ferm::client [puppet] - 10https://gerrit.wikimedia.org/r/923396 (owner: 10Jbond)
[17:06:13] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply
[17:06:13] <wikibugs>	 (03PS2) 10Jbond: ferm: Add types and docs to ferm::client [puppet] - 10https://gerrit.wikimedia.org/r/923396
[17:06:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) We migrated a bunch of network <-> network links today without issue (crossed them out in above table).  Didn't touch the LVS's aft...
[17:07:33] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply
[17:08:14] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41353/console" [puppet] - 10https://gerrit.wikimedia.org/r/923396 (owner: 10Jbond)
[17:08:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ferm: Add types and docs to ferm::client [puppet] - 10https://gerrit.wikimedia.org/r/923396 (owner: 10Jbond)
[17:08:43] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply
[17:09:19] <wikibugs>	 (03PS3) 10Jbond: ferm: Add types and docs to ferm::client [puppet] - 10https://gerrit.wikimedia.org/r/923396
[17:09:49] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply
[17:12:39] <wikibugs>	 (03PS4) 10Jbond: ferm: Add types and docs to ferm::client [puppet] - 10https://gerrit.wikimedia.org/r/923396
[17:12:56] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply
[17:13:18] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[17:14:09] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[17:14:41] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41354/console" [puppet] - 10https://gerrit.wikimedia.org/r/923396 (owner: 10Jbond)
[17:14:44] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[17:14:59] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[17:14:59] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] pybal: add support for advertised instrumentation [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/923389 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack)
[17:15:30] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[17:15:39] <wikibugs>	 (03CR) 10Ssingh: [V: 03+2 C: 03+2] pybal: add support for advertised instrumentation [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/923389 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack)
[17:17:19] <wikibugs>	 (03PS5) 10Jbond: ferm: Add types and docs to ferm::client [puppet] - 10https://gerrit.wikimedia.org/r/923396
[17:17:38] * bd808 is done deploying things
[17:17:47] <bblack>	 forever? :)
[17:18:23] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] analytics: Remove extra check on webrequest _SUCCESS files on HDFS [puppet] - 10https://gerrit.wikimedia.org/r/908529 (https://phabricator.wikimedia.org/T327073) (owner: 10Aqu)
[17:18:24] <bd808>	 eh. probably just for May 2023 :)
[17:19:43] <wikibugs>	 (03PS6) 10Jbond: ferm: Add types and docs to ferm::client [puppet] - 10https://gerrit.wikimedia.org/r/923396
[17:21:00] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 9): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41356/console" [puppet] - 10https://gerrit.wikimedia.org/r/923396 (owner: 10Jbond)
[17:22:34] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 14): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41357/console" [puppet] - 10https://gerrit.wikimedia.org/r/923396 (owner: 10Jbond)
[17:22:49] <wikibugs>	 (03PS1) 10Ssingh: Release 1.15.12 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/923399 (https://phabricator.wikimedia.org/T334703)
[17:23:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48558 and previous config saved to /var/cache/conftool/dbconfig/20230525-172326-root.json
[17:23:33] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[17:24:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48559 and previous config saved to /var/cache/conftool/dbconfig/20230525-172413-root.json
[17:25:10] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] Release 1.15.12 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/923399 (https://phabricator.wikimedia.org/T334703) (owner: 10Ssingh)
[17:26:20] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS entries for migration IPs eqiad row E F switches. - cmooney@cumin1001"
[17:27:25] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS entries for migration IPs eqiad row E F switches. - cmooney@cumin1001"
[17:27:25] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:38:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 2%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48561 and previous config saved to /var/cache/conftool/dbconfig/20230525-173831-root.json
[17:39:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48562 and previous config saved to /var/cache/conftool/dbconfig/20230525-173918-root.json
[17:41:40] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10cmooney) Step 2 - Move CR Uplinks has now been completed.  We are also 50% of the way through steps 3 and 4.  Will continue with...
[17:53:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48563 and previous config saved to /var/cache/conftool/dbconfig/20230525-175335-root.json
[17:54:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48564 and previous config saved to /var/cache/conftool/dbconfig/20230525-175423-root.json
[17:56:43] <wikibugs>	 (03PS1) 10BBlack: pybal: quick bugfix for advertised instrumentation [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/923404 (https://phabricator.wikimedia.org/T334703)
[18:00:06] <jouncebot>	 ^demon and dancy: (Dis)respected human, time to deploy MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T1800). Please do the needful.
[18:02:43] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] pybal: quick bugfix for advertised instrumentation [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/923404 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack)
[18:05:52] <wikibugs>	 (03PS1) 10Ssingh: Release 1.15.13 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/923405 (https://phabricator.wikimedia.org/T334703)
[18:08:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48565 and previous config saved to /var/cache/conftool/dbconfig/20230525-180840-root.json
[18:08:51] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] Release 1.15.13 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/923405 (https://phabricator.wikimedia.org/T334703) (owner: 10Ssingh)
[18:09:12] <wikibugs>	 (03CR) 10Ssingh: [V: 03+2 C: 03+2] Release 1.15.13 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/923405 (https://phabricator.wikimedia.org/T334703) (owner: 10Ssingh)
[18:09:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48566 and previous config saved to /var/cache/conftool/dbconfig/20230525-180927-root.json
[18:15:59] <wikibugs>	 (03PS1) 10Cathal Mooney: Fix cable validator to allow editing of existing cable [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/923406 (https://phabricator.wikimedia.org/T310590)
[18:20:17] <wikibugs>	 (03PS2) 10Cathal Mooney: Fix cable validator to allow editing of existing cable [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/923406 (https://phabricator.wikimedia.org/T310590)
[18:20:44] <wikibugs>	 (03PS1) 10Kimberly Sarabia: Reapply new fix to en beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923407 (https://phabricator.wikimedia.org/T336969)
[18:23:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48567 and previous config saved to /var/cache/conftool/dbconfig/20230525-182345-root.json
[18:24:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48568 and previous config saved to /var/cache/conftool/dbconfig/20230525-182432-root.json
[18:30:41] <sukhe>	 jouncebot: now
[18:30:41] <jouncebot>	 For the next 1 hour(s) and 29 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T1800)
[18:38:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48570 and previous config saved to /var/cache/conftool/dbconfig/20230525-183849-root.json
[18:39:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48571 and previous config saved to /var/cache/conftool/dbconfig/20230525-183937-root.json
[18:39:41] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs-backup: a couple fixes [puppet] - 10https://gerrit.wikimedia.org/r/923390 (owner: 10David Caro)
[18:42:31] <sukhe>	 dancy: no train deployment today, correct? sorry, just checking since we will do a scap lock to test some LVS changes to prevent future scap locks when doing LVS changes :)
[18:43:12] <logmsgbot>	 !log htriedman@deploy1002 Started deploy [airflow-dags/platform_eng@6b27584]: (no justification provided)
[18:43:31] <logmsgbot>	 !log htriedman@deploy1002 Finished deploy [airflow-dags/platform_eng@6b27584]: (no justification provided) (duration: 00m 19s)
[18:43:35] <wikibugs>	 (03PS14) 10Andrew Bogott: backy2: Prepare for switch to postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 (https://phabricator.wikimedia.org/T332734)
[18:43:37] <wikibugs>	 (03PS1) 10Andrew Bogott: backy2: switch from sqlite to postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/923410 (https://phabricator.wikimedia.org/T332734)
[18:46:27] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] backy2: Prepare for switch to postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/922591 (https://phabricator.wikimedia.org/T332734) (owner: 10Andrew Bogott)
[18:53:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48572 and previous config saved to /var/cache/conftool/dbconfig/20230525-185354-root.json
[18:54:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48573 and previous config saved to /var/cache/conftool/dbconfig/20230525-185441-root.json
[18:57:02] <wikibugs>	 (03PS2) 10Andrew Bogott: backy2: switch from sqlite to postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/923410 (https://phabricator.wikimedia.org/T332734)
[18:57:04] <wikibugs>	 (03PS1) 10Andrew Bogott: backy2: include python3-psycopg2 [puppet] - 10https://gerrit.wikimedia.org/r/923412 (https://phabricator.wikimedia.org/T332734)
[18:59:23] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] backy2: include python3-psycopg2 [puppet] - 10https://gerrit.wikimedia.org/r/923412 (https://phabricator.wikimedia.org/T332734) (owner: 10Andrew Bogott)
[19:00:52] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 4 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " (due to missing physical file for old image e... - https://phabricator.wikimedia.org/T244567
[19:02:28] <wikibugs>	 (03CR) 10Jdlrobson: Enable the new Special:Contribute page entry point for desktop on selected wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921049 (https://phabricator.wikimedia.org/T327868) (owner: 10KartikMistry)
[19:02:35] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T337451 (10phaultfinder)
[19:08:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48574 and previous config saved to /var/cache/conftool/dbconfig/20230525-190859-root.json
[19:09:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48575 and previous config saved to /var/cache/conftool/dbconfig/20230525-190946-root.json
[19:17:30] <wikibugs>	 (03CR) 10Stevemunene: [V: 03+1] Update Puppet files for Airflow Upgrade to 2.3.2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/827526 (https://phabricator.wikimedia.org/T315580) (owner: 10Snwachukwu)
[19:18:25] <wikibugs>	 (03PS1) 10Jdrewniak: Use document feature classes to extract A/B test state [skins/Vector] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923281 (https://phabricator.wikimedia.org/T335972)
[19:19:45] <wikibugs>	 (03PS2) 10DCausse: ttm: use new config option to separate readable and writable services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922481 (https://phabricator.wikimedia.org/T322284)
[19:20:24] <wikibugs>	 (03CR) 10DCausse: ttm: use new config option to separate readable and writable services (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922481 (https://phabricator.wikimedia.org/T322284) (owner: 10DCausse)
[19:22:19] <wikibugs>	 (03CR) 10Jdrewniak: [C: 03+1] Reapply new fix to en beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923407 (https://phabricator.wikimedia.org/T336969) (owner: 10Kimberly Sarabia)
[19:24:59] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/923406 (https://phabricator.wikimedia.org/T310590) (owner: 10Cathal Mooney)
[19:27:07] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:29:24] <dancy>	 sukhe: It looks like the train should be unblocked now. Demon do you plan to roll forward today?
[19:29:28] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.dns.netbox
[19:29:57] <dancy>	 sukhe: Feel free to hold the scap lock as needed.
[19:31:25] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add pybal-low-traffic.svc.codfw.wmnet - bblack@cumin1001"
[19:32:06] <sukhe>	 dancy: thanks, we will let you know
[19:32:07] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:32:15] <sukhe>	 feel free to proceed for now, thanks 
[19:32:30] <logmsgbot>	 !log bblack@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add pybal-low-traffic.svc.codfw.wmnet - bblack@cumin1001"
[19:32:30] <logmsgbot>	 !log bblack@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:33:22] <sukhe>	 we will let you know here if we block scap but assume no for now.
[19:38:09] <wikibugs>	 (03PS1) 10BBlack: Add pybal-low-traffic.svc.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/923414 (https://phabricator.wikimedia.org/T334703)
[19:38:41] <dancy>	 Ok
[19:39:59] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Add pybal-low-traffic.svc.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/923414 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack)
[19:52:58] <wikibugs>	 (03PS2) 10Jdrewniak: Enable Vector "Zebra" AB test to enwiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923407 (https://phabricator.wikimedia.org/T336969) (owner: 10Kimberly Sarabia)
[19:55:07] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:56:28] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Manual backport of OOUI change I63293edd62 (tab dialog fix) [core] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923282 (https://phabricator.wikimedia.org/T337515)
[20:00:06] <jouncebot>	 brennen and TheresNoTime: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230525T2000).
[20:00:06] <jouncebot>	 kimberly_sarabia, Daimona, and MatmaRex: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:07] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:00:14] <kimberly_sarabia>	 hello
[20:00:38] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Manual backport of OOUI change I63293edd62 (tab dialog fix) [core] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923282 (https://phabricator.wikimedia.org/T337515)
[20:00:43] <MatmaRex>	 hi
[20:00:57] <TheresNoTime>	 hi, I can deploy :)
[20:01:09] <brennen>	 whew
[20:01:12] <TheresNoTime>	 :p
[20:01:24] <TheresNoTime>	 kimberly_sarabia: going to start with your beta config patch, 923407, get that out the way
[20:01:27] <Daimona>	 o/
[20:01:44] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "prep for deploy" [skins/Vector] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923281 (https://phabricator.wikimedia.org/T335972) (owner: 10Jdrewniak)
[20:01:49] <kimberly_sarabia>	 TheresNoTime: Thanks
[20:01:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923407 (https://phabricator.wikimedia.org/T336969) (owner: 10Kimberly Sarabia)
[20:02:19] <wikibugs>	 (03PS5) 10Samtar: [prod] Configure logging for the CampaignEvents channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919838 (https://phabricator.wikimedia.org/T337365) (owner: 10Daimona Eaytoy)
[20:02:35] <MatmaRex>	 i am struggling a bit with my backport, but i should have it sorted out in a few minutes
[20:02:52] <TheresNoTime>	 Daimona: I'll then do your config patch, 919838, while the vector one merges if that's okay?
[20:03:01] <MatmaRex>	 the UBNs always come in minutes before the last deployment slot of the week
[20:03:02] <Daimona>	 Sure, ty!
[20:03:05] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Vector "Zebra" AB test to enwiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923407 (https://phabricator.wikimedia.org/T336969) (owner: 10Kimberly Sarabia)
[20:03:48] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919838 (https://phabricator.wikimedia.org/T337365) (owner: 10Daimona Eaytoy)
[20:04:37] <wikibugs>	 (03Merged) 10jenkins-bot: [prod] Configure logging for the CampaignEvents channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919838 (https://phabricator.wikimedia.org/T337365) (owner: 10Daimona Eaytoy)
[20:05:06] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:919838|[prod] Configure logging for the CampaignEvents channel (T337365)]]
[20:05:10] <stashbot>	 T337365: Enable CampaignEvents logging in beta and production - https://phabricator.wikimedia.org/T337365
[20:05:21] <wikibugs>	 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T337247 (10wiki_willy) a:03Jhancock.wm
[20:05:45] <wikibugs>	 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T337276 (10wiki_willy) a:03Jhancock.wm
[20:06:07] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:06:40] <logmsgbot>	 !log samtar@deploy1002 samtar and daimona: Backport for [[gerrit:919838|[prod] Configure logging for the CampaignEvents channel (T337365)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[20:06:57] <TheresNoTime>	 Daimona: that's live on mwdebug, can you test?
[20:07:29] <Daimona>	 I don't think it's testable, because no logs can be generated for that channel yet
[20:07:46] <TheresNoTime>	 will sync :)
[20:07:46] <Daimona>	 The best I could do is use shell.php to log something manually, but I'm not sure if it's desirable
[20:07:54] <Daimona>	 Or if there's a smarter way to do that
[20:08:16] <TheresNoTime>	 I've started to sync now
[20:08:27] <Daimona>	 Ok, thanks :)
[20:08:32] <TheresNoTime>	 kimberly_sarabia: your config change should be on beta now-ish 
[20:08:48] <kimberly_sarabia>	 TheresNoTime: Thanks
[20:11:07] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:12:27] <wikibugs>	 (03PS1) 10Hashar: wm-patch-demo: use WARNING to prevent chipset collapsing [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/923418 (https://phabricator.wikimedia.org/T332474)
[20:12:35] <bblack>	 jouncebot: next
[20:12:35] <jouncebot>	 In 9 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230526T0600)
[20:12:50] <bblack>	 any deploys from above still ongoing?
[20:13:01] <TheresNoTime>	 bblack: yes
[20:13:06] <bblack>	 ok
[20:13:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:13:37] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:919838|[prod] Configure logging for the CampaignEvents channel (T337365)]] (duration: 08m 31s)
[20:13:40] <TheresNoTime>	 there's two wmf.10 backports left to do
[20:13:42] <stashbot>	 T337365: Enable CampaignEvents logging in beta and production - https://phabricator.wikimedia.org/T337365
[20:13:51] <TheresNoTime>	 Daimona: live on prod :)
[20:14:41] <Daimona>	 Amazing, thank you!
[20:14:58] <TheresNoTime>	 kimberly_sarabia: moving on to 923281, needs a few more minutes to merge though
[20:15:05] <kimberly_sarabia>	 ok
[20:15:49] <TheresNoTime>	 bblack: did you want me to hold off on starting the merge for the next .10 backport?
[20:15:55] <bblack>	 no go ahead
[20:15:58] <TheresNoTime>	 ack :)
[20:16:07] <bblack>	 I'm just waiting for an idle time to lock up scap and do some SRE-level things later
[20:18:27] <TheresNoTime>	 MatmaRex: did you sort your patch, can I start it merging?
[20:18:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:19:03] <MatmaRex>	 TheresNoTime: if it passes tests, yes. i'm waiting to confirm that :D
[20:19:24] <TheresNoTime>	 looks like it has?
[20:19:41] <MatmaRex>	 yep
[20:20:05] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "prep for deploy" [core] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923282 (https://phabricator.wikimedia.org/T337515) (owner: 10Bartosz Dziewoński)
[20:20:31] <wikibugs>	 (03Merged) 10jenkins-bot: Use document feature classes to extract A/B test state [skins/Vector] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923281 (https://phabricator.wikimedia.org/T335972) (owner: 10Jdrewniak)
[20:21:06] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:923281|Use document feature classes to extract A/B test state (T335972)]]
[20:21:10] <stashbot>	 T335972: Launch content separation (Zebra #9) A/B test - https://phabricator.wikimedia.org/T335972
[20:22:35] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T337451 (10phaultfinder)
[20:22:35] <logmsgbot>	 !log samtar@deploy1002 jdrewniak and samtar: Backport for [[gerrit:923281|Use document feature classes to extract A/B test state (T335972)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[20:22:59] <TheresNoTime>	 kimberly_sarabia - live on mwdebug, can you test?
[20:23:06] <kimberly_sarabia>	 sure 
[20:26:11] <kimberly_sarabia>	 TheresNoTime: LGTM!
[20:26:16] <wikibugs>	 (03Abandoned) 10Ottomata: Update Puppet files for Airflow Upgrade to 2.3.2 [puppet] - 10https://gerrit.wikimedia.org/r/827526 (https://phabricator.wikimedia.org/T315580) (owner: 10Snwachukwu)
[20:26:17] <TheresNoTime>	 syncing :)
[20:32:04] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:923281|Use document feature classes to extract A/B test state (T335972)]] (duration: 10m 58s)
[20:32:08] <TheresNoTime>	 and live :)
[20:32:09] <stashbot>	 T335972: Launch content separation (Zebra #9) A/B test - https://phabricator.wikimedia.org/T335972
[20:34:26] <kimberly_sarabia>	 TheresNoTime: TYSM!
[20:34:33] <TheresNoTime>	 you're welcome!
[20:37:36] <wikibugs>	 (03Merged) 10jenkins-bot: Manual backport of OOUI change I63293edd62 (tab dialog fix) [core] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923282 (https://phabricator.wikimedia.org/T337515) (owner: 10Bartosz Dziewoński)
[20:38:23] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:923282|Manual backport of OOUI change I63293edd62 (tab dialog fix) (T337515)]]
[20:38:28] <stashbot>	 T337515: OOUI dialogs with tabs can't be interacted with (except the last tab), e.g. VE image dialog - https://phabricator.wikimedia.org/T337515
[20:40:04] <logmsgbot>	 !log samtar@deploy1002 samtar and matmarex: Backport for [[gerrit:923282|Manual backport of OOUI change I63293edd62 (tab dialog fix) (T337515)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[20:40:07] <TheresNoTime>	 MatmaRex: live on mwdebug
[20:40:33] <MatmaRex>	 looking
[20:41:20] <MatmaRex>	 TheresNoTime: looks good, thank you
[20:41:26] <TheresNoTime>	 syncing
[20:45:01] <wikibugs>	 (03Restored) 10Ladsgroup: BannerRenderer: Make sure the language variant is valid [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427) (owner: 10Ladsgroup)
[20:45:20] <wikibugs>	 (03PS2) 10Thcipriani: BannerRenderer: Make sure the language variant is valid [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427) (owner: 10Ladsgroup)
[20:45:51] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:45:55] <thcipriani>	 heh, 20 seconds slower
[20:46:58] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:923282|Manual backport of OOUI change I63293edd62 (tab dialog fix) (T337515)]] (duration: 08m 34s)
[20:47:02] <TheresNoTime>	 MatmaRex: and live :)
[20:47:03] <stashbot>	 T337515: OOUI dialogs with tabs can't be interacted with (except the last tab), e.g. VE image dialog - https://phabricator.wikimedia.org/T337515
[20:47:05] <MatmaRex>	 thanks!
[20:47:27] <TheresNoTime>	 bblack: done with the backports afaik
[20:47:41] <TheresNoTime>	 !log close UTC late backport
[20:47:43] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] BannerRenderer: Make sure the language variant is valid [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427) (owner: 10Ladsgroup)
[20:47:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:51] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:51:05] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:56:05] <wikibugs>	 (03PS7) 10BCornwall: Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531)
[20:56:19] <wikibugs>	 (03CR) 10BCornwall: "Thanks for all the review, everyone." [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall)
[20:58:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall)
[21:02:36] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Fix cable validator to allow editing of existing cable [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/923406 (https://phabricator.wikimedia.org/T310590) (owner: 10Cathal Mooney)
[21:14:07] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:25:53] <logmsgbot>	 !log htriedman@deploy1002 Started deploy [airflow-dags/platform_eng@77cf676]: (no justification provided)
[21:26:02] <logmsgbot>	 !log htriedman@deploy1002 Finished deploy [airflow-dags/platform_eng@77cf676]: (no justification provided) (duration: 00m 08s)
[21:42:49] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:42:53] <wikibugs>	 (03PS1) 10Effie Mouzeli: conftool: Add more servers to the jobrunner problem [puppet] - 10https://gerrit.wikimedia.org/r/923426 (https://phabricator.wikimedia.org/T329366)
[21:43:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] conftool: Add more servers to the jobrunner problem [puppet] - 10https://gerrit.wikimedia.org/r/923426 (https://phabricator.wikimedia.org/T329366) (owner: 10Effie Mouzeli)
[21:43:53] <wikibugs>	 (03PS2) 10Effie Mouzeli: conftool: Add more servers to the jobrunner problem [puppet] - 10https://gerrit.wikimedia.org/r/923426 (https://phabricator.wikimedia.org/T329366)
[21:51:59] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:55:16] <wikibugs>	 (PS3) Zabe: BannerRenderer: Make sure the language variant is valid [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427) (owner: Ladsgroup)
[21:56:39] <wikibugs>	 (PS1) Zabe: Replace deprecated Hooks::runWithoutAbort [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - https://gerrit.wikimedia.org/r/923283 (https://phabricator.wikimedia.org/T335536)
[21:57:02] <wikibugs>	 (PS4) Zabe: BannerRenderer: Make sure the language variant is valid [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427) (owner: Ladsgroup)
[22:01:53] <wikibugs>	 (CR) Zabe: [C: +2] Replace deprecated Hooks::runWithoutAbort [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - https://gerrit.wikimedia.org/r/923283 (https://phabricator.wikimedia.org/T335536) (owner: Zabe)
[22:02:07] <wikibugs>	 (CR) Zabe: [C: +2] BannerRenderer: Make sure the language variant is valid [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427) (owner: Ladsgroup)
[22:04:22] <wikibugs>	 (Merged) jenkins-bot: Replace deprecated Hooks::runWithoutAbort [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - https://gerrit.wikimedia.org/r/923283 (https://phabricator.wikimedia.org/T335536) (owner: Zabe)
[22:04:27] <wikibugs>	 (Merged) jenkins-bot: BannerRenderer: Make sure the language variant is valid [extensions/CentralNotice] (wmf/1.41.0-wmf.10) - https://gerrit.wikimedia.org/r/923276 (https://phabricator.wikimedia.org/T337427) (owner: Ladsgroup)
[22:05:28] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:923283|Replace deprecated Hooks::runWithoutAbort (T335536)]], [[gerrit:923276|BannerRenderer: Make sure the language variant is valid (T337427)]]
[22:05:34] <stashbot>	 T337427: LanguageConverter: Call to member function replace() on null - https://phabricator.wikimedia.org/T337427
[22:05:34] <stashbot>	 T335536: Hard deprecate class Hooks with all deprecated functions (and remove in 1.42) - https://phabricator.wikimedia.org/T335536
[22:06:59] <logmsgbot>	 !log zabe@deploy1002 zabe and ladsgroup: Backport for [[gerrit:923283|Replace deprecated Hooks::runWithoutAbort (T335536)]], [[gerrit:923276|BannerRenderer: Make sure the language variant is valid (T337427)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[22:10:07] <wikibugs>	 (PS8) BCornwall: Create cookbook to upgrade Apache Traffic Server [cookbooks] - https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531)
[22:13:03] <wikibugs>	 (CR) CI reject: [V: -1] Create cookbook to upgrade Apache Traffic Server [cookbooks] - https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: BCornwall)
[22:14:42] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:923283|Replace deprecated Hooks::runWithoutAbort (T335536)]], [[gerrit:923276|BannerRenderer: Make sure the language variant is valid (T337427)]] (duration: 09m 14s)
[22:14:48] <stashbot>	 T337427: LanguageConverter: Call to member function replace() on null - https://phabricator.wikimedia.org/T337427
[22:14:49] <stashbot>	 T335536: Hard deprecate class Hooks with all deprecated functions (and remove in 1.42) - https://phabricator.wikimedia.org/T335536
[22:19:03] <icinga-wm>	 PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hdfs_rsync_analytics_hadoop_published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:27:42] <wikibugs>	 (PS1) EoghanGaffney: Apply puppet role to new releases hosts [puppet] - https://gerrit.wikimedia.org/r/923429
[22:30:21] <icinga-wm>	 PROBLEM - DPKG on stat1007 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[22:31:33] <icinga-wm>	 RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:31:40] <wikibugs>	 (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41359/console" [puppet] - https://gerrit.wikimedia.org/r/923429 (owner: EoghanGaffney)
[22:34:34] <wikibugs>	 (CR) Dzahn: [C: +1] Apply puppet role to new releases hosts [puppet] - https://gerrit.wikimedia.org/r/923429 (owner: EoghanGaffney)
[22:38:23] <wikibugs>	 (CR) Dzahn: [C: +1] "looks good to me in compiler: https://puppet-compiler.wmflabs.org/output/921244/41360/doc2002.codfw.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/921244 (owner: EoghanGaffney)
[22:44:50] <wikibugs>	 (CR) Jbond: Create cookbook to upgrade Apache Traffic Server (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: BCornwall)
[22:45:27] <wikibugs>	 (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41362/console" [puppet] - https://gerrit.wikimedia.org/r/923429 (owner: EoghanGaffney)
[22:53:14] <wikibugs>	 (CR) Cwhite: [C: +1] mwlog: remove redis instance [puppet] - https://gerrit.wikimedia.org/r/923348 (https://phabricator.wikimedia.org/T327277) (owner: Herron)
[22:55:53] <wikibugs>	 (CR) Jbond: Create cookbook to upgrade Apache Traffic Server (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: BCornwall)
[22:58:07] <wikibugs>	 (CR) Dzahn: [C: +2] miscweb: set ipv4 and ipv6 for 15 and annual blackbox check [puppet] - https://gerrit.wikimedia.org/r/923342 (https://phabricator.wikimedia.org/T300171) (owner: Jelto)
[23:00:07] <wikibugs>	 (CR) Dzahn: [C: +1] "maybe disable puppet on all 4 releases* hosts, stop rsyncd on all 4 hosts, then merge this, double check the timers it creates.. then enab" [puppet] - https://gerrit.wikimedia.org/r/923429 (owner: EoghanGaffney)
[23:00:55] <icinga-wm>	 RECOVERY - DPKG on stat1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[23:02:07] <wikibugs>	 (PS1) Dzahn: Revert "miscweb: set ipv4 and ipv6 for 15 and annual blackbox check" [puppet] - https://gerrit.wikimedia.org/r/923284
[23:02:49] <wikibugs>	 (CR) Dzahn: "Info: Retrieving locales" [puppet] - https://gerrit.wikimedia.org/r/923284 (owner: Dzahn)
[23:03:31] <wikibugs>	 (CR) Dzahn: [C: +2] "unfortunately this fails because there is no IPv6 AAAA record for discovery names" [puppet] - https://gerrit.wikimedia.org/r/923342 (https://phabricator.wikimedia.org/T300171) (owner: Jelto)
[23:06:28] <wikibugs>	 (CR) Dzahn: [C: +2] Revert "miscweb: set ipv4 and ipv6 for 15 and annual blackbox check" [puppet] - https://gerrit.wikimedia.org/r/923284 (owner: Dzahn)
[23:32:33] <wikibugs>	 ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T337451 (phaultfinder)
[23:46:10] <wikibugs>	 (PS1) Andrew Bogott: Openstack trove hacks: update a patch to match the upstream patch in progress at [puppet] - https://gerrit.wikimedia.org/r/923436
[23:46:36] <wikibugs>	 (CR) CI reject: [V: -1] Openstack trove hacks: update a patch to match the upstream patch in progress at [puppet] - https://gerrit.wikimedia.org/r/923436 (owner: Andrew Bogott)
[23:47:35] <wikibugs>	 (PS2) Andrew Bogott: Openstack trove hacks: update a patch [puppet] - https://gerrit.wikimedia.org/r/923436
[23:48:01] <wikibugs>	 (CR) CI reject: [V: -1] Openstack trove hacks: update a patch [puppet] - https://gerrit.wikimedia.org/r/923436 (owner: Andrew Bogott)
[23:57:00] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale