[00:01:27] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:07:25] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is CRITICAL: 121 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37 [00:15:43] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37 [00:26:13] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:58:03] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is OK: (C)100 gt (W)80 gt 68.14 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37 [01:01:15] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is OK: (C)100 gt (W)80 gt 67.12 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37 
[01:02:43] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:15:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:22:27] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37 [01:23:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:24:29] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:25:51] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:26:06] (03PS1) 10RLazarus: icinga: Add downtime_services and remove_service_downtimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803) [01:27:30] (03CR) 10RLazarus: "(Excuse me sending this over the weekend and please don't review until your working hours -- I'm just shifting my schedule around.)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [01:30:17] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:41] (03CR) 10jerkins-bot: [V: 04-1] icinga: Add downtime_services and remove_service_downtimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [01:35:28] (03PS2) 10RLazarus: icinga: Add downtime_services and remove_service_downtimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803) [01:39:47] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37 [01:40:14] (03PS3) 10RLazarus: icinga: Add downtime_services and remove_service_downtimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803) [01:43:37] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37 [01:47:29] (03CR) 10jerkins-bot: [V: 04-1] icinga: Add downtime_services and remove_service_downtimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [01:48:22] (03PS4) 10RLazarus: icinga: Add downtime_services and remove_service_downtimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803) [01:48:59] (03PS1) 10RLazarus: sre.switchdc.mediawiki: Downtime read-only checks on 
the DB primaries [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) [01:49:23] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is OK: (C)100 gt (W)80 gt 57.97 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37 [01:51:32] (03CR) 10RLazarus: "(Excuse me sending this over the weekend and please don't review until your working hours -- I'm just shifting my schedule around.)" [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [01:53:13] (03PS2) 10RLazarus: sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) [02:02:11] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:25:19] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:01:45] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:08:37] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is CRITICAL: 111.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37 [03:26:51] PROBLEM - Check systemd 
state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:50:23] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is CRITICAL: 133.2 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37 [04:01:34] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:06:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 33 hosts with reason: Primary switchover s4 T289650 [04:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:06:46] T289650: Switchover s4 from db2090 to db2110 - https://phabricator.wikimedia.org/T289650 [04:07:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 33 hosts with reason: Primary switchover s4 T289650 [04:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:07:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2110 with weight 0 T289650', diff saved to https://phabricator.wikimedia.org/P17219 and previous config saved to /var/cache/conftool/dbconfig/20210906-040740-root.json [04:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:19:33] (03PS3) 10Marostegui: wmnet: Switchover db2090 with db2110 [dns] - 10https://gerrit.wikimedia.org/r/716218 (https://phabricator.wikimedia.org/T289650) [04:19:36] (03PS3) 10Marostegui: mariadb: Promote db2110 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/716217 (https://phabricator.wikimedia.org/T289650) [04:20:17] RECOVERY - Rate of 
JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is OK: (C)100 gt (W)80 gt 69.15 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37 [04:25:21] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:29:06] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2110 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/716217 (https://phabricator.wikimedia.org/T289650) (owner: 10Marostegui) [04:29:30] We are going to be switching s4 primary master in 30 minutes [04:30:09] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37 [04:30:52] (03PS1) 10Marostegui: db2090: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/718938 (https://phabricator.wikimedia.org/T288803) [04:31:45] (03CR) 10Marostegui: [C: 03+2] db2090: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/718938 (https://phabricator.wikimedia.org/T288803) (owner: 10Marostegui) [04:47:05] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[04:49:01] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is OK: (C)100 gt (W)80 gt 65.08 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37 [05:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210905T0700) [05:00:04] kormat and marostegui: Time to snap out of that daydream and deploy Database primary switchover for s4. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210906T0500). [05:00:10] o/ [05:00:10] let's go? [05:00:32] !log Starting s4 codfw failover from db2090 to db2110 - T289650 [05:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:00:37] T289650: Switchover s4 from db2090 to db2110 - https://phabricator.wikimedia.org/T289650 [05:00:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s4 codfw as read-only for maintenance - T289650', diff saved to https://phabricator.wikimedia.org/P17220 and previous config saved to /var/cache/conftool/dbconfig/20210906-050048-root.json [05:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:01:04] RO confirmed [05:01:09] Same here [05:01:29] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:01:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2110 to s4 primary and set section read-write T289650', diff saved to https://phabricator.wikimedia.org/P17221 and previous config saved to /var/cache/conftool/dbconfig/20210906-050140-root.json [05:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:01:58] all done [05:02:03] I can write [05:02:31] recentchanges is showing activity [05:02:37] marostegui: i can clean up the heartbeat table if you like [05:02:45] kormat: thanks [05:03:40] done [05:03:49] <3 [05:04:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2090 T289650', diff saved to https://phabricator.wikimedia.org/P17222 and previous config saved to /var/cache/conftool/dbconfig/20210906-050419-marostegui.json [05:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2110 (current master) from API T289650', diff saved to https://phabricator.wikimedia.org/P17223 and previous config saved to /var/cache/conftool/dbconfig/20210906-050502-marostegui.json [05:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:55] (03CR) 10Marostegui: [C: 03+2] wmnet: Switchover db2090 with db2110 [dns] - 10https://gerrit.wikimedia.org/r/716218 (https://phabricator.wikimedia.org/T289650) (owner: 10Marostegui) [05:07:47] !log Stop replication on db2090 (old s4 master) T289650 T288803 [05:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:07:52] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [05:07:52] T289650: Switchover s4 from db2090 to db2110 - https://phabricator.wikimedia.org/T289650 [05:08:50] (03PS1) 10Marostegui: install_server: Reimage db2090 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/718939 (https://phabricator.wikimedia.org/T288803) [05:10:51] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db2090 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/718939 (https://phabricator.wikimedia.org/T288803) (owner: 10Marostegui) [05:14:31] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:21:51] PROBLEM - 
Ensure traffic_exporter for the tls instance binds on port 9322 and responds to HTTP requests on cp2039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [05:31:27] RECOVERY - Ensure traffic_exporter for the tls instance binds on port 9322 and responds to HTTP requests on cp2039 is OK: HTTP OK: HTTP/1.0 200 OK - 23523 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [05:32:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2090.codfw.wmnet with reason: REIMAGE [05:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2090.codfw.wmnet with reason: REIMAGE [05:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:55] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 2 (contint1001, ...), Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:56:12] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:59:40] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:19:14] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:23:57] !log Optimize table dewiki.flaggedtemplates in eqiad T290057 [06:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:02] T290057: Optimize flaggedtemplates tables in production - https://phabricator.wikimedia.org/T290057 [06:26:10] PROBLEM - Check systemd state on planet1002 is CRITICAL: 
CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:26:14] (03PS1) 10Elukey: knative-serving,istio: allow only https for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/719035 (https://phabricator.wikimedia.org/T289835) [06:26:47] !log Optimize table bewiki.flaggedtemplates in eqiad T290057 [06:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:15] !log Optimize table mkwiki.flaggedtemplates in eqiad T290057 [06:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:03] (03CR) 10Elukey: [C: 03+1] "LGTM (tested on Firefox)" [software] - 10https://gerrit.wikimedia.org/r/717100 (owner: 10Filippo Giunchedi) [06:48:06] (03CR) 10Elukey: [C: 03+2] knative-serving,istio: allow only https for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/719035 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey) [06:56:07] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [06:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:10] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[06:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:22] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:11:46] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: catch unhandled exception [cookbooks] - 10https://gerrit.wikimedia.org/r/717475 (https://phabricator.wikimedia.org/T290326) (owner: 10Volans) [07:14:23] (03Merged) 10jenkins-bot: sre.hosts.decommission: catch unhandled exception [cookbooks] - 10https://gerrit.wikimedia.org/r/717475 (https://phabricator.wikimedia.org/T290326) (owner: 10Volans) [07:15:52] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:20:08] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:25:06] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:33:55] (03PS1) 10Elukey: knative-serving: set min protocol version to TLSV1_2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/719038 (https://phabricator.wikimedia.org/T289835) [07:34:40] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/709717 (https://phabricator.wikimedia.org/T283159) (owner: 10Hnowlan) [07:39:50] (03CR) 10Elukey: [C: 03+2] knative-serving: set min protocol version to TLSV1_2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/719038 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey) [07:39:52] (03CR) 10Vgutierrez: [C: 03+2] varnish: Allow SSR=2 on XCPS [puppet] - 10https://gerrit.wikimedia.org/r/715541 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [07:44:17] 
!log installing squashfs-tools security updates [07:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:23] (03CR) 10MSantos: [C: 03+1] "ready to go" [puppet] - 10https://gerrit.wikimedia.org/r/716239 (owner: 10MSantos) [07:45:21] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [07:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:25] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [07:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:29] 10SRE, 10Continuous-Integration-Infrastructure, 10MW-1.37-notes (1.37.0-wmf.15; 2021-07-19), 10Patch-For-Review, 10Release-Engineering-Team (Yak Shaving 🐃🪒): Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (10kostajh) Update from previous comm... [07:51:54] !log installing libssh security updates [07:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:15] (03CR) 10Ema: [C: 03+2] varnish: add tests for unknown XCPS session reuse [puppet] - 10https://gerrit.wikimedia.org/r/715952 (https://phabricator.wikimedia.org/T271421) (owner: 10Ema) [07:57:47] !log fail sdw on ms-be1062, reported errors [07:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:11] (03CR) 10Marostegui: [C: 03+2] dbbackups: Migrate s4 generation from db2097 (stretch) to db2139 (buster) [puppet] - 10https://gerrit.wikimedia.org/r/715919 (https://phabricator.wikimedia.org/T288803) (owner: 10Jcrespo) [08:01:24] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:22:54] (03PS1) 10JMeybohm: Install main version as istioctl plus bash completion [debs/istioctl] - 10https://gerrit.wikimedia.org/r/719040 [08:23:17] (03CR) 10Kosta Harlan: Growth: 
Remove config that moved on-wiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717039 (https://phabricator.wikimedia.org/T290295) (owner: 10Urbanecm) [08:24:15] (03CR) 10JMeybohm: "I've added 1.11.2 just because it's the latest release" [debs/istioctl] - 10https://gerrit.wikimedia.org/r/719040 (owner: 10JMeybohm) [08:26:14] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:27:05] (03PS14) 10Jbond: resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) [08:30:33] (03PS1) 10Jelto: remove backup crontab managed by Ansible [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/719041 (https://phabricator.wikimedia.org/T283076) [08:39:13] (03PS1) 10Muehlenhoff: Failover idp.wikimedia.org to idp1001 [dns] - 10https://gerrit.wikimedia.org/r/719043 [08:39:40] (03CR) 10JMeybohm: [C: 03+1] check_binary: Improve error message [deployment-charts] - 10https://gerrit.wikimedia.org/r/717605 (owner: 10Ahmon Dancy) [08:45:06] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 42.61 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:45:42] (03CR) 10Muehlenhoff: [C: 03+2] Failover idp.wikimedia.org to idp1001 [dns] - 10https://gerrit.wikimedia.org/r/719043 (owner: 10Muehlenhoff) [08:45:57] (03CR) 10Filippo Giunchedi: [C: 03+2] clinic-duty: Misc JS clean ups [software] - 10https://gerrit.wikimedia.org/r/717651 (owner: 10Krinkle) [08:46:23] !log update networking fact - gerrit:715943 [08:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:40] (03CR) 10Jbond: [C: 03+2] facter networking: override the networking.ip6 fact [puppet] - 
10https://gerrit.wikimedia.org/r/715943 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [08:47:00] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:47:29] (03CR) 10Jbond: [C: 03+2] facter networking: override the networking.ip6 fact (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715943 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [08:47:43] (03PS16) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) [08:47:45] (03PS22) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [08:47:47] (03PS23) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [08:47:49] (03PS15) 10Jbond: resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) [08:47:51] (03PS15) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) [08:47:53] (03PS1) 10Vgutierrez: sslcert: Provide chained TLS cert with private key [puppet] - 10https://gerrit.wikimedia.org/r/719044 (https://phabricator.wikimedia.org/T290005) [08:48:26] (03CR) 10JMeybohm: mediawiki-dev: Run setup-db as helm hook (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/717066 (owner: 10JMeybohm) [08:49:03] (03Abandoned) 10JMeybohm: mediawiki-dev: Run setup-db as helm hook [deployment-charts] - 10https://gerrit.wikimedia.org/r/717066 (owner: 10JMeybohm) [08:49:59] (03CR) 
10JMeybohm: [C: 03+1] "I'm happy with this as well" [deployment-charts] - 10https://gerrit.wikimedia.org/r/717621 (owner: 10Ahmon Dancy) [08:50:52] (03PS2) 10Vgutierrez: sslcert: Provide chained TLS cert with private key [puppet] - 10https://gerrit.wikimedia.org/r/719044 (https://phabricator.wikimedia.org/T290005) [08:51:43] (03CR) 10Volans: "Thanks for the patch! In general looks good to me, there are a couple of comments, the rest are all small nits/questions." [software/spicerack] - 10https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [08:51:44] PROBLEM - MegaRAID on ms-be1062 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:51:45] ACKNOWLEDGEMENT - MegaRAID on ms-be1062 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T290416 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:51:49] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1062 - https://phabricator.wikimedia.org/T290416 (10ops-monitoring-bot) [08:52:44] (03CR) 10jerkins-bot: [V: 04-1] sslcert: Provide chained TLS cert with private key [puppet] - 10https://gerrit.wikimedia.org/r/719044 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [08:54:21] (03PS3) 10Vgutierrez: sslcert: Provide chained TLS cert with private key [puppet] - 10https://gerrit.wikimedia.org/r/719044 (https://phabricator.wikimedia.org/T290005) [08:56:26] 10SRE, 10MediaWiki-Uploading: Unexpected upload speed to commons - https://phabricator.wikimedia.org/T288481 (10aborrero) [08:56:58] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] haproxy: Use systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/715742 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [09:02:42] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[09:03:19] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [09:03:46] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:06:03] !log restart blazegraph and updater on wdqs1007 [09:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:30] (03PS4) 10Vgutierrez: sslcert: Provide chained TLS cert with private key [puppet] - 10https://gerrit.wikimedia.org/r/719044 (https://phabricator.wikimedia.org/T290005) [09:06:39] (03CR) 10Volans: [C: 04-1] "Couple of things that need to be fixed, looks good otherwise." [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [09:08:48] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:09:08] the statograph post is an expired downtime fyi, I'll open a task [09:09:53] volans: I manually offlined a PD 21 on ms-be1062, and /usr/local/lib/nagios/plugins/get-raid-status-megacli does indeed show it, but I haven't seen a task yet, expected ? [09:10:04] 10Puppet, 10Continuous-Integration-Infrastructure, 10Infrastructure-Foundations: Puppet failure on integration-puppetmaster-02.integration.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T290422 (10hashar) [09:12:11] (03CR) 10Elukey: "The change LGTM! 
I have to make a note though, namely that I considered the `istioctl` alias dangerous (this is why I have avoided it for " [debs/istioctl] - 10https://gerrit.wikimedia.org/r/719040 (owner: 10JMeybohm) [09:12:34] godog: looking if icinga is alerting [09:12:49] 10Puppet, 10Continuous-Integration-Infrastructure, 10Infrastructure-Foundations: Puppet failure on integration-puppetmaster-02.integration.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T290422 (10hashar) Files are dated 2021-09-05 15:18:57 and we have @Andrew login at that time: ` andrew pts/... [09:13:11] godog: https://phabricator.wikimedia.org/T290416 [09:13:52] Mon 10:51:49 wikibugs| SRE, ops-eqiad: Degraded RAID on ms-be1062 - https://phabricator.wikimedia.org/T290416 (ops-monitoring-bot) [09:14:12] seems to have worked fine to me, unless I'm missing something [09:15:39] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 11): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31010/console" [puppet] - 10https://gerrit.wikimedia.org/r/719044 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [09:16:33] (03CR) 10Elukey: [C: 03+1] echoserver: Add echoserver debug container (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/717412 (owner: 10JMeybohm) [09:17:39] volans: gah, my bad I totally missed it! thank you [09:18:38] (03CR) 10Elukey: [C: 03+1] "Note: I haven't checked in depth what teplates.lua and the full nginx config do, for a simple echo server it seems a lot of configuration b" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/717412 (owner: 10JMeybohm) [09:19:31] no prob :) [09:21:26] 10Puppet, 10Continuous-Integration-Infrastructure, 10Infrastructure-Foundations: Puppet failure on integration-puppetmaster-02.integration.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T290422 (10hashar) From SAL: **2021-09-05** `lang=irc 15:15 changing the puppetmaster for...
[09:22:45] !log depooling wdqs1007, catching up on lag [09:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:37] 10SRE, 10DBA, 10Traffic, 10User-Ladsgroup, 10Wikimedia-Incident: 2021-09-04 enwiki was down at 10:44 (UTC) - https://phabricator.wikimedia.org/T290379 (10Marostegui) [09:25:40] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:27:34] 10SRE, 10LDAP-Access-Requests: Absent jkroll ldap account (including ldap/wmde removal) - https://phabricator.wikimedia.org/T290413 (10Addshore) [09:27:37] 10SRE, 10LDAP-Access-Requests: Absent johlig ldap account (including ldap/wmde removal) - https://phabricator.wikimedia.org/T290412 (10Addshore) [09:27:39] 10SRE, 10LDAP-Access-Requests: Absent ataherivand ldap account (including ldap/wmde removal) - https://phabricator.wikimedia.org/T290411 (10Addshore) [09:27:43] 10SRE, 10LDAP-Access-Requests: Absent bteshome ldap account (including ldap/wmde removal) - https://phabricator.wikimedia.org/T290410 (10Addshore) [09:34:06] (03PS1) 10Muehlenhoff: Remove privileged LDAP access for jkroll [puppet] - 10https://gerrit.wikimedia.org/r/719046 (https://phabricator.wikimedia.org/T290413) [09:36:10] (03CR) 10Muehlenhoff: [C: 03+2] Remove privileged LDAP access for jkroll [puppet] - 10https://gerrit.wikimedia.org/r/719046 (https://phabricator.wikimedia.org/T290413) (owner: 10Muehlenhoff) [09:36:27] can an op swap me with akosiaris for clinic duty? tia [09:36:59] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Absent jkroll ldap account (including ldap/wmde removal) - https://phabricator.wikimedia.org/T290413 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Thanks, access to the wmde and nda groups has been removed. 
[09:37:14] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) The average backup speed is now around 50 files/s, with a 3% overhead over normal traffic. We had backed up almost 20... [09:39:35] (03PS1) 10Muehlenhoff: Remove privileged LDAP access for johlig [puppet] - 10https://gerrit.wikimedia.org/r/719047 (https://phabricator.wikimedia.org/T290412) [09:40:06] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:40:37] (03PS2) 10Ema: rsyslog: expand output lookup table docs [puppet] - 10https://gerrit.wikimedia.org/r/717311 (https://phabricator.wikimedia.org/T206454) [09:40:47] (03CR) 10Muehlenhoff: [C: 03+2] Remove privileged LDAP access for johlig [puppet] - 10https://gerrit.wikimedia.org/r/719047 (https://phabricator.wikimedia.org/T290412) (owner: 10Muehlenhoff) [09:41:21] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Absent johlig ldap account (including ldap/wmde removal) - https://phabricator.wikimedia.org/T290412 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Thanks, access to the wmde group has been removed. 
[09:43:02] (03CR) 10Ema: rsyslog: expand output lookup table docs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/717311 (https://phabricator.wikimedia.org/T206454) (owner: 10Ema) [09:43:41] (03PS1) 10Muehlenhoff: Remove privileged LDAP access for ataherivand [puppet] - 10https://gerrit.wikimedia.org/r/719048 (https://phabricator.wikimedia.org/T290411) [09:43:48] (03CR) 10Ema: [C: 03+2] rsyslog: expand output lookup table docs [puppet] - 10https://gerrit.wikimedia.org/r/717311 (https://phabricator.wikimedia.org/T206454) (owner: 10Ema) [09:44:29] (03CR) 10Muehlenhoff: [C: 03+2] Remove privileged LDAP access for ataherivand [puppet] - 10https://gerrit.wikimedia.org/r/719048 (https://phabricator.wikimedia.org/T290411) (owner: 10Muehlenhoff) [09:44:50] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Absent ataherivand ldap account (including ldap/wmde removal) - https://phabricator.wikimedia.org/T290411 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Thanks, access to the wmde group has been removed. [09:45:49] marostegui: thank you <3 [09:47:14] (03PS1) 10Muehlenhoff: Remove privileged LDAP access for btheshome [puppet] - 10https://gerrit.wikimedia.org/r/719049 (https://phabricator.wikimedia.org/T290410) [09:47:16] (03PS5) 10Volans: decorators: migrate to the wmflib version [software/spicerack] - 10https://gerrit.wikimedia.org/r/704345 (https://phabricator.wikimedia.org/T257905) [09:47:31] (03CR) 10jerkins-bot: [V: 04-1] Remove privileged LDAP access for btheshome [puppet] - 10https://gerrit.wikimedia.org/r/719049 (https://phabricator.wikimedia.org/T290410) (owner: 10Muehlenhoff) [09:47:44] (03CR) 10Volans: "Rebased resolving conflicts, no changes." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/704345 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [09:47:46] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:50:59] (03PS3) 10Filippo Giunchedi: clinic-duty: add equinix maint support [software] - 10https://gerrit.wikimedia.org/r/717100 [09:51:40] (03PS2) 10Muehlenhoff: Remove privileged LDAP access for btheshome [puppet] - 10https://gerrit.wikimedia.org/r/719049 (https://phabricator.wikimedia.org/T290410) [09:53:54] (03CR) 10Muehlenhoff: [C: 03+2] Remove privileged LDAP access for btheshome [puppet] - 10https://gerrit.wikimedia.org/r/719049 (https://phabricator.wikimedia.org/T290410) (owner: 10Muehlenhoff) [09:54:21] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Absent bteshome ldap account (including ldap/wmde removal) - https://phabricator.wikimedia.org/T290410 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Thanks, access for the nda and wmde groups has been removed. [09:54:22] (03PS1) 10Hnowlan: Assume default on single-instance hosts [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/719051 (https://phabricator.wikimedia.org/T178169) [09:54:40] (03CR) 10Filippo Giunchedi: clinic-duty: add equinix maint support (032 comments) [software] - 10https://gerrit.wikimedia.org/r/717100 (owner: 10Filippo Giunchedi) [09:56:52] (03PS1) 10Ema: rsyslog: stop saving trafficserver logs to disk [puppet] - 10https://gerrit.wikimedia.org/r/719052 (https://phabricator.wikimedia.org/T290305) [10:00:23] (03CR) 10Ema: "Take two, I've added the skip-rsyslog logic to /etc/rsyslog.d/20-trafficserver.conf." 
[puppet] - 10https://gerrit.wikimedia.org/r/719052 (https://phabricator.wikimedia.org/T290305) (owner: 10Ema) [10:01:20] (03PS1) 10Filippo Giunchedi: admin: replace christinedk key [puppet] - 10https://gerrit.wikimedia.org/r/719053 (https://phabricator.wikimedia.org/T290279) [10:02:02] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:03:04] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:03:12] (03PS1) 10Urbanecm: renameRestrictions.php: Update protected_titles as well [core] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/719069 (https://phabricator.wikimedia.org/T290398) [10:07:22] (03CR) 10Volans: [C: 03+2] decorators: migrate to the wmflib version [software/spicerack] - 10https://gerrit.wikimedia.org/r/704345 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [10:07:34] (03PS5) 10Hnowlan: cassandra: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) [10:08:45] (03CR) 10jerkins-bot: [V: 04-1] cassandra: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) (owner: 10Hnowlan) [10:08:52] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:12:08] (03CR) 10Hnowlan: [C: 03+2] maps: re-enable OSM sync and tile generation in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/716239 (owner: 10MSantos) [10:12:16] (03CR) 10Ema: [C: 04-1] "Nice, but there's still a rough edge!" 
[grafana-grizzly] - 10https://gerrit.wikimedia.org/r/717587 (https://phabricator.wikimedia.org/T289036) (owner: 10Herron) [10:12:31] (03Merged) 10jenkins-bot: decorators: migrate to the wmflib version [software/spicerack] - 10https://gerrit.wikimedia.org/r/704345 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [10:13:19] (03PS2) 10Volans: ipmi: refactor class signature [software/spicerack] - 10https://gerrit.wikimedia.org/r/717250 [10:14:00] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [10:16:30] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:17:00] !log volans@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc1027.eqiad.wmnet [10:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:34] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:19:09] (03PS1) 10Jbond: puppet_agent_stats: add catalog version to prom metricts [puppet] - 
10https://gerrit.wikimedia.org/r/719056 (https://phabricator.wikimedia.org/T283585) [10:22:47] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mc1027.eqiad.wmnet [10:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:54] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by volans@cumin1001 for hosts: `mc1027.eqiad.wmnet` - mc1027.eqiad.wmnet (... [10:26:55] (03PS1) 10Volans: sre.hosts.decommission: fix Icinga exception name [cookbooks] - 10https://gerrit.wikimedia.org/r/719057 (https://phabricator.wikimedia.org/T290326) [10:27:18] (03CR) 10Urbanecm: [C: 03+2] "merging in advance of B&C, to give CI time to process" [core] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/719069 (https://phabricator.wikimedia.org/T290398) (owner: 10Urbanecm) [10:27:25] (03CR) 10Volans: [C: 03+2] ipmi: refactor class signature [software/spicerack] - 10https://gerrit.wikimedia.org/r/717250 (owner: 10Volans) [10:27:39] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:27:43] 10SRE, 10SRE-OnFire, 10observability, 10User-jbond: statograph_post service fail on alert hosts - https://phabricator.wikimedia.org/T290425 (10fgiunchedi) [10:27:59] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:28:49] 10SRE: Planet update service flapping/failing on planet1002 - https://phabricator.wikimedia.org/T289984 (10Volans) FYI the alert is flapping again. 
[10:29:03] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:30:05] jan_drewniak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikimedia Portals Update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210906T1030). [10:31:11] (03CR) 10Volans: [C: 03+2] "Trivial, self-merging to unblock decomming of mc1027" [cookbooks] - 10https://gerrit.wikimedia.org/r/719057 (https://phabricator.wikimedia.org/T290326) (owner: 10Volans) [10:31:21] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2001.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:32:37] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:33:16] (03Merged) 10jenkins-bot: ipmi: refactor class signature [software/spicerack] - 10https://gerrit.wikimedia.org/r/717250 (owner: 10Volans) [10:34:28] (03Merged) 10jenkins-bot: sre.hosts.decommission: fix Icinga exception name [cookbooks] - 10https://gerrit.wikimedia.org/r/719057 (https://phabricator.wikimedia.org/T290326) (owner: 10Volans) [10:35:03] (03PS2) 10Volans: setup.py: revert upper limit for regex [software/spicerack] - 10https://gerrit.wikimedia.org/r/717383 [10:38:18] !log volans@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc1027.eqiad.wmnet [10:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:13] !log volans@cumin1001 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97) for hosts mc1027.eqiad.wmnet [10:39:16] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:01] (03Merged) 10jenkins-bot: renameRestrictions.php: Update protected_titles as well [core] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/719069 (https://phabricator.wikimedia.org/T290398) (owner: 10Urbanecm) [10:48:21] ^^i'll deploy that in ~10 minutes^^ [10:50:56] (03PS5) 10Vgutierrez: sslcert: Provide chained TLS cert with private key [puppet] - 10https://gerrit.wikimedia.org/r/719044 (https://phabricator.wikimedia.org/T290005) [10:51:53] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:52:23] (03PS1) 10Volans: sre.hosts.decommission: apply Icinga fix for mgmt [cookbooks] - 10https://gerrit.wikimedia.org/r/719060 (https://phabricator.wikimedia.org/T290326) [10:52:35] (03CR) 10Volans: [C: 03+2] setup.py: revert upper limit for regex [software/spicerack] - 10https://gerrit.wikimedia.org/r/717383 (owner: 10Volans) [10:56:01] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:57:45] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:57:53] (03PS1) 10Muehlenhoff: Update puppetised java.security file for Java 11.0.12 [puppet] - 10https://gerrit.wikimedia.org/r/719064 [10:58:57] (03Merged) 10jenkins-bot: setup.py: revert upper limit for regex [software/spicerack] - 10https://gerrit.wikimedia.org/r/717383 (owner: 10Volans) [10:59:01] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2001.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2001.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2001.codfw.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [10:59:56] (03PS5) 10JMeybohm: echoserver: Add echoserver debug container [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/717412 [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate European mid-day backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210906T1100). [11:00:04] Urbanecm: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:36] i'll self-serve [11:00:51] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:02:05] (03PS2) 10Urbanecm: Growth: Define wgGEMentorDashboardDiscoveryEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716431 (https://phabricator.wikimedia.org/T289054) [11:02:15] (03CR) 10Urbanecm: [C: 03+2] Growth: Define wgGEMentorDashboardDiscoveryEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716431 (https://phabricator.wikimedia.org/T289054) (owner: 10Urbanecm) [11:02:34] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.21/maintenance/renameRestrictions.php: 18e43ecca7d25d2d93de2f98f3bf5b36f5d4b780: renameRestrictions.php: Update protected_titles as well (T290398) (duration: 00m 59s) [11:02:35] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:39] T290398: renameRestrictions.php should also update protected_titles table - https://phabricator.wikimedia.org/T290398 [11:03:01] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:03:04] (03Merged) 10jenkins-bot: Growth: Define wgGEMentorDashboardDiscoveryEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716431 (https://phabricator.wikimedia.org/T289054) (owner: 10Urbanecm) [11:03:37] (03PS4) 10Urbanecm: foundationwiki: Create editor group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715979 (https://phabricator.wikimedia.org/T205352) [11:03:40] (03CR) 10Urbanecm: [C: 03+2] foundationwiki: Create editor group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715979 (https://phabricator.wikimedia.org/T205352) (owner: 10Urbanecm) [11:04:16] (03PS6) 10JMeybohm: echoserver: Add echoserver debug container [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/717412 [11:04:24] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: f90862be8c7b540065da24c24f2e2ac0df5b9d07: Growth: Define wgGEMentorDashboardDiscoveryEnabled (T289054) (duration: 00m 58s) [11:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:29] T289054: Mentor dashboard: Add discovery features - https://phabricator.wikimedia.org/T289054 [11:04:29] (03CR) 10JMeybohm: echoserver: Add echoserver debug container (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/717412 (owner: 10JMeybohm) [11:04:46] (03Merged) 10jenkins-bot: foundationwiki: Create editor group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715979 (https://phabricator.wikimedia.org/T205352) (owner: 10Urbanecm) [11:06:54] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: c8d7cf8f7c3faaf3773940e96ba0cf599e725237: foundationwiki: Create editor group (T205352) (duration: 00m 57s) [11:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:59] T205352: Create new user group for editors on Governance wiki - https://phabricator.wikimedia.org/T205352 [11:07:02] !log EU 
B&C window done [11:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:25] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:08:37] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:10:11] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:33] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:20:03] (03PS8) 10Hnowlan: postgresql::user: split HBA configuration into a different define [puppet] - 10https://gerrit.wikimedia.org/r/709717 (https://phabricator.wikimedia.org/T283159) [11:22:54] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31011/console" [puppet] - 10https://gerrit.wikimedia.org/r/709717 (https://phabricator.wikimedia.org/T283159) (owner: 10Hnowlan) [11:25:36] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31012/console" [puppet] - 10https://gerrit.wikimedia.org/r/709717 (https://phabricator.wikimedia.org/T283159) (owner: 10Hnowlan) [11:26:09] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:26:15] 10SRE, 10serviceops: Cloud VPS alert][packaging] Puppet failure on builder-envoy-03.packaging.eqiad.wmflabs - 
https://phabricator.wikimedia.org/T290430 (10JMeybohm) 05Open→03Resolved p:05Triage→03Medium [11:26:37] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:27:44] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] postgresql::user: split HBA configuration into a different define [puppet] - 10https://gerrit.wikimedia.org/r/709717 (https://phabricator.wikimedia.org/T283159) (owner: 10Hnowlan) [11:32:19] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:35:59] (03PS2) 10Hnowlan: profile::maps::osm_replica: Allow replicas to be connected to by tegola [puppet] - 10https://gerrit.wikimedia.org/r/710013 (https://phabricator.wikimedia.org/T283159) [11:37:59] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31013/console" [puppet] - 10https://gerrit.wikimedia.org/r/710013 (https://phabricator.wikimedia.org/T283159) (owner: 10Hnowlan) [11:40:31] 10SRE, 10Infrastructure-Foundations: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (10MoritzMuehlenhoff) [11:48:47] !log [urbanecm@mwmaint2002 ~]$ mwscript renameRestrictions.php --wiki=arwiki 'autoreview' 'editautoreviewprotected' # T230103 [11:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:52] T230103: Systematize wgRestrictionLevels - https://phabricator.wikimedia.org/T230103 [11:50:19] !log [urbanecm@mwmaint2002 ~]$ mwscript renameRestrictions.php --wiki=dewiktionary 'autoreviewprotected' 'editautoreviewprotected' # T230103 [11:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:28] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] profile::maps::osm_replica: Allow replicas to be connected to 
by tegola [puppet] - 10https://gerrit.wikimedia.org/r/710013 (https://phabricator.wikimedia.org/T283159) (owner: 10Hnowlan) [11:52:05] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:53:37] !log [urbanecm@mwmaint2002 ~]$ mwscript renameRestrictions.php --wiki=etwiki 'autopatrol' 'editautopatrolprotected' # T230103 [11:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:13] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:56:27] !log [urbanecm@mwmaint2002 ~]$ mwscript renameRestrictions.php --wiki={hewiki,lvwiki,srwiki,srwikibooks} 'autopatrol' 'editautopatrolprotected' # T230103 [11:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:32] T230103: Systematize wgRestrictionLevels - https://phabricator.wikimedia.org/T230103 [11:58:22] !log [urbanecm@mwmaint2002 ~]$ mwscript renameRestrictions.php --wiki=plwiki 'editor' 'editeditorprotected' # T230103 [11:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:11] (03PS1) 10Jbond: wmflib: puppet prometheus reporting [puppet] - 10https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) [12:02:23] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:02:46] (03CR) 10Jbond: "this is not completed but thought you may have some early feedback" [puppet] - 10https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) (owner: 10Jbond) [12:03:00] (03CR) 10jerkins-bot: [V: 04-1] wmflib: puppet prometheus reporting [puppet] - 10https://gerrit.wikimedia.org/r/719100 
(https://phabricator.wikimedia.org/T283585) (owner: 10Jbond) [12:03:09] 10SRE, 10MediaWiki-extensions-Translate, 10Datacenter-Switchover, 10Performance-Team (Radar), 10Wikimedia-production-error: DBPerformance warning "Query returned XXXX rows: query: SELECT * FROM `translate_metadata`" on Meta-Wiki - https://phabricator.wikimedia.org/T204026 (10Nikerabbit) [12:04:47] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:04:55] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:06:03] !log silence statograph until thurs on alert1001 - T290425 [12:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:12] T290425: statograph_post service fail on alert hosts - https://phabricator.wikimedia.org/T290425 [12:08:03] /away [12:08:06] almost! [12:09:08] (03CR) 10Filippo Giunchedi: [C: 03+1] rsyslog: stop saving trafficserver logs to disk [puppet] - 10https://gerrit.wikimedia.org/r/719052 (https://phabricator.wikimedia.org/T290305) (owner: 10Ema) [12:16:15] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:16:19] (03CR) 10Ema: "A couple of nits." 
[puppet] - 10https://gerrit.wikimedia.org/r/719044 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [12:17:05] (03CR) 10MVernon: [C: 03+2] prometheus: couple mysqld export service to mariadb (multi-instance) [puppet] - 10https://gerrit.wikimedia.org/r/716306 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [12:24:26] (03PS2) 10Alexandros Kosiaris: Add a timeout parameter [software/benchmw] - 10https://gerrit.wikimedia.org/r/716371 [12:24:28] (03PS1) 10Alexandros Kosiaris: Using full names instead of shorthands [software/benchmw] - 10https://gerrit.wikimedia.org/r/719103 [12:24:30] (03PS1) 10Alexandros Kosiaris: Fix title of load test [software/benchmw] - 10https://gerrit.wikimedia.org/r/719104 [12:24:32] (03PS1) 10Alexandros Kosiaris: Add the ability to generate comparisions of latency percentiles [software/benchmw] - 10https://gerrit.wikimedia.org/r/719105 [12:25:01] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:39] !log jiji@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:33] 10SRE-Access-Requests: Requesting to replace my old ssh public key with a new one - https://phabricator.wikimedia.org/T290433 (10JAbrams) [12:37:11] (03PS1) 10Alexandros Kosiaris: Bump mwdebug quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/719106 [12:38:04] (03PS1) 10Filippo Giunchedi: icinga: remove check_grafana_alert [puppet] - 10https://gerrit.wikimedia.org/r/719107 (https://phabricator.wikimedia.org/T281359) [12:38:59] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Bump mwdebug quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/719106 (owner: 10Alexandros Kosiaris) [12:40:37] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'. [12:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:46] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'. [12:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:01] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:25] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:43] !log jiji@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:57] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [12:48:23] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [12:52:39] (03PS2) 10Gehel: query service: Fix loading of DCATAP file [puppet] - 10https://gerrit.wikimedia.org/r/715696 (https://phabricator.wikimedia.org/T289517) (owner: 10DCausse) [12:53:47] (03CR) 10Gehel: [C: 03+2] query service: Fix loading of DCATAP file [puppet] - 10https://gerrit.wikimedia.org/r/715696 (https://phabricator.wikimedia.org/T289517) (owner: 10DCausse) [12:56:43] (03PS1) 10Jbond: P:java: add spec tests for java profile [puppet] - 10https://gerrit.wikimedia.org/r/719109 [12:57:45] (03CR) 10Jbond: "see comments and nits inline. the -1 are around adding your own spec helper. this shouldn't be needed. 
the shared spec helper configure" [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) (owner: 10Hnowlan) [13:01:09] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:12] 10SRE, 10SRE-Access-Requests: Replace JAbrams' old ssh public key with a new one - https://phabricator.wikimedia.org/T290433 (10Aklapper) [13:04:27] !log jiji@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:41] (03CR) 10Vgutierrez: [C: 03+1] rsyslog: stop saving trafficserver logs to disk [puppet] - 10https://gerrit.wikimedia.org/r/719052 (https://phabricator.wikimedia.org/T290305) (owner: 10Ema) [13:09:12] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) >>! In T251305#6886431, @JMeybohm wrote: > helm test annotations changed a bit: > >> Note that until Helm v3, the job definition needed to contain one of these helm test hook ann... [13:14:10] 10SRE, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Translate, 10Wikimedia-Fundraising, and 3 others: DBPerformance warning "Query returned XXXX rows: query: SELECT * FROM `translate_metadata`" on Meta-Wiki - https://phabricator.wikimedia.org/T204026 (10Nikerabbit) Metawiki currently has 38... 
[13:18:59] (03PS6) 10Vgutierrez: sslcert: Provide chained TLS cert with private key [puppet] - 10https://gerrit.wikimedia.org/r/719044 (https://phabricator.wikimedia.org/T290005) [13:19:14] (03CR) 10Vgutierrez: sslcert: Provide chained TLS cert with private key (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/719044 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [13:26:29] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:11] (03CR) 10Jelto: [V: 03+1] gitlab::backup move backup cronjobs to puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/712322 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [13:33:49] (03CR) 10Volans: "The code looks sane to me, but I'm not familiar of the effects/overhead on the prometheus side to add this info with such cardinality. If " [puppet] - 10https://gerrit.wikimedia.org/r/719056 (https://phabricator.wikimedia.org/T283585) (owner: 10Jbond) [13:35:43] jouncebot: now [13:35:43] No deployments scheduled for the next 3 hour(s) and 24 minute(s) [13:35:47] jouncebot: next [13:35:48] In 3 hour(s) and 24 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210906T1700) [13:41:10] (03PS1) 10Effie Mouzeli: ProductionServices: fix comment for rdb* servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719118 [13:42:45] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10JMeybohm) >>! In T251305#7334407, @Jelto wrote: > I checked all charts for deprecated and removed helm annotations. We don't use `"helm.sh/hook": test-failure`. This annotation is remove... 
[13:42:58] !log updated thirdparty/gitlab component to 14.0.10 T284811 [13:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:02] (03CR) 10jerkins-bot: [V: 04-1] ProductionServices: fix comment for rdb* servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719118 (owner: 10Effie Mouzeli) [13:43:03] T284811: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 [13:43:09] (03PS1) 10MMandere: puppetmaster: Add drmrs DC Site [puppet] - 10https://gerrit.wikimedia.org/r/719119 (https://phabricator.wikimedia.org/T282787) [13:44:12] 10SRE, 10MW-on-K8s, 10serviceops: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10akosiaris) With @jijiki we went ahead and create some percentiles comparisons between `mw2254` and `pinkunicorn`. We chose to have the exact same number of php fpm workers (96) as an inv... [13:45:24] (03CR) 10Ema: [C: 03+1] sslcert: Provide chained TLS cert with private key (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719044 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [13:45:29] (03PS2) 10Effie Mouzeli: ProductionServices: fix comment for rdb* servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719118 [13:48:11] (03CR) 10Effie Mouzeli: [C: 03+2] ProductionServices: fix comment for rdb* servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719118 (owner: 10Effie Mouzeli) [13:48:26] (03CR) 10Jbond: Update puppetised java.security file for Java 11.0.12 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719064 (owner: 10Muehlenhoff) [13:48:31] (03CR) 10Muehlenhoff: cassandra: use profile::java (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) (owner: 10Hnowlan) [13:49:10] (03Merged) 10jenkins-bot: ProductionServices: fix comment for rdb* servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719118 (owner: 10Effie 
Mouzeli) [13:50:12] (03CR) 10Muehlenhoff: Update puppetised java.security file for Java 11.0.12 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719064 (owner: 10Muehlenhoff) [13:51:30] !log jiji@deploy1002 Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:719118|ProductionServices: fix comment for rdb* servers]] (duration: 00m 58s) [13:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:01] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:53:33] 10SRE, 10Infrastructure-Foundations: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (10akosiaris) [13:53:54] (03PS4) 10Jbond: facter networking: filter out cali/tap interfaces [puppet] - 10https://gerrit.wikimedia.org/r/715949 (https://phabricator.wikimedia.org/T265904) [13:55:38] 10SRE, 10Infrastructure-Foundations: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (10akosiaris) One word of warning. Make sure that the new version of qemu on buster is compatible with the old one (e.g. regarding migrations). It should be, but in the past they 've b... 
[13:56:54] !log update facter networking fact gerrit:715949 [13:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:58] (03CR) 10Jbond: [C: 03+2] facter networking: filter out cali/tap interfaces [puppet] - 10https://gerrit.wikimedia.org/r/715949 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [13:59:00] (03PS1) 10MVernon: pc1: remove puppet entries for pc1007 [puppet] - 10https://gerrit.wikimedia.org/r/719120 (https://phabricator.wikimedia.org/T289118) [14:02:45] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:45] !log re-pooling wdqs1007, catched up on lag [14:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:04] (03CR) 10Jbond: [C: 03+1] sre.hosts.decommission: apply Icinga fix for mgmt [cookbooks] - 10https://gerrit.wikimedia.org/r/719060 (https://phabricator.wikimedia.org/T290326) (owner: 10Volans) [14:08:42] (03CR) 10Kormat: "One comment, the rest looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/719120 (https://phabricator.wikimedia.org/T289118) (owner: 10MVernon) [14:10:13] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [14:10:18] (03PS5) 10Ladsgroup: Set permission of creating short url to everyone everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715492 (https://phabricator.wikimedia.org/T267921) [14:10:25] jouncebot: now [14:10:25] No deployments scheduled for the next 2 hour(s) and 49 minute(s) [14:10:30] noice [14:10:37] deploying the url shortener thingy [14:11:02] (03CR) 10Ladsgroup: [C: 03+2] Set permission of creating short url to everyone everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715492 (https://phabricator.wikimedia.org/T267921) (owner: 10Ladsgroup) [14:11:50] (03Merged) 10jenkins-bot: Set permission of creating short url to everyone everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715492 (https://phabricator.wikimedia.org/T267921) (owner: 10Ladsgroup) [14:12:36] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: apply Icinga fix for mgmt [cookbooks] - 10https://gerrit.wikimedia.org/r/719060 (https://phabricator.wikimedia.org/T290326) (owner: 10Volans) [14:12:42] !log installing postgres 9.6 security updates [14:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:42] (03CR) 10Jbond: puppet_agent_stats: add catalog version to prom metricts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719056 (https://phabricator.wikimedia.org/T283585) (owner: 10Jbond) [14:15:15] (03Merged) 10jenkins-bot: sre.hosts.decommission: apply Icinga fix for mgmt [cookbooks] - 10https://gerrit.wikimedia.org/r/719060 (https://phabricator.wikimedia.org/T290326) (owner: 10Volans) [14:15:43] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 
15) is CRITICAL: Test retrieve selected events on January 15 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read article [14:15:43] nuary 1, 2016 (with aggregated=true) returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [14:17:23] !log ladsgroup@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:715492|Set permission of creating short url to everyone everywhere (T267921 T267925)]], Part I (duration: 00m 59s) [14:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:29] T267925: Allow displaying URL shortener link in sidebar for foreign wiki - https://phabricator.wikimedia.org/T267925 [14:17:30] T267921: Roll out the Toolbox link for URL Shortener in Wikimedia sites - https://phabricator.wikimedia.org/T267921 [14:17:52] (03PS2) 10Muehlenhoff: Update puppetised java.security file for Java 11.0.12 [puppet] - 10https://gerrit.wikimedia.org/r/719064 [14:19:13] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:715492|Set permission of creating short url to everyone everywhere (T267921 T267925)]], Part II (duration: 00m 57s) [14:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:27] (03PS2) 10MVernon: pc1007: remove puppet entries for pc1007 [puppet] - 10https://gerrit.wikimedia.org/r/719120 (https://phabricator.wikimedia.org/T289118) [14:22:37] !log volans@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc1027.eqiad.wmnet [14:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:05] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/719120 (https://phabricator.wikimedia.org/T289118) (owner: 10MVernon) [14:23:26] (03PS2) 10Jbond: wmflib: puppet prometheus reporting 
[puppet] - 10https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) [14:24:10] (03CR) 10jerkins-bot: [V: 04-1] wmflib: puppet prometheus reporting [puppet] - 10https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) (owner: 10Jbond) [14:26:21] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:57] (03PS3) 10Jbond: wmflib: puppet prometheus reporting [puppet] - 10https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) [14:31:49] (03CR) 10Kormat: [C: 03+1] pc1007: remove puppet entries for pc1007 [puppet] - 10https://gerrit.wikimedia.org/r/719120 (https://phabricator.wikimedia.org/T289118) (owner: 10MVernon) [14:32:25] (03CR) 10MVernon: [C: 03+2] pc1007: remove puppet entries for pc1007 [puppet] - 10https://gerrit.wikimedia.org/r/719120 (https://phabricator.wikimedia.org/T289118) (owner: 10MVernon) [14:32:40] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/719064 (owner: 10Muehlenhoff) [14:33:06] (03PS2) 10Jbond: puppet_agent_stats: add catalog version to prom metricts [puppet] - 10https://gerrit.wikimedia.org/r/719056 (https://phabricator.wikimedia.org/T283585) [14:35:44] !log mvernon@cumin1001 START - Cookbook sre.hosts.decommission for hosts pc1007.eqiad.wmnet [14:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:17] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mc1027.eqiad.wmnet [14:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:23] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by volans@cumin1001 
for hosts: `mc1027.eqiad.wmnet` - mc1027.eqiad.wmnet (... [14:36:47] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Overwrote access key - https://phabricator.wikimedia.org/T290279 (10ChristineDeKock) Hello, Thank you for your help on this. I am still being asked for a password when I connect using "ssh stat1008.eqiad.wmnet". This was not the case before and I'm not su... [14:37:11] (03PS4) 10Filippo Giunchedi: clinic-duty: add equinix maint support [software] - 10https://gerrit.wikimedia.org/r/717100 [14:37:58] (03CR) 10Filippo Giunchedi: clinic-duty: add equinix maint support (031 comment) [software] - 10https://gerrit.wikimedia.org/r/717100 (owner: 10Filippo Giunchedi) [14:38:07] (03CR) 10Volans: "no blockers from me :)" [puppet] - 10https://gerrit.wikimedia.org/r/377721 (owner: 10Muehlenhoff) [14:43:50] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Replace christinedk old ssh public key with a new one - https://phabricator.wikimedia.org/T290279 (10fgiunchedi) [14:44:06] any kind volunteer to +1 https://gerrit.wikimedia.org/r/c/operations/puppet/+/719053/ ? [14:44:36] !log volans@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc1026.eqiad.wmnet [14:44:37] (03CR) 10Kormat: [C: 03+1] admin: replace christinedk key [puppet] - 10https://gerrit.wikimedia.org/r/719053 (https://phabricator.wikimedia.org/T290279) (owner: 10Filippo Giunchedi) [14:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:46] godog: nope, just me! ;) [14:45:02] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Replace christinedk old ssh public key with a new one - https://phabricator.wikimedia.org/T290279 (10fgiunchedi) >>! In T290279#7334542, @ChristineDeKock wrote: > Hello, > > Thank you for your help on this. > > I am still being asked for a password when I... 
[14:45:04] lolz, thank you kormat <3 <3 [14:45:13] !log volans@cumin1001 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97) for hosts mc1026.eqiad.wmnet [14:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:19] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: replace christinedk key [puppet] - 10https://gerrit.wikimedia.org/r/719053 (https://phabricator.wikimedia.org/T290279) (owner: 10Filippo Giunchedi) [14:45:39] 10SRE, 10DBA, 10Traffic, 10User-Ladsgroup, 10Wikimedia-Incident: 2021-09-04 enwiki was down at 10:44 (UTC) - https://phabricator.wikimedia.org/T290379 (10ayounsi) fyi there was a brief TCP/ACK DDoS toward esams at that time. [14:45:41] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pc1007.eqiad.wmnet [14:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:46] Emperor: merging your change too [14:45:56] godog: could you wait? [14:46:00] yes [14:46:06] phew, thanks [14:46:22] we're accidentally testing a corner-case in the decom workflow [14:46:37] hah! "huzzah" ? [14:46:39] godog: ok all clear, please proceed 💜 [14:46:46] thank you [14:49:59] !log removing pc1007 from tendril and zarcillo T289118 [14:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:03] T289118: decommission pc1007.eqiad.wmnet. - https://phabricator.wikimedia.org/T289118 [14:51:28] (03PS1) 10Kormat: Revert "db2118: Disable notifications during reimage." [puppet] - 10https://gerrit.wikimedia.org/r/719080 [14:52:11] (03CR) 10Kormat: [C: 03+2] Revert "db2118: Disable notifications during reimage." 
[puppet] - 10https://gerrit.wikimedia.org/r/719080 (owner: 10Kormat) [14:53:41] !log kormat@cumin1001 dbctl commit (dc=all): 'db2118 (re)pooling @ 25%: reimage to buster T288244', diff saved to https://phabricator.wikimedia.org/P17226 and previous config saved to /var/cache/conftool/dbconfig/20210906-145341-kormat.json [14:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:46] T288244: Upgrade s7 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T288244 [14:57:08] (03PS1) 10Filippo Giunchedi: o11y: add udp receive errors for statsd [alerts] - 10https://gerrit.wikimedia.org/r/719123 (https://phabricator.wikimedia.org/T288726) [14:57:43] (03PS1) 10Filippo Giunchedi: statsd: remove statsd_udp_inbound_errors [puppet] - 10https://gerrit.wikimedia.org/r/719124 (https://phabricator.wikimedia.org/T288726) [15:00:06] !log jiji@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:33] !log removing pc1007 from orchestrator T289118 [15:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:37] T289118: decommission pc1007.eqiad.wmnet. - https://phabricator.wikimedia.org/T289118 [15:04:15] (03PS1) 10Filippo Giunchedi: prometheus: add ThanosSidecarUploadFailure to prometheus/ops [puppet] - 10https://gerrit.wikimedia.org/r/719126 (https://phabricator.wikimedia.org/T288726) [15:04:40] 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission pc1007.eqiad.wmnet. - https://phabricator.wikimedia.org/T289118 (10MatthewVernon) [15:04:47] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Replace christinedk old ssh public key with a new one - https://phabricator.wikimedia.org/T290279 (10fgiunchedi) >>! In T290279#7334542, @ChristineDeKock wrote: > Hello, > > Thank you for your help on this. 
> > I am still being asked for a password when I... [15:07:26] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:10:13] (03PS1) 10Elukey: Add revscoring-editquality as first ml-service to helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) [15:10:18] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:11:18] (03CR) 10Filippo Giunchedi: [C: 03+1] "Not super familiar with the exact implications but LGTM" [debs/rsyslog] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/715227 (https://phabricator.wikimedia.org/T289766) (owner: 10JMeybohm) [15:15:14] (03CR) 10Filippo Giunchedi: "+1 to the idea, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/719056 (https://phabricator.wikimedia.org/T283585) (owner: 10Jbond) [15:15:44] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:27] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: route alertmanager logs to alerts index [puppet] - 10https://gerrit.wikimedia.org/r/717442 (https://phabricator.wikimedia.org/T289356) (owner: 10Cwhite) [15:20:06] (03PS2) 10Elukey: Add revscoring-editquality as first ml-service to helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) [15:21:37] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [15:24:23] (03CR) 10Elukey: "Hello folks, this is a proposal for the ml-services dir." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [15:24:42] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpecte [15:24:42] 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [15:27:46] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:28:13] mmm nothing weird that I can see from [15:28:14] https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=wikifeeds&var-destination=All&from=now-6h&to=now [15:29:50] from the pod logs [15:29:51] {"name":"wikifeeds","hostname":"wikifeeds-production-6db5957576-5ggcv","pid":18,"level":"WARN","err":{"message":"Page or revision not found.","name":"wikifeeds","stack":"HTTPError: Page or revision not found.\n [15:36:40] jayme, jelto --^ if you are around [15:38:01] elukey: in a meeting currently [15:38:04] both of us [15:39:46] ah okok [15:42:42] https://en.wikipedia.org/api/rest_v1/#/Feed/aggregatedFeed works fine, so it may be a monitoring hiccup ? 
[15:43:55] or maybe that featured article is now gone [15:43:56] 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2021), 10Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (10sgrabarczuk) [15:44:41] ACKNOWLEDGEMENT - Maps - OSM synchronization lag - codfw on alert1001 is CRITICAL: 1.525e+06 ge 2.592e+05 Hnowlan Possibly erroneous post imposm migration - under investigation https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1 [15:44:41] ACKNOWLEDGEMENT - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 4.67e+05 ge 2.592e+05 Hnowlan Possibly erroneous post imposm migration - under investigation https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [15:44:41] ACKNOWLEDGEMENT - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] Hnowlan Possibly erroneous post imposm migration - under investigation https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [15:45:42] (03PS4) 10Jbond: wmflib: puppet prometheus reporting [puppet] - 10https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) [15:46:23] (03CR) 10jerkins-bot: [V: 04-1] wmflib: puppet prometheus reporting [puppet] - 10https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) (owner: 10Jbond) [15:47:38] no it looks working from the API UI [15:47:39] (03CR) 10Jbond: wmflib: puppet prometheus reporting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) (owner: 10Jbond) [15:50:34] ok I see, wikifeeds-production-tls-proxy is reporting all 503s [15:51:16] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: 
/en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:52:10] looking at ^ in case it's related [15:53:06] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:58:35] (03PS5) 10Jbond: wmflib: puppet prometheus reporting [puppet] - 10https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) [15:59:18] (03CR) 10jerkins-bot: [V: 04-1] wmflib: puppet prometheus reporting [puppet] - 10https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) (owner: 10Jbond) [16:00:31] jouncebot: now [16:00:32] No deployments scheduled for the next 0 hour(s) and 59 minute(s) [16:02:22] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:05:31] hnowlan: if you have a moment to brainbounce [16:05:43] so I see that the tls proxy logs on the wikifeeds pods are reporting 503 [16:06:08] the container logs for the app looks like it logs 404s related to revision not found [16:06:43] I'm around now (at least for a bit...need to leave soon unfortunately) [16:06:48] hello :) [16:07:02] the metrics for wikifeeds don't indicate anything weird that I can see [16:07:36] elukey: looking (although I think this is the first time I've looked at the service :D) [16:07:59] I've pinged PI to see if they have any insight, wikitech says they're maintainers [16:08:53] thanks! 
[16:08:59] first time for me too :D [16:10:05] restbase is also getting 503s on the envoy service unsurprisingly which probably triggered the above flap [16:10:09] (03PS1) 10Volans: netbox: add getter for the asset tag mgmt FQDN [software/spicerack] - 10https://gerrit.wikimedia.org/r/719131 [16:10:54] er wikifeeds envoy proxy that is [16:12:18] in the container app logs I see some weird stacktraces about HTTP 404s from the mw api (or at least, afaics) [16:12:26] all for service checker, mentioning /en.wikipedia.org/v1/page/most-read/2016/01/01 [16:12:41] I need to run, sorry :/ back in 2 hours potentially... [16:13:31] (no perticular wikifeeds knowledge is leaving with me, though) [16:14:24] !log Deployed patch for T290394 [16:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:05] probably a red herring but something for follow up later- there's been some evictions of the service pods [16:16:11] those 404s look somewhat valid, weirdly [16:16:39] as in they're for pages that don't exist (I think?) [16:17:30] it happened in the past that some test pages for service checker disappeared or changed [16:17:35] and the tests went awol [16:21:49] hnowlan: what is the URI/URL that you see getting a 503 from the restbase side? [16:22:01] /en.wikipedia.org/v1/feed/announcements [16:22:38] which currently seems fine though [16:22:43] ahahahah [16:22:48] yeah very weird [16:23:11] there's in fact only a few every few minutes in the logs now that I look closer [16:23:58] restbase error trends themselves look normal-ish [16:24:10] I checked and codfw is the dc pooled [16:24:13] for wikifeeds [16:25:19] also https://en.wikipedia.org/api/rest_v1/#/Feed/get_feed_announcements works fine [16:25:26] (I mean it returns 200) [16:26:07] hnowlan: what is the external URL to query /en.wikipedia.org/v1/page/most-read/2016/01/01 ? 
[16:26:36] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:32] (03PS1) 10Volans: sre.hosts.decommission: use asset tag if mgmt fails [cookbooks] - 10https://gerrit.wikimedia.org/r/719135 [16:29:43] (03CR) 10Volans: "This goes with I43ff5470e513962e4da4c472aa287ecf09f38de7" [software/spicerack] - 10https://gerrit.wikimedia.org/r/719131 (owner: 10Volans) [16:33:15] 10SRE-tools, 10Infrastructure-Foundations, 10Goal, 10Patch-For-Review: Spin off common Spicerack modules into a standalone Python library importable anywhere - https://phabricator.wikimedia.org/T257905 (10Volans) 05Open→03Resolved And with the last patch merged that finally makes use of the wmflib vers... [16:34:50] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10jijiki) @Cmjohnson Can we please unrack mc1026 last? We are trying fix a bug in the decom process. Thank you! [16:36:17] elukey: not entirely sure, I wasn't aware we have an external most read feed [16:36:36] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:36:37] https://wikitech-static.wikimedia.org/wiki/Wikifeeds#Architecture [16:36:41] hnowlan: ah okok I was wondering the same,we don't have [16:36:59] seems featured uses it to make up the full thing? 
[16:37:40] hnowlan: I tried several queries and they all look ok (I see for example mostread in https://en.wikipedia.org/api/rest_v1/feed/featured/2016/01/01) [16:37:53] it seems that service checker is not agreeing [16:38:22] I see a [16:38:23] wikifeeds-production-service-checker 0/1 Completed 0 139d [16:38:30] but no idea if it is meant to be in that state or not [16:40:03] ahhh that was eqiad [16:40:25] yeah now I see the evicted pods hnowlan [16:40:29] ooof [16:40:58] not many recent enough that they'd be triggering this though I guess [16:46:00] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:46:46] I'm getting intermittent failures on hitting pages like /fr.wikipedia.org/v1/page/random/title [16:47:28] across languages [16:49:15] (03CR) 10Volans: "This goes with the depends-on related patch (see commit message)" [cookbooks] - 10https://gerrit.wikimedia.org/r/719135 (owner: 10Volans) [16:50:17] https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=codfw%20prometheus%2Fk8s&var-namespace=wikifeeds&var-pod=All&from=now-7d&to=now [16:50:28] mmm the pods look throttled in CPU, am I reading it wrong? [16:51:31] the last change was https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/715452 [16:53:36] I see a lot of pods evicted in other namespaces [16:53:43] hmm yeah they do look throttled [16:54:16] akosiaris: around? [16:54:20] I'm wondering if we should be unscientific and just roll restart to see if anything changes [16:54:49] that change would surface issues sooner than a few days I feel, if it was going to break anything [16:55:02] guessing entirely though on that one :) [16:56:18] do we know why the pods are being evicted? 
[16:56:26] I would guess disk pressure [16:57:03] on kubernetes2011 I see stuff like [16:57:03] [Mon Sep 6 16:42:32 2021] Memory cgroup out of memory: Kill process 32424 (node) score 1886 or sacrifice child [16:57:06] [Mon Sep 6 16:42:32 2021] Killed process 32424 (node) total-vm:1209324kB, anon-rss:360000kB, file-rss:0kB, shmem-rss:0kB [16:57:11] "Pod The node had condition: [DiskPressure]" [16:57:24] ah! Where did you find it? [16:57:48] `kubectl describe pod wikifeeds-production-6db5957576-58m6c` [16:58:16] very easy nice [17:00:05] ryankemper: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210906T1700). [17:00:23] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/717251 (owner: 10Volans) [17:00:29] hnowlan: IIUC this is the kubelet complaining about disk space? [17:01:22] yeah, pods get recreated when disk space gets limited on the kubernetes node - doesn't necessarily relate to the service in question, just that *some* service locally caused the host to come close to running out of space [17:01:42] hnowlan: but in its root partition or elsewhere? [17:01:49] in this case I think it's shellbox [17:01:55] root partition itseems like [17:02:04] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:23] hnowlan: shellbox? 
[17:06:24] elukey: it's a service for command execution in mw, just currently seems very busy in the logs [17:06:56] hnowlan: yeah I kinda know what it does, but I am wondering if there is any log or trace that points to it [17:07:00] maybe in the kubelet [17:08:26] ah I'm only basing my theories on the fact that it has a big log on kubernetes2017 for example (/var/lib/docker/containers/77177ce5be4b1147c5578e567d5a60d6f7fab9d3a3c4878c2fba08a6d30b303c/77177ce5be4b1147c5578e567d5a60d6f7fab9d3a3c4878c2fba08a6d30b303c-json.log is currently 13GB and very busy) [17:08:37] I don't think the evicted containers would be causing the feeds issue though [17:09:18] the throttling is cause for concern though [17:10:59] yes I agree, even if in theory we should have an alert for evictions [17:11:05] +1 [17:11:38] That said, the throttling doesn't look like it's increased any time around this alert [17:12:41] also it seems that wikifeeds works, and service checker is thinking otherwise [17:13:02] same goes for https://grafana.wikimedia.org/d/lxZAdAdMk/wikifeeds?orgId=1&from=now-6h&to=now [17:13:25] do we know where service checker runs? 
[17:13:32] maybe we could end up reproing the 403 [17:13:35] err 503 [17:14:40] the other thing that I don't explain is all the 503s recorded by the tls proxy container [17:15:43] I'm consistently getting a 503 every few requests on doing curl https://wikifeeds.discovery.wmnet:4101/en.wikipedia.org/v1/page/random/title [17:17:25] ok so some requests get through, others get the 503 [17:17:54] gonna have a look at the wikifeeds source [17:19:10] (03PS5) 10Volans: ipmi: add status and reboot capabilities [software/spicerack] - 10https://gerrit.wikimedia.org/r/717251 [17:21:06] hnowlan: if you want to also attempt a roll restart it is fine for me [17:24:22] (03CR) 10jerkins-bot: [V: 04-1] ipmi: add status and reboot capabilities [software/spicerack] - 10https://gerrit.wikimedia.org/r/717251 (owner: 10Volans) [17:26:30] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:27:18] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [17:27:22] what [17:28:36] it must be service checker being pleased, but I still see the 503s [17:29:08] yep, still seeing them too [17:29:22] my main realisation from reading the source is that I cannot javascript [17:30:45] I wonder if these have been in the background for a while and have been hidden by not hitting some kind of threshold and/or some kind of retry logic somewhere? 
[17:31:29] it could be something that happens and we just got the alerts for some reason [17:31:38] (that have been happening) [17:32:56] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featu [17:32:56] e data for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [17:33:15] That wasn't a long recovery [17:36:40] PROBLEM - HP RAID on ms-be1051 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 3I:3:1, 3I:3:2, 3I:3:3, 4I:5:1, 4I:5:2 - Failed: 3I:3:4 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:36:43] ACKNOWLEDGEMENT - HP RAID on ms-be1051 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 3I:3:1, 3I:3:2, 3I:3:3, 4I:5:1, 4I:5:2 - Failed: 3I:3:4 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T290442 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:36:47] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1051 - https://phabricator.wikimedia.org/T290442 (10ops-monitoring-bot) [17:38:28] elukey: those 503s in the 
wikifeeds-production-tls-proxy pod are requests to restbase, localhost:6503 is restbase-for-services [17:39:34] hnowlan: I was wondering if it was the TLS terminator for the service or a tls proxy for other ones [17:40:13] hnowlan: so those 503s are wikifeeds calling restbase that calls something else returning 503 [17:40:16] ? [17:40:43] possibly - not sure about the 'something else" part yet [17:41:02] restbase is also seeing errors because the request to wikifeeds goes through restbase 🙃 [17:42:51] this issue is getting better by the minute :D [17:43:07] https://logstash.wikimedia.org/app/dashboards#/view/4802b890-e977-11eb-863c-3588009e4dd9?_g=h@03cd538&_a=h@b683f68 is showing a sustained rate of timeouts [17:43:58] and it started on the 4th [17:45:20] hnowlan: https://logstash.wikimedia.org/goto/d421b9862c86fb6a79e3de41f73f5157 [17:46:05] nothing on the SAL [17:46:07] super weird that they have a hard start at 00:00, is this a retention thing? [17:47:35] if you zoom in it seems that they started around 02:30 UTC [17:48:00] and UA seems to be WikipediaApp [17:48:36] so likely another issue? :D [17:50:18] that deploy to wikifeeds happened on the 2nd but later in the day [17:50:27] at 14:35 [17:52:12] by irc it seems the service has flapped a few times since the 4th [17:52:25] this may need an handover to folks in US-like timezones, it is getting a little late in here (and I guess for you too :) [17:52:40] yeah, today's a public holiday in the US though [17:52:51] snap this is a problem [17:58:53] 10SRE, 10serviceops: Pods in evicted state for various namespaces in k8s main - https://phabricator.wikimedia.org/T290444 (10elukey) p:05Triage→03High [17:59:02] first one --^ [18:00:05] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210906T1800). [18:00:05] No Gerrit patches in the queue for this window AFAICS. 
[18:00:51] 10SRE, 10serviceops: Pods in evicted state for various namespaces in k8s main - https://phabricator.wikimedia.org/T290444 (10hnowlan) My running theory on this is that shellbox is currently generating a lot of logs (dozens of lines a second) - the file is 12GB on kubernetes2017 atm but could easily be other se... [18:00:58] elukey: thanks for filing [18:01:15] hnowlan: creating another one for wikifeeds [18:02:09] (03PS6) 10Volans: ipmi: add status and reboot capabilities [software/spicerack] - 10https://gerrit.wikimedia.org/r/717251 [18:02:11] (03PS1) 10Volans: prospector: disable E203 for pep-8 over black [software/spicerack] - 10https://gerrit.wikimedia.org/r/719140 [18:02:13] (03PS1) 10Volans: style: if no local modifications check last commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/719141 [18:02:21] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:09:08] 10SRE, 10Wikifeeds, 10serviceops: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (10elukey) [18:09:12] hnowlan: second one --^ [18:09:37] back [18:10:07] jayme: just opened a couple of task with what we found [18:10:09] * jayme reading scrollback [18:12:20] 10SRE, 10Wikifeeds, 10serviceops: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (10hnowlan) [18:13:38] jayme: summary in https://phabricator.wikimedia.org/T290444 and https://phabricator.wikimedia.org/T290445 [18:13:51] the first one is a little weird, lemme know what you think about it [18:17:04] the first one "should" not be a problem (as long as it does not hit too many replicas at once, though) [18:17:42] 10SRE, 10DBA, 10Traffic, 10User-Ladsgroup, 10Wikimedia-Incident: 2021-09-04 enwiki was down at 10:44 (UTC) - https://phabricator.wikimedia.org/T290379 (10AlexisJazz) >>! 
In T290379#7334556, @ayounsi wrote: > fyi there was a brief TCP/ACK DDoS toward esams at that time. I connect to esams. [18:17:49] gotta go afk to cook but will be back on in a little bit [18:19:17] jayme: there is also the throttling for wikifeeds happening (at least this is what I read from the dashboard) [18:19:59] elukey: did you already check when those pods where evicted? [18:21:39] jayme: they seem all due to disk-related limits, I haven't found a correlation with a specific even [18:21:42] *event [18:22:28] elukey: the throttling I see around 2021-09-04 to 2021-09-05 but not that much today [18:23:54] jayme: then I am reading the graphs in the wrong way, the "Throttled" red spiky line that I see in the kubernetes pod details looks constant [18:24:39] it is minus something but I thought it was a different way to display it, if it is indicating something else then I am not reading it well [18:25:01] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:29:44] going to dinner, will check later :) [18:30:25] (03CR) 10Muehlenhoff: profile::sysctl: add ability to control ip_forward: (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715217 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [18:30:37] elukey: I'll take a closer look. Maybe send a grafana link if you have at hand [18:31:03] jayme: https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&from=now-2d&to=now&var-datasource=codfw%20prometheus%2Fk8s&var-namespace=wikifeeds&var-pod=All [18:32:18] elukey: thanks. That I was looking at but it was only a couple of ms throttling since yesterday [18:32:56] couple of hundred .. 
but not that much compared to its limit of 4 CPUs [18:35:03] 10SRE, 10serviceops: Pods in evicted state for various namespaces in k8s main - https://phabricator.wikimedia.org/T290444 (10JMeybohm) Evictions actually happened this morning: ` # kubectl -n wikifeeds get po --field-selector=status.phase=Failed -o custom-columns="NAME:.metadata.name,STATUS:.status.reason,TIME... [18:58:13] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [19:00:20] jayme: not idea if the throttling (even if minimal) can cause some timeouts, if so it may explain the behavior that Hugh observed about intermittent failure when querying the discovery record [19:02:49] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:02:51] anyway, very late and it seems that this can wait tomorrow, enjoy your evening :) [19:03:55] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [19:21:03] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [19:25:33] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:26:47] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) is CRITICAL: Test retrieve selected events on January 15 returned the unexpected 
status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [19:34:21] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [19:36:21] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:40:01] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) is CRITICAL: Test retrieve selected events on January 15 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read article [19:40:01] nuary 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [19:40:22] * jayme back from crashing my laptop [19:41:23] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [19:43:11] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [19:47:07] (03CR) 10Krinkle: [C: 03+1] clinic-duty: add equinix maint support [software] - 10https://gerrit.wikimedia.org/r/717100 (owner: 10Filippo Giunchedi) [19:51:37] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [20:00:04] chrisalbon and accraze: How many deployers does it take to do Services – Graphoid / ORES deploy? 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210906T2000). [20:01:27] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:05:44] 10SRE, 10Wikifeeds, 10serviceops: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (10JMeybohm) https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?viewPanel=17&orgId=1&from=now-7d&to=now&var-datasource=thanos&var-site=codfw&var-prometheus... [20:08:01] RECOVERY - Maps - OSM synchronization lag - eqiad on alert1001 is OK: (C)2.592e+05 ge (W)1.764e+05 ge 1.733e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [20:12:17] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [20:17:59] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [20:19:55] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [20:25:35] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [20:26:07] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:27:29] RECOVERY - 
wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [20:33:09] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL [20:33:09] etrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [20:37:11] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:43:18] 10SRE, 10Wikifeeds, 10serviceops: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (10JMeybohm) In logstash there is a huge amount of 503/504 upstream errors reported by wikifeeds (the app, not tls-proxy) (https://logstash.wikimedia.org/goto/6f2cd8f9fe... [20:50:15] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [20:54:27] 10SRE, 10Continuous-Integration-Infrastructure, 10MW-1.37-notes (1.37.0-wmf.15; 2021-07-19), 10Patch-For-Review, 10Release-Engineering-Team (Yak Shaving 🐃🪒): Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (10Tgr) >>! In T209149#7333699, @kost... [21:00:05] Reedy and sbassett: #bothumor When your hammer is PHP, everything starts looking like a thumb. 
Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210906T2100). [21:01:41] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [21:02:07] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:11:09] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [21:16:53] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 503 (expecting: [21:16:53] tps://wikitech.wikimedia.org/wiki/Wikifeeds [21:20:43] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [21:22:12] 10SRE, 10Wikifeeds, 10serviceops: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (10JMeybohm) I'm pretty tired already, but I kind of feel stuck at the point of wikifeeds envoy keep failing with UF **to restbase** (if I'm not reading this wrong) and... 
[21:26:25] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [21:26:45] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:28:21] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [21:31:41] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /_info (retrieve service info) timed out before a response was received: /api (bad URL) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [21:33:29] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [21:34:01] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [22:02:25] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [22:02:39] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:08:09] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve 
the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retr [22:08:09] most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [22:10:03] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [22:15:45] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL [22:15:45] etrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [22:17:37] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [22:25:25] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:31:23] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:34:49] PROBLEM - wikifeeds codfw on 
wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [22:48:05] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [22:59:37] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articl [22:59:37] anuary 1, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [23:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210906T2300). [23:00:05] Tran: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:01:11] 👋 [23:01:31] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:07:21] Is anyone available for a backport? The patch was a last minute addition to the schedule. Sorry! 🙏 Maybe @thcipriani? 
[23:18:35] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [23:24:19] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [23:26:09] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:30:03] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [23:35:47] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articl [23:35:47] anuary 1, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [23:41:31] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [23:41:58] (03PS1) 10Tim Starling: Fix display issues when numbers are above 1000 or small [extensions/SecurePoll] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/719166 (https://phabricator.wikimedia.org/T290000) [23:42:13] 
(03CR) 10Tim Starling: [C: 03+2] Fix display issues when numbers are above 1000 or small [extensions/SecurePoll] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/719166 (https://phabricator.wikimedia.org/T290000) (owner: 10Tim Starling) [23:46:54] (03Merged) 10jenkins-bot: Fix display issues when numbers are above 1000 or small [extensions/SecurePoll] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/719166 (https://phabricator.wikimedia.org/T290000) (owner: 10Tim Starling) [23:47:15] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrie [23:47:15] ost-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [23:52:02] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.21/extensions/SecurePoll/includes/Talliers/STVTallier.php: T290000 (duration: 00m 58s) [23:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:09] T290000: SecurePoll: Tally page display issues when numbers are above 1000 - https://phabricator.wikimedia.org/T290000 [23:52:27] Thank you! 🙇‍♂️ [23:58:41] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds