[00:07:47] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:09:41] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:10:23] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:11:17] PROBLEM - very high load average likely xfs on ms-be2055 is CRITICAL: CRITICAL - load average: 111.05, 101.42, 76.15 https://wikitech.wikimedia.org/wiki/Swift [00:24:41] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:25:53] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:26:39] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:43:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [00:51:45] RECOVERY - very high load average likely xfs on ms-be2055 is OK: OK - load average: 58.51, 65.63, 77.75 https://wikitech.wikimedia.org/wiki/Swift [01:30:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - 
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [01:40:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:50:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:55:49] RECOVERY - MegaRAID on an-worker1081 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:29:35] PROBLEM - MegaRAID on an-worker1081 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:11:45] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:19:41] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:20:45] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:27:23] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.073 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:51:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [04:07:39] PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:10:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [04:10:55] RECOVERY - MegaRAID on an-worker1081 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:14:45] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:43:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [04:44:43] PROBLEM - MegaRAID on an-worker1081 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:57:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [05:02:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [05:02:13] RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:02:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [05:07:31] 
(BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [05:10:59] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:21:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [05:30:35] PROBLEM - SSH on analytics1061.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:31:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [05:33:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [06:03:01] 
(BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [06:17:59] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:31:37] RECOVERY - SSH on analytics1061.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:33:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [06:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:39:03] SRE, Infrastructure-Foundations, Prod-Kubernetes, netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (ayounsi) > That might have unwanted implications in case of power or network issues on one row. That's fine, we're moving from a... 
[06:48:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [06:58:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:00:05] Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220516T0700). [07:00:05] koi: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:03:43] PROBLEM - Persistent high iowait on labstore1006 is CRITICAL: 66.85 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006-1007 [07:05:19] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:07:17] RECOVERY - Check systemd state on ml-serve-ctrl1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:13:01] SRE, Infrastructure-Foundations, netops: Upgrade Fastnetmon to 1.2.1 - https://phabricator.wikimedia.org/T271228 (ayounsi) [07:13:23] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:13:29] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:14:30] koi: i see no one picked deploying up yet. are you around? 
[07:16:13] (CR) Ayounsi: [C: +2] Revert "drmrs: add Init7 TE communities" [homer/public] - https://gerrit.wikimedia.org/r/791490 (owner: Ayounsi) [07:16:21] (CR) Ayounsi: [C: +2] Revert "drmrs: add Init7 transit" [homer/public] - https://gerrit.wikimedia.org/r/791489 (owner: Ayounsi) [07:18:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:18:12] !log restarting blazegraph on wdqs1007 (BlazegraphFreeAllocatorsDecreasingRapidly) [07:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:21:03] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [07:23:13] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [07:25:44] urbanecm yes I'm here now, sorry for the delay [07:26:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:27:13] (PS10) Sergio Gimeno: GrowthExperiments: Update campaigns configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/790650 (https://phabricator.wikimedia.org/T305443) [07:29:37] koi: no problem. let's go ahead if you're still around. [07:29:45] ok [07:29:50] (Abandoned) Sergio Gimeno: Account creation: enable thankyoupage campaign [mediawiki-config] - https://gerrit.wikimedia.org/r/791007 (https://phabricator.wikimedia.org/T305659) (owner: Sergio Gimeno) [07:31:01] (PS1) Slyngshede: Move query service cronjobs to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/792104 (https://phabricator.wikimedia.org/T273673) [07:32:17] (CR) jerkins-bot: [V: -1] Move query service cronjobs to systemd timers. 
[puppet] - https://gerrit.wikimedia.org/r/792104 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [07:32:24] (CR) Urbanecm: [C: +2] zhwikisource: Add NS100 to wgNamespacesToBeSearchedDefault [mediawiki-config] - https://gerrit.wikimedia.org/r/791768 (https://phabricator.wikimedia.org/T308393) (owner: Stang) [07:33:11] (Merged) jenkins-bot: zhwikisource: Add NS100 to wgNamespacesToBeSearchedDefault [mediawiki-config] - https://gerrit.wikimedia.org/r/791768 (https://phabricator.wikimedia.org/T308393) (owner: Stang) [07:33:43] koi: please test https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/791768 at mwdebug1001 [07:33:54] looking [07:34:09] (PS2) Slyngshede: Move query service cronjobs to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/792104 (https://phabricator.wikimedia.org/T273673) [07:34:43] (CR) jerkins-bot: [V: -1] Move query service cronjobs to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/792104 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [07:34:44] lgtm [07:35:19] syncing [07:36:07] (PS3) Urbanecm: thwikibooks: Add more namespaces to wgNamespacesToBeSearchedDefault [mediawiki-config] - https://gerrit.wikimedia.org/r/791749 (https://phabricator.wikimedia.org/T308373) (owner: Stang) [07:36:10] (CR) Urbanecm: [C: +2] thwikibooks: Add more namespaces to wgNamespacesToBeSearchedDefault [mediawiki-config] - https://gerrit.wikimedia.org/r/791749 (https://phabricator.wikimedia.org/T308373) (owner: Stang) [07:36:42] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 67ce6ce: zhwikisource: Add NS100 to wgNamespacesToBeSearchedDefault (T308393) (duration: 00m 50s) [07:36:46] koi: and, synced [07:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:48] T308393: Add Portal namespace to $wgNamespacesToBeSearchedDefault on zhwikisource - https://phabricator.wikimedia.org/T308393 [07:38:37] 
!log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:38:39] (Merged) jenkins-bot: thwikibooks: Add more namespaces to wgNamespacesToBeSearchedDefault [mediawiki-config] - https://gerrit.wikimedia.org/r/791749 (https://phabricator.wikimedia.org/T308373) (owner: Stang) [07:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:18] koi: the other patch is at mwdebug1001, can you check? [07:39:20] (PS3) Slyngshede: Move query service cronjobs to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/792104 (https://phabricator.wikimedia.org/T273673) [07:39:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:39:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:44] (PS2) Urbanecm: thwikibooks: Enable quiz extension [mediawiki-config] - https://gerrit.wikimedia.org/r/791718 (https://phabricator.wikimedia.org/T308377) (owner: Stang) [07:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:47] (CR) Urbanecm: [C: +2] thwikibooks: Enable quiz extension [mediawiki-config] - https://gerrit.wikimedia.org/r/791718 (https://phabricator.wikimedia.org/T308377) (owner: Stang) [07:39:55] (CR) jerkins-bot: [V: -1] Move query service cronjobs to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/792104 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [07:40:04] (PS11) Sergio Gimeno: GrowthExperiments: Update campaigns configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/790650 (https://phabricator.wikimedia.org/T305443) [07:40:22] checked and looks great [07:40:29] (PS4) Slyngshede: Move query service cronjobs to systemd timers. 
[puppet] - https://gerrit.wikimedia.org/r/792104 (https://phabricator.wikimedia.org/T273673) [07:40:37] syncing, thanks [07:40:40] (Merged) jenkins-bot: thwikibooks: Enable quiz extension [mediawiki-config] - https://gerrit.wikimedia.org/r/791718 (https://phabricator.wikimedia.org/T308377) (owner: Stang) [07:41:00] urbanecm: Hello, is there room for one more config change in the current window? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/790650 [07:41:04] (CR) jerkins-bot: [V: -1] Move query service cronjobs to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/792104 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [07:41:26] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 3e04f86: thwikibooks: Add more namespaces to wgNamespacesToBeSearchedDefault (T308373) (duration: 00m 48s) [07:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:32] T308373: Add some namespaces to search results for thwikibooks - https://phabricator.wikimedia.org/T308373 [07:41:34] sergi0: 'morning, sure thing. can you add it to the wikitech calendar please? [07:42:00] koi: and, live. [07:42:16] koi: quiz extension patch is at mwdebug1001, can you check? [07:42:22] SRE, SRE-Access-Requests, Scap: Add new user identity to Keyholder for scap - https://phabricator.wikimedia.org/T307351 (jnuche) Resolved→Open Reopening to try to get some feedback. 
[07:42:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:57] checked, extension installed [07:43:03] (PS2) Urbanecm: ptwikinews: Enable extension MediaSearch [mediawiki-config] - https://gerrit.wikimedia.org/r/791752 (https://phabricator.wikimedia.org/T299872) (owner: Stang) [07:43:07] (CR) Urbanecm: [C: +2] ptwikinews: Enable extension MediaSearch [mediawiki-config] - https://gerrit.wikimedia.org/r/791752 (https://phabricator.wikimedia.org/T299872) (owner: Stang) [07:43:17] koi: thanks. lets sync then [07:43:24] urbanecm: done! [07:43:44] ok, will ping you once testable [07:44:04] (Merged) jenkins-bot: ptwikinews: Enable extension MediaSearch [mediawiki-config] - https://gerrit.wikimedia.org/r/791752 (https://phabricator.wikimedia.org/T299872) (owner: Stang) [07:44:16] thank you! sorry for the short notice. [07:44:23] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 57d4a9c: thwikibooks: Enable quiz extension (T308377) (duration: 00m 48s) [07:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:28] T308377: Enable the quiz extension on thwikibooks - https://phabricator.wikimedia.org/T308377 [07:44:40] sergi0: np, it happens [07:45:06] koi: and the mediasearch patch is at mwdebug1001 as well, can you check? [07:45:12] looking [07:45:15] (PS12) Urbanecm: GrowthExperiments: Update campaigns configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/790650 (https://phabricator.wikimedia.org/T305443) (owner: Sergio Gimeno) [07:45:31] (CR) Urbanecm: [C: +2] GrowthExperiments: Update campaigns configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/790650 (https://phabricator.wikimedia.org/T305443) (owner: Sergio Gimeno) [07:46:06] (PS5) Slyngshede: Move query service cronjobs to systemd timers. 
[puppet] - https://gerrit.wikimedia.org/r/792104 (https://phabricator.wikimedia.org/T273673) [07:46:16] (Merged) jenkins-bot: GrowthExperiments: Update campaigns configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/790650 (https://phabricator.wikimedia.org/T305443) (owner: Sergio Gimeno) [07:46:41] (CR) jerkins-bot: [V: -1] Move query service cronjobs to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/792104 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [07:47:26] looks good [07:47:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:45] thanks koi, syncing [07:48:04] (PS6) Slyngshede: Move query service cronjobs to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/792104 (https://phabricator.wikimedia.org/T273673) [07:48:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:48:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:38] (CR) jerkins-bot: [V: -1] Move query service cronjobs to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/792104 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [07:49:04] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: dc82dfa8: ptwikinews: Enable extension MediaSearch (T299872) (duration: 00m 48s) [07:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:09] T299872: enable MediaSearch extension on ptwikinews - https://phabricator.wikimedia.org/T299872 [07:49:09] koi: and live [07:49:22] sergi0: your patch is at mwdebug1001, can you check? 
[07:49:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:40] sure [07:51:17] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:51:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:52:15] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:52:49] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:52:52] urbanecm: looking good. Ty! [07:53:04] sergi0: thanks, syncing [07:53:21] RECOVERY - Persistent high iowait on labstore1006 is OK: (C)10 ge (W)5 ge 2.38 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006-1007 [07:53:50] (PS7) Slyngshede: Move query service cronjobs to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/792104 (https://phabricator.wikimedia.org/T273673) [07:54:21] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: e9a00e8: GrowthExperiments: Update campaigns configuration (T305443, T305659, T307521) (duration: 00m 50s) [07:54:28] sergi0: and, live. 
[07:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:29] T305659: Account creation: Thank You Page landing pages - https://phabricator.wikimedia.org/T305659 [07:54:29] T305443: Account creation: remove GLAM event ad-hoc code after 20th of April - https://phabricator.wikimedia.org/T305443 [07:54:29] T307521: Support templating for Growth campaign landing pages - https://phabricator.wikimedia.org/T307521 [07:54:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:43] urbanecm: Cool! Thank you :) [07:56:51] happy to help :) [07:58:03] (PS8) Slyngshede: Move query service cronjobs to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/792104 (https://phabricator.wikimedia.org/T273673) [07:58:07] !log UTC morning B&C window done [07:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:38] (CR) jerkins-bot: [V: -1] Move query service cronjobs to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/792104 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [07:58:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:58:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:14] ACKNOWLEDGEMENT - Router interfaces on cr3-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi https://phabricator.wikimedia.org/T307121 - The acknowledgement expires at: 2022-05-23 07:59:02. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:00:31] (PS9) Slyngshede: Move query service cronjobs to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/792104 (https://phabricator.wikimedia.org/T273673) [08:02:09] (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35266/console" [puppet] - https://gerrit.wikimedia.org/r/792104 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [08:02:19] (PS2) JMeybohm: Replace kubeyaml with kubeconform (if available) [deployment-charts] - https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165) [08:02:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:47] (CR) jerkins-bot: [V: -1] Replace kubeyaml with kubeconform (if available) [deployment-charts] - https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165) (owner: JMeybohm) [08:06:13] (CR) Ayounsi: [C: +1] cumin: use homer ssh config for lsw devices [puppet] - https://gerrit.wikimedia.org/r/791328 (owner: Volans) [08:07:10] (CR) Sergio Gimeno: [C: -1] "As suggested in slack (https://wikimedia.slack.com/archives/GVD7X37RB/p1652677200197739) the work here was squashed in I7f8401533c02f7c4c6" [mediawiki-config] - https://gerrit.wikimedia.org/r/791770 (https://phabricator.wikimedia.org/T307521) (owner: Gergő Tisza) [08:19:00] (PS6) Filippo Giunchedi: netops: move network routers/devices definitions to hiera [puppet] - https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) [08:19:02] (PS8) Filippo Giunchedi: netops: ping core routers from Prometheus [puppet] - https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) [08:23:01] (CR) Filippo Giunchedi: [V: 
03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35267/console" [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [08:23:43] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: use gitlab1003 as replia/passive host [puppet] - 10https://gerrit.wikimedia.org/r/791599 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [08:27:07] (03PS3) 10JMeybohm: Replace kubeyaml with kubeconform (if available) [deployment-charts] - 10https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165) [08:33:45] (JobUnavailable) firing: (2) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:36:15] ^thats new gitlab1003 hosts, taking a look [08:37:42] (03PS1) 10Slyngshede: Move Prometheus postgresql lag metric collector to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/792106 (https://phabricator.wikimedia.org/T273673) [08:38:03] (03PS1) 10Jelto: acme_chief: add new gitlab hosts to acl for gitlab certificate [puppet] - 10https://gerrit.wikimedia.org/r/792107 (https://phabricator.wikimedia.org/T307142) [08:38:16] (03CR) 10jerkins-bot: [V: 04-1] Move Prometheus postgresql lag metric collector to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/792106 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [08:38:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:39:58] (03PS2) 10Slyngshede: Move Prometheus postgresql lag metric collector to systemd timer [puppet] - 
10https://gerrit.wikimedia.org/r/792106 (https://phabricator.wikimedia.org/T273673) [08:43:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [08:44:26] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35268/console" [puppet] - 10https://gerrit.wikimedia.org/r/792106 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [08:49:36] (03CR) 10DCausse: [C: 03+1] elastic: remove decommissioned hosts in beta [puppet] - 10https://gerrit.wikimedia.org/r/791666 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [08:49:41] (03PS1) 10Jelto: aptrepo: import gitlab package for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/792108 (https://phabricator.wikimedia.org/T307142) [08:51:16] (03PS1) 10Slyngshede: Move restart of slapd, due to memory leaks, to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673) [08:53:07] (03CR) 10Jelto: "@Moritz can you take a look here? I'm not sure if it makes sense to import the same package for bullseye similar to gitlab-runner. 
I also " [puppet] - 10https://gerrit.wikimedia.org/r/792108 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [08:54:09] (03CR) 10Vgutierrez: [C: 03+1] sre: add varnish/haproxy availability pages [alerts] - 10https://gerrit.wikimedia.org/r/789575 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [08:59:21] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35269/console" [puppet] - 10https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [09:00:20] (03CR) 10Slyngshede: Move restart of slapd, due to memory leaks, to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [09:06:32] (03PS2) 10David Caro: wmcs-k8s-node-upgrade: add some extra logs [puppet] - 10https://gerrit.wikimedia.org/r/791348 [09:06:34] (03PS1) 10David Caro: wmcs-k8s-node-upgrade: black and isort [puppet] - 10https://gerrit.wikimedia.org/r/792112 [09:07:00] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:07:27] (03CR) 10jerkins-bot: [V: 04-1] wmcs-k8s-node-upgrade: add some extra logs [puppet] - 10https://gerrit.wikimedia.org/r/791348 (owner: 10David Caro) [09:07:53] (03PS1) 10Slyngshede: Move mydumper from cron to systemd timer. 
[puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) [09:08:16] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:09:28] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/791596 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:13:32] (03CR) 10Jbond: [C: 03+1] etcd::tlsproxy: set use_cergen to true [puppet] - 10https://gerrit.wikimedia.org/r/788437 (https://phabricator.wikimedia.org/T307382) (owner: 10Dzahn) [09:13:49] (03CR) 10Jbond: P:etcd::tlsproxy: move to cfssl pki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) (owner: 10Jbond) [09:14:04] (03PS3) 10David Caro: wmcs-k8s-node-upgrade: add some extra logs [puppet] - 10https://gerrit.wikimedia.org/r/791348 [09:14:20] (03PS2) 10Slyngshede: Move mydumper from cron to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) [09:15:29] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35271/console" [puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [09:22:03] (03CR) 10Hashar: C:helm: make the group permissions on helm_cache configurable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791565 (https://phabricator.wikimedia.org/T305729) (owner: 10Jbond) [09:23:55] (03CR) 10Jbond: [C: 04-1] "This is actually the cert that is being used in production (along with an old CA cert). 
We should update it/switch it to pki first" [puppet] - 10https://gerrit.wikimedia.org/r/791677 (owner: 10Dzahn) [09:25:31] (03CR) 10Jbond: "LGTM but best to collect a +1 from traffic on this one" [puppet] - 10https://gerrit.wikimedia.org/r/791678 (owner: 10Dzahn) [09:26:35] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/792107 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [09:27:08] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:29:55] (03PS1) 10Slyngshede: Move Hadoop eventlogs cleanup to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/792116 (https://phabricator.wikimedia.org/T273673) [09:31:48] (03Abandoned) 10Kosta Harlan: GrowthExperiments: Update campaign configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791770 (https://phabricator.wikimedia.org/T307521) (owner: 10Gergő Tisza) [09:32:27] (03CR) 10jerkins-bot: [V: 04-1] Move Hadoop eventlogs cleanup to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/792116 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [09:34:59] (03PS2) 10Jbond: C:helm: make the group permissions on helm_cache configurable [puppet] - 10https://gerrit.wikimedia.org/r/791565 (https://phabricator.wikimedia.org/T305729) [09:35:03] (03CR) 10Jbond: "thanks updated" [puppet] - 10https://gerrit.wikimedia.org/r/791565 (https://phabricator.wikimedia.org/T305729) (owner: 10Jbond) [09:35:35] (03PS2) 10Slyngshede: Move Hadoop eventlogs cleanup to systemd timer. 
[puppet] - 10https://gerrit.wikimedia.org/r/792116 (https://phabricator.wikimedia.org/T273673) [09:35:47] (03CR) 10Jcrespo: [C: 04-1] "Not against the movement, but in the current environment, the backup automation returns a failure every time a single backup of the batch " [puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [09:38:02] (03CR) 10Jbond: prometheus::blackbox::check: add new blackbox exporter check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [09:40:28] (03PS2) 10Jelto: acme_chief: add new gitlab hosts to acl for gitlab certificate [puppet] - 10https://gerrit.wikimedia.org/r/792107 (https://phabricator.wikimedia.org/T307142) [09:42:12] (03CR) 10Jelto: acme_chief: add new gitlab hosts to acl for gitlab certificate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792107 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [09:44:16] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35272/console" [puppet] - 10https://gerrit.wikimedia.org/r/792116 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [09:46:31] (03PS1) 10Jbond: puppetdbquery: remove module [puppet] - 10https://gerrit.wikimedia.org/r/792117 [09:47:22] (03CR) 10Jbond: [C: 03+1] acme_chief: add new gitlab hosts to acl for gitlab certificate [puppet] - 10https://gerrit.wikimedia.org/r/792107 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [09:49:03] (03CR) 10Ladsgroup: [C: 03+1] Export exim queue length from mx and lists [puppet] - 10https://gerrit.wikimedia.org/r/791615 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [09:58:25] (03CR) 10David Caro: "Got a question (if the answer is yes feel free to merge), the rest are nits you can ignore." 
[puppet] - 10https://gerrit.wikimedia.org/r/791613 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [09:59:53] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (10hnowlan) restbase1026 can be moved with a few minutes notice without impact. Only requirement is that it stay in a D rack, as stated [10:00:59] (03CR) 10Hnowlan: [C: 03+1] "lgtm" [labs/private] - 10https://gerrit.wikimedia.org/r/791667 (https://phabricator.wikimedia.org/T307798) (owner: 10Eevans) [10:01:56] (03CR) 10Jelto: [C: 03+2] acme_chief: add new gitlab hosts to acl for gitlab certificate [puppet] - 10https://gerrit.wikimedia.org/r/792107 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [10:01:58] (03PS1) 10Slyngshede: Move l10nupdate to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/792121 (https://phabricator.wikimedia.org/T273673) [10:02:29] (03CR) 10Ladsgroup: auto_schema: Make alter non-blocking on master of primary dc (032 comments) [software] - 10https://gerrit.wikimedia.org/r/791297 (owner: 10Ladsgroup) [10:02:49] (03CR) 10jerkins-bot: [V: 04-1] Move l10nupdate to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/792121 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [10:03:21] (03CR) 10Volans: [C: 03+2] cumin: use homer ssh config for lsw devices [puppet] - 10https://gerrit.wikimedia.org/r/791328 (owner: 10Volans) [10:09:26] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:10:44] (03PS1) 10Ladsgroup: RestrictionStore: Add support for templatelinks migration [core] (wmf/1.39.0-wmf.11) - 10https://gerrit.wikimedia.org/r/791745 (https://phabricator.wikimedia.org/T308207) [10:11:23] jouncebot: nowandnext [10:11:24] No deployments scheduled for the next 1 hour(s) and 48 minute(s) [10:11:24] In 1 hour(s) and 48 minute(s): New wiki creation 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220516T1200) [10:11:30] (03CR) 10Ladsgroup: [C: 03+2] RestrictionStore: Add support for templatelinks migration [core] (wmf/1.39.0-wmf.11) - 10https://gerrit.wikimedia.org/r/791745 (https://phabricator.wikimedia.org/T308207) (owner: 10Ladsgroup) [10:13:36] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [10:16:50] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:28:11] (03Merged) 10jenkins-bot: RestrictionStore: Add support for templatelinks migration [core] (wmf/1.39.0-wmf.11) - 10https://gerrit.wikimedia.org/r/791745 (https://phabricator.wikimedia.org/T308207) (owner: 10Ladsgroup) [10:31:04] (03PS1) 10Ladsgroup: RestrictionStore: Add support for templatelinks migration [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/792126 (https://phabricator.wikimedia.org/T308207) [10:31:09] (03CR) 10Ladsgroup: [C: 03+2] RestrictionStore: Add support for templatelinks migration [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/792126 (https://phabricator.wikimedia.org/T308207) (owner: 10Ladsgroup) [10:32:03] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:34:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:23] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:37:57] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:38:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:38:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:06] (03CR) 10David Caro: "There's a few other places that have the name hardcoded:" [puppet] - 10https://gerrit.wikimedia.org/r/791585 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [10:40:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:21] (03PS1) 10Volans: cluster::management: backup auditing logs [puppet] - 10https://gerrit.wikimedia.org/r/792125 (https://phabricator.wikimedia.org/T304497) [10:44:37] (03PS2) 10Slyngshede: Move l10nupdate to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/792121 (https://phabricator.wikimedia.org/T273673) [10:45:30] (03PS2) 10Hnowlan: Set production role and add config for restbase2027 [puppet] - 10https://gerrit.wikimedia.org/r/779846 [10:50:27] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [10:54:49] (03PS3) 10Slyngshede: Remove unused l10nupdate class. 
[puppet] - 10https://gerrit.wikimedia.org/r/792121 (https://phabricator.wikimedia.org/T273673) [10:55:16] (03Merged) 10jenkins-bot: RestrictionStore: Add support for templatelinks migration [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/792126 (https://phabricator.wikimedia.org/T308207) (owner: 10Ladsgroup) [10:57:26] (03CR) 10Jbond: [C: 04-1] "We still have uses of query_resources" [puppet] - 10https://gerrit.wikimedia.org/r/792117 (owner: 10Jbond) [10:57:58] !log test HAProxy 2.4.17 on cp4026 and cp4032 [10:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:40] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [11:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:02:04] (03CR) 10Jbond: [C: 03+2] requestctl_checkip: Addressing post-merge optimisation comments [puppet] - 10https://gerrit.wikimedia.org/r/791313 (owner: 10Jbond) [11:02:43] (03PS1) 10Sergio Gimeno: GrowthExperiments: Update campaigns benefit list config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792149 (https://phabricator.wikimedia.org/T305659) [11:03:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:03:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:03:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:02] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:04:08] (03PS1) 10Ayounsi: Clean up local IDE errors/warnings/diffs [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792151 [11:05:04] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792151 (owner: 10Ayounsi) [11:06:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:39] (03PS1) 10Slyngshede: Move Carbon Cache log cleanup to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/792155 (https://phabricator.wikimedia.org/T273673) [11:08:56] this sync might trigger a bit of errors but it should clear up quickly [11:09:15] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.10/includes: Backport: [[gerrit:792126|RestrictionStore: Add support for templatelinks migration (T308207)]] (duration: 00m 54s) [11:09:19] (03CR) 10jerkins-bot: [V: 04-1] Move Carbon Cache log cleanup to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/792155 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:20] T308207: ApiQueryInfo::getProtectionInfo is slow on normalized templatelinks - https://phabricator.wikimedia.org/T308207 [11:09:24] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, and 3 others: Access to trusted gitlab runners for gitlab-roots (or appropriate similar group) - https://phabricator.wikimedia.org/T308350 (10jbond) >>! In T308350#7928087, @thcipriani wrote: > Sounds good from from my side: seems an... 
[11:10:16] (03PS2) 10Slyngshede: Move Carbon Cache log cleanup to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/792155 (https://phabricator.wikimedia.org/T273673) [11:10:18] (03PS1) 10Ayounsi: wmf-netbox: convert format to f-string [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792156 [11:10:50] (03CR) 10Ayounsi: "Tested with a diff in *drmrs*: NOOP." [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792156 (owner: 10Ayounsi) [11:11:07] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Clean up local IDE errors/warnings/diffs [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792151 (owner: 10Ayounsi) [11:11:29] (03PS2) 10Sergio Gimeno: GrowthExperiments: Update campaigns benefit list config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792149 (https://phabricator.wikimedia.org/T305659) [11:12:08] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792156 (owner: 10Ayounsi) [11:14:55] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [11:21:21] (03PS3) 10Slyngshede: Move Carbon Cache log cleanup to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/792155 (https://phabricator.wikimedia.org/T273673) [11:22:22] (03CR) 10jerkins-bot: [V: 04-1] Move Carbon Cache log cleanup to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/792155 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:24:22] (03PS4) 10Slyngshede: Move Carbon Cache log cleanup to systemd timer. 
[puppet] - 10https://gerrit.wikimedia.org/r/792155 (https://phabricator.wikimedia.org/T273673) [11:25:28] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [11:25:57] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] wmf-netbox: convert format to f-string [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792156 (owner: 10Ayounsi) [11:26:55] !log asw2-ulsfo fix MTU on 2 interfaces [11:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:33] (03PS5) 10Slyngshede: Move Carbon Cache log cleanup to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/792155 (https://phabricator.wikimedia.org/T273673) [11:30:20] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync [11:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:25] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync [11:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:53] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35279/console" [puppet] - 10https://gerrit.wikimedia.org/r/792155 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:31:01] (03PS2) 10KartikMistry: Enable Section Translation in bcl, is, ne, pa, ts and ur Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791481 (https://phabricator.wikimedia.org/T304828) [11:31:10] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [11:31:14] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [11:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[11:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:14] (03PS1) 10Ladsgroup: dbbackups: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792171 (https://phabricator.wikimedia.org/T308013) [11:39:00] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [11:40:32] jouncebot: next [11:40:33] In 0 hour(s) and 19 minute(s): New wiki creation (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220516T1200) [11:40:40] (03PS1) 10Jbond: P:netbox: Add support for cfssl PKI [puppet] - 10https://gerrit.wikimedia.org/r/792174 (https://phabricator.wikimedia.org/T296452) [11:40:42] (03PS1) 10Ladsgroup: dbtree: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792175 (https://phabricator.wikimedia.org/T308013) [11:40:44] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10elukey) Added the proposed node labels to ml-serve-eqiad via T308418#7930118. At this point I'll wait to see what strategy is be... 
[11:40:54] (03PS1) 10Ladsgroup: mediabackup: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792176 (https://phabricator.wikimedia.org/T308013) [11:40:58] (03PS1) 10Ladsgroup: proxysql: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792177 (https://phabricator.wikimedia.org/T308013) [11:41:02] (03PS1) 10Ladsgroup: orchestrator: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792178 (https://phabricator.wikimedia.org/T308013) [11:41:06] (03PS3) 10Urbanecm: Initial configuration for kcgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791385 (https://phabricator.wikimedia.org/T305279) [11:41:22] (03CR) 10jerkins-bot: [V: 04-1] P:netbox: Add support for cfssl PKI [puppet] - 10https://gerrit.wikimedia.org/r/792174 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:41:37] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35280/console" [puppet] - 10https://gerrit.wikimedia.org/r/792174 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:47:56] (03Abandoned) 10Ayounsi: wmf-netbox: refactor _get_junos_interfaces [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/780304 (owner: 10Ayounsi) [11:48:02] (03CR) 10Jbond: [C: 04-1] "-1: see comments" [puppet] - 10https://gerrit.wikimedia.org/r/792176 (https://phabricator.wikimedia.org/T308013) (owner: 10Ladsgroup) [11:48:22] (03CR) 10Jbond: [C: 03+1] "LGTM thx" [puppet] - 10https://gerrit.wikimedia.org/r/792177 (https://phabricator.wikimedia.org/T308013) (owner: 10Ladsgroup) [11:48:24] (03PS1) 10Ayounsi: wmf-netbox: only return the MTU if the interface is enabled [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792181 [11:48:49] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/792178 (https://phabricator.wikimedia.org/T308013) (owner: 10Ladsgroup) [11:48:56] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/792171 (https://phabricator.wikimedia.org/T308013) (owner: 10Ladsgroup) [11:49:44] (03CR) 10Ladsgroup: mediabackup: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792176 (https://phabricator.wikimedia.org/T308013) (owner: 10Ladsgroup) [11:50:18] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [11:51:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:55:14] (03PS2) 10Jbond: P:netbox: Add support for cfssl PKI [puppet] - 10https://gerrit.wikimedia.org/r/792174 (https://phabricator.wikimedia.org/T296452) [11:55:49] (03CR) 10Jbond: [C: 04-1] mediabackup: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792176 (https://phabricator.wikimedia.org/T308013) (owner: 10Ladsgroup) [11:57:37] (03PS3) 10Jbond: P:netbox: Add support for cfssl PKI [puppet] - 10https://gerrit.wikimedia.org/r/792174 (https://phabricator.wikimedia.org/T296452) [11:58:40] (03PS1) 10Slyngshede: Move automated target generation of Prometheus targets to systemd timer. 
[puppet] - 10https://gerrit.wikimedia.org/r/792185 (https://phabricator.wikimedia.org/T273673) [11:58:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35282/console" [puppet] - 10https://gerrit.wikimedia.org/r/792174 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:59:28] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1081.eqiad.wmnet with reason: T308267 [11:59:30] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1081.eqiad.wmnet with reason: T308267 [11:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:34] T308267: RAID battery malfunction in an-worker1081 - https://phabricator.wikimedia.org/T308267 [11:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:05] Urbanecm and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for New wiki creation . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220516T1200). [12:00:23] (03CR) 10jerkins-bot: [V: 04-1] Move automated target generation of Prometheus targets to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/792185 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [12:01:23] (03CR) 10Ayounsi: "NOOP in ulsfo." [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792181 (owner: 10Ayounsi) [12:01:25] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [12:01:39] o/ [12:01:55] (03PS2) 10Slyngshede: Move automated target generation of Prometheus targets to systemd timer. 
[puppet] - 10https://gerrit.wikimedia.org/r/792185 (https://phabricator.wikimedia.org/T273673) [12:02:22] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10jcrespo) Could you expand on why Apache 2 specifically (e.g. vs MIT or BSD?)- is it because trademarks? [12:02:42] (03CR) 10jerkins-bot: [V: 04-1] Move automated target generation of Prometheus targets to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/792185 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [12:03:14] Amir1: hey, I'm here [12:03:16] let's start? [12:03:31] (03PS3) 10Slyngshede: Move automated target generation of Prometheus targets to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/792185 (https://phabricator.wikimedia.org/T273673) [12:04:32] sure [12:05:10] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:05:43] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for kcgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791385 (https://phabricator.wikimedia.org/T305279) (owner: 10Urbanecm) [12:06:37] (03PS1) 10Ayounsi: wmf-netbox: create _get_junos_interfaces [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792186 [12:06:58] (03Merged) 10jenkins-bot: Initial configuration for kcgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791385 (https://phabricator.wikimedia.org/T305279) (owner: 10Urbanecm) [12:07:11] (03CR) 10Jcrespo: [C: 03+1] "Looks good- although one question- should we rename the fileset as simply "logs", as in the future we may want to add other ones? E.g. 
we " [puppet] - 10https://gerrit.wikimedia.org/r/792125 (https://phabricator.wikimedia.org/T304497) (owner: 10Volans) [12:07:29] (03PS1) 10Jbond: rake spdx: update regex to match full string [puppet] - 10https://gerrit.wikimedia.org/r/792188 [12:07:57] (03CR) 10Ayounsi: "Riccardo, not sure I understand your comment from https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/780304/1/plugins/wmf" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792186 (owner: 10Ayounsi) [12:08:01] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: add varnish/haproxy availability pages [alerts] - 10https://gerrit.wikimedia.org/r/789575 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [12:08:06] (03PS2) 10Filippo Giunchedi: sre: add varnish/haproxy availability pages [alerts] - 10https://gerrit.wikimedia.org/r/789575 (https://phabricator.wikimedia.org/T305847) [12:08:10] (03CR) 10Jbond: [C: 04-1] mediabackup: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792176 (https://phabricator.wikimedia.org/T308013) (owner: 10Ladsgroup) [12:08:21] (03CR) 10Jbond: [C: 03+2] rake spdx: update regex to match full string [puppet] - 10https://gerrit.wikimedia.org/r/792188 (owner: 10Jbond) [12:08:43] addwiki.php ran, db created in right place [12:09:22] and wiki's up, syncing [12:10:50] !log urbanecm@deploy1002 Synchronized wmf-config/db-production.php: Creating kcgwiki (T305279) (duration: 00m 49s) [12:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:56] T305279: Create Wikipedia Tyap - https://phabricator.wikimedia.org/T305279 [12:11:05] \o/ [12:11:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:41] !log urbanecm@deploy1002 Synchronized dblists: Creating kcgwiki (T305279) (duration: 00m 50s) [12:11:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:22] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:12:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:12:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:57] (03PS1) 10Jbond: rake spdx: remove .. ahem .. debugging [puppet] - 10https://gerrit.wikimedia.org/r/792189 [12:13:08] !log urbanecm@deploy1002 rebuilt and synchronized wikiversions files: Creating kcgwiki (T305279) [12:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:19] (03PS2) 10Jbond: rake spdx: remove .. ahem .. debugging [puppet] - 10https://gerrit.wikimedia.org/r/792189 [12:13:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:13:23] (03CR) 10Jbond: [V: 03+2 C: 03+2] rake spdx: remove .. ahem .. 
debugging [puppet] - 10https://gerrit.wikimedia.org/r/792189 (owner: 10Jbond) [12:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:58] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: Creating kcgwiki (T305279) (duration: 00m 49s) [12:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:48] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: Creating kcgwiki (T305279) (duration: 00m 49s) [12:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:58] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [12:15:37] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Creating kcgwiki (T305279) (duration: 00m 48s) [12:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:13] (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792190 (https://phabricator.wikimedia.org/T305279) [12:18:15] (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792190 (https://phabricator.wikimedia.org/T305279) (owner: 10Urbanecm) [12:19:04] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792190 (https://phabricator.wikimedia.org/T305279) (owner: 10Urbanecm) [12:21:20] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:21:35] !log urbanecm@deploy1002 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 00m 49s) [12:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:51] (03CR) 10Jbond: 
[V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35283/console" [puppet] - 10https://gerrit.wikimedia.org/r/792174 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [12:22:27] (03CR) 10Filippo Giunchedi: "Thank you for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/791613 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [12:22:50] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:23:04] Amir1: i guess that's all? [12:23:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:23:28] Assume so [12:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:52] in that case, thanks for standing by. [12:23:57] Cheers [12:24:00] (03PS2) 10Btullis: Update the LDAP authentication for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/791061 (https://phabricator.wikimedia.org/T301462) [12:24:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:24:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:47] (03PS3) 10Btullis: Update the LDAP authentication for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/791061 (https://phabricator.wikimedia.org/T301462) [12:26:09] (03CR) 10Ladsgroup: rake spdx: update regex to match full string (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/792188 (owner: 10Jbond) [12:26:12] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [12:27:18] (03PS1) 10Stang: yiwiktionary: Update desktop logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792192 (https://phabricator.wikimedia.org/T308411) [12:27:49] (03PS4) 10Jbond: P:netbox: Add support for cfssl PKI [puppet] - 10https://gerrit.wikimedia.org/r/792174 (https://phabricator.wikimedia.org/T296452) [12:28:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:34] (03CR) 10Volans: "reply inline" [puppet] - 10https://gerrit.wikimedia.org/r/792125 (https://phabricator.wikimedia.org/T304497) (owner: 10Volans) [12:30:29] (03PS1) 10Ayounsi: wmf-netbox: prefix disabled interfaces with DISABLED [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792193 [12:30:46] (03CR) 10Volans: [C: 03+1] "makes sense to me" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792181 (owner: 10Ayounsi) [12:30:58] (03CR) 10Volans: wmf-netbox: create _get_junos_interfaces (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792186 (owner: 10Ayounsi) [12:32:17] (03CR) 10Volans: [C: 03+1] "makes sense to me" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792193 (owner: 10Ayounsi) [12:33:39] 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1081 - https://phabricator.wikimedia.org/T308434 (10BTullis) [12:35:01] (03PS5) 10Jbond: P:netbox: Add support for cfssl PKI [puppet] - 10https://gerrit.wikimedia.org/r/792174 (https://phabricator.wikimedia.org/T296452) [12:35:45] (03CR) 10Btullis: [C: 03+2] Update the LDAP authentication for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/791061 (https://phabricator.wikimedia.org/T301462) (owner: 10Btullis) [12:35:53] (03CR) 10jerkins-bot: [V: 04-1] P:netbox: Add support for cfssl PKI [puppet] - 
10https://gerrit.wikimedia.org/r/792174 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [12:36:04] (03CR) 10Jbond: [C: 03+2] rake spdx: update regex to match full string (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/792188 (owner: 10Jbond) [12:37:20] (03CR) 10Jcrespo: [C: 03+1] "👍" [puppet] - 10https://gerrit.wikimedia.org/r/792125 (https://phabricator.wikimedia.org/T304497) (owner: 10Volans) [12:37:22] (03PS6) 10Jbond: P:netbox: Add support for cfssl PKI [puppet] - 10https://gerrit.wikimedia.org/r/792174 (https://phabricator.wikimedia.org/T296452) [12:37:30] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [12:38:25] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35286/console" [puppet] - 10https://gerrit.wikimedia.org/r/792174 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [12:38:53] (03CR) 10Filippo Giunchedi: prometheus::blackbox::check: add new blackbox exporter check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [12:39:00] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:40:00] (03PS1) 10Stang: hewiktionary: Add localized mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792196 (https://phabricator.wikimedia.org/T308411) [12:40:24] (03Merged) 10jenkins-bot: Update the LDAP authentication for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/791061 (https://phabricator.wikimedia.org/T301462) (owner: 10Btullis) [12:41:03] (03CR) 10jerkins-bot: [V: 
04-1] hewiktionary: Add localized mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792196 (https://phabricator.wikimedia.org/T308411) (owner: 10Stang) [12:42:23] (03PS1) 10Stang: yiwiktionary: Add localized mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792197 (https://phabricator.wikimedia.org/T308411) [12:43:07] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [12:43:09] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: apply on main [12:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:34] (03PS2) 10Stang: hewiktionary: Add localized mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792196 (https://phabricator.wikimedia.org/T308411) [12:43:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:45:42] (03PS1) 10Btullis: Bump the datahub chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/792200 [12:49:56] 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1081 - https://phabricator.wikimedia.org/T308434 (10BTullis) a:05BTullis→03None [12:51:05] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [12:54:02] (03CR) 10Btullis: [C: 03+2] Bump the datahub chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/792200 (owner: 10Btullis) [12:56:56] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you for this, voting +1 to not block the 
change. Though see inline for a realization I had and how it can lead to a simpler solution" [puppet] - 10https://gerrit.wikimedia.org/r/792155 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [12:58:10] (03PS1) 10Jbond: O:netbox::standalone: move netbox-next to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/792201 (https://phabricator.wikimedia.org/T296452) [12:59:37] (03Merged) 10jenkins-bot: Bump the datahub chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/792200 (owner: 10Btullis) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: (Dis)respected human, time to deploy UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220516T1300). Please do the needful. [13:00:04] koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:16] hi [13:00:56] (03PS7) 10Jbond: P:netbox: Add support for cfssl PKI [puppet] - 10https://gerrit.wikimedia.org/r/792174 (https://phabricator.wikimedia.org/T296452) [13:01:15] (03PS2) 10Jbond: O:netbox::standalone: move netbox-next to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/792201 (https://phabricator.wikimedia.org/T296452) [13:02:13] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35288/console" [puppet] - 10https://gerrit.wikimedia.org/r/792201 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:02:20] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [13:03:08] (03CR) 10Jbond: O:netbox::standalone: move netbox-next to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/792201 
(https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:06:14] I can deploy [13:07:48] (03PS2) 10Lucas Werkmeister (WMDE): thwikibooks: Enable babel categorize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791717 (https://phabricator.wikimedia.org/T308378) (owner: 10Stang) [13:08:30] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] thwikibooks: Enable babel categorize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791717 (https://phabricator.wikimedia.org/T308378) (owner: 10Stang) [13:09:17] (03Merged) 10jenkins-bot: thwikibooks: Enable babel categorize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791717 (https://phabricator.wikimedia.org/T308378) (owner: 10Stang) [13:09:28] (03PS2) 10Ayounsi: wmf-netbox: create _get_junos_interfaces [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792186 [13:09:30] (03PS2) 10Ayounsi: wmf-netbox: prefix disabled interfaces with DISABLED [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792193 [13:10:49] koi: the first patch (babel) should be on mwdebug1001, can you test it? [13:10:56] looking [13:12:36] (03PS2) 10Lucas Werkmeister (WMDE): thwikibooks: Add NS 104 and 106 to wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791722 (https://phabricator.wikimedia.org/T308376) (owner: 10Stang) [13:13:40] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [13:13:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:04] hmm, I'm not sure: will those cat be created automatically, or should have to run some maint script [13:14:26] not sure [13:14:34] are the category pages supposed to be created by the software? 
[13:14:35] (03CR) 10Volans: "The diff between PS1 and PS2 looks good to me. The general logic seems ok, but I lack context on the whole consequences on the templates f" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792186 (owner: 10Ayounsi) [13:14:49] I would have thought you need to purge an affected user page on mwdebug and then it’ll be categorized [13:14:56] but the creation of the category pages themselves might be left to users [13:14:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:14:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:14:59] (no idea if that’s true though) [13:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:43] well, there does exist a user that automatically creates such categories [13:16:45] https://meta.wikimedia.org/wiki/User:Babel_AutoCreate [13:16:50] hm, [13:16:52] I see [13:17:12] but it is considered to be malfunctioning and is blocked on many sites :( [13:17:34] I don’t see any maintenance scripts in the Babel extension [13:19:04] oh, I see this patch works [13:19:13] so LGTM, ok to sync [13:19:30] indeed, seems to work on https://th.wikibooks.org/wiki/%E0%B8%9C%E0%B8%B9%E0%B9%89%E0%B9%83%E0%B8%8A%E0%B9%89:Geonuch after a purge [13:19:40] (apparently it only works for local user pages? 
I tried it with yours first but that one is global) [13:19:42] ok, syncing [13:20:35] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] thwikibooks: Add NS 104 and 106 to wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791722 (https://phabricator.wikimedia.org/T308376) (owner: 10Stang) [13:20:49] ^ not sure if this one is testable [13:20:55] since search indexing is probably not affected by mwdebug… [13:20:57] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:791717|thwikibooks: Enable babel categorize (T308378)]] (duration: 00m 52s) [13:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:07] T308378: Enable babel categorize on thwikibooks - https://phabricator.wikimedia.org/T308378 [13:21:42] it is, I will try [13:22:02] (content namespace will not be affected by __NOINDEX__) [13:22:13] ok [13:22:25] (03Merged) 10jenkins-bot: thwikibooks: Add NS 104 and 106 to wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791722 (https://phabricator.wikimedia.org/T308376) (owner: 10Stang) [13:22:35] I wonder if I should run updateArticleCount after the change or not [13:22:58] koi: change is on mwdebug1001 now [13:23:14] looking [13:24:32] oh, a much easier way to test: page in those NS could be shown when visiting Special:RandomArticle [13:24:48] !log free up space on thanos-be2001 on /var/log/spool/rsyslog [13:24:51] well, easier depending on how many pages there are in each namespace I guess ^^ [13:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:55] depending on the randomness [13:25:03] but good point [13:25:05] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [13:25:22] well, tested 10 times and see 
two page in each newly-added ns :) [13:25:42] ok :) [13:26:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:37] (03PS1) 10Ayounsi: msw: use _get_junos_interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/792204 [13:26:39] (03PS1) 10Ayounsi: mr: use _get_junos_interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/792205 [13:28:38] syncing [13:28:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:28:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:58] (03Abandoned) 10Ayounsi: Use the new _get_junos_interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/780310 (owner: 10Ayounsi) [13:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:23] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:791722|thwikibooks: Add NS 104 and 106 to wgContentNamespaces (T308376)]] (duration: 00m 53s) [13:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:27] T308376: Add some namespaces to $wgContentNamespaces for thwikibooks - https://phabricator.wikimedia.org/T308376 [13:29:49] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript updateArticleCount.php thwikibooks --update # T308376 [basically instantaneous, 1558 articles] [13:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:26] (03CR) 10David Caro: prometheus: refactor prometheus-node-exim-queue (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/791613 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:31:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:31:24] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:10] (03PS2) 10Lucas Werkmeister (WMDE): thwikibooks: set wgRestrictDisplayTitle to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791724 (https://phabricator.wikimedia.org/T308375) (owner: 10Stang) [13:32:17] (03PS3) 10Samtar: InitialiseSettings: Enable SandboxLink for uzwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791787 (https://phabricator.wikimedia.org/T308399) [13:33:01] TheresNoTime: 👀 [13:33:22] RECOVERY - puppet last run on thanos-fe2001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:33:44] 👀 [13:34:05] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "Seems to be not unusual for Wikibookses. (Wikibooksen? ;))" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791724 (https://phabricator.wikimedia.org/T308375) (owner: 10Stang) [13:34:56] (03PS1) 10Ayounsi: cr: use _get_junos_interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/792208 [13:35:01] (03Merged) 10jenkins-bot: thwikibooks: set wgRestrictDisplayTitle to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791724 (https://phabricator.wikimedia.org/T308375) (owner: 10Stang) [13:35:20] koi: displaytitle change is on mwdebug1001 [13:35:29] looking [13:36:12] RECOVERY - Disk space on thanos-fe2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-fe2001&var-datasource=codfw+prometheus/ops [13:36:24] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [13:37:11] yeah, it works in https://th.wikibooks.org/wiki/Project:SB [13:37:22] ok [13:38:07] (03PS1) 10JHathaway: mirrors: remove 
nginx-common [puppet] - 10https://gerrit.wikimedia.org/r/792209 [13:38:16] (syncing) [13:38:23] (03PS2) 10Lucas Werkmeister (WMDE): thwikibooks: Enable import [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791725 (https://phabricator.wikimedia.org/T308374) (owner: 10Stang) [13:38:43] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:791724|thwikibooks: set wgRestrictDisplayTitle to false (T308375)]] (duration: 00m 50s) [13:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:49] T308375: Set $wgRestrictDisplayTitle on thwikibooks - https://phabricator.wikimedia.org/T308375 [13:40:07] not sure if I can continue the deploy, sorry [13:40:12] if anyone else is around, feel free to take over [13:40:34] (03PS1) 10Ayounsi: commons: use _get_junos_interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/792210 [13:40:41] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mirrors: remove nginx-common [puppet] - 10https://gerrit.wikimedia.org/r/792209 (owner: 10JHathaway) [13:41:16] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [13:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:59] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [13:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:42:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:43:08] (03CR) 10JHathaway: [C: 03+2] mirrors: remove nginx-common [puppet] - 10https://gerrit.wikimedia.org/r/792209 (owner: 10JHathaway) [13:43:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:42] (03CR) 10Filippo Giunchedi: prometheus: refactor prometheus-node-exim-queue (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791613 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:46:48] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [13:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:23] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [13:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:09] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/787911 (owner: 10Majavah) [13:50:00] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [13:50:01] (03CR) 10David Caro: [C: 03+1] "LGTM then :)" [puppet] - 10https://gerrit.wikimedia.org/r/791613 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:50:21] !log fix MTUs on asw-b-codfw [13:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:24] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [13:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:18] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [13:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:24] 
PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [14:05:54] PROBLEM - MegaRAID on an-worker1081 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:06:34] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: refactor prometheus-node-exim-queue (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/791613 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [14:06:40] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you for the review David" [puppet] - 10https://gerrit.wikimedia.org/r/791613 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [14:06:47] (03PS2) 10Filippo Giunchedi: prometheus: refactor prometheus-node-exim-queue [puppet] - 10https://gerrit.wikimedia.org/r/791613 (https://phabricator.wikimedia.org/T305847) [14:09:35] 10SRE, 10Data-Engineering: an-worker1081 MEGARAID write cache policy alerting - https://phabricator.wikimedia.org/T308442 (10RhinosF1) [14:10:32] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [14:12:10] alright, I’m back but the backport window is over [14:12:14] jouncebot: nowandnext [14:12:15] No deployments scheduled for the next 1 hour(s) and 17 minute(s) [14:12:15] In 1 hour(s) and 17 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220516T1530) [14:12:24] RECOVERY - 
Check systemd state on thanos-fe2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:37] 10SRE, 10Data-Engineering: an-worker1081 MEGARAID write cache policy alerting - https://phabricator.wikimedia.org/T308442 (10BTullis) Sorry, I've already been working on this in {T308267} but I must have allowed the acknowledgement to lapse. I will close this as a duplicate, but add a downtime on that service... [14:12:40] theoretically we have a break now but unless some of those changes are super important I would prefer to just leave them for later [14:12:45] sorry about that koi [14:13:10] that's ok, I put those into next window [14:13:23] ok [14:13:27] 10SRE, 10Data-Engineering: an-worker1081 MEGARAID write cache policy alerting - https://phabricator.wikimedia.org/T308442 (10BTullis) [14:14:46] !log UTC afternoon backport+config window done (just for the record; actual last backport was half an hour ago) [14:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:56] !log bump disk space in prometheus codfw k8s-ml-serve (+30G) [14:14:59] ACKNOWLEDGEMENT - MegaRAID on an-worker1081 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis Working on this in T308267 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:39] jouncebot: nowandnext [14:19:39] No deployments scheduled for the next 1 hour(s) and 10 minute(s) [14:19:39] In 1 hour(s) and 10 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220516T1530) [14:21:42] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM request for 
turnilo/superset staging on Bullseye - https://phabricator.wikimedia.org/T306213 (10razzi) 05Resolved→03Open Upgrading the VM worked for Turnilo, but Superset needs updating before it will work on Bullseye. Generally there's no gu... [14:21:55] (03CR) 10Ladsgroup: [C: 03+2] "This change is ready for review." [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/792136 (https://phabricator.wikimedia.org/T306673) (owner: 10Ladsgroup) [14:24:08] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [14:28:25] (03PS1) 10Majavah: add several .mailmap entries [puppet] - 10https://gerrit.wikimedia.org/r/792216 [14:30:14] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Majavah) [14:35:32] (03CR) 10Herron: [C: 03+1] Export exim queue length from mx and lists [puppet] - 10https://gerrit.wikimedia.org/r/791615 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [14:35:32] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [14:37:57] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:38:25] (03Merged) 10jenkins-bot: ApiQueryBacklinksprop: Force the correct templatelinks index on read new [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/792136 (https://phabricator.wikimedia.org/T306673) (owner: 10Ladsgroup) [14:38:52] 
10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Zabe) [14:41:48] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10akosiaris) > Regarding the "fake nodes": I think that could be done with adding the leafs as [[ https://projectcalico.docs.tiger... [14:42:01] !log fix MTUs on asw-c-codfw [14:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:17] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/792216 (owner: 10Majavah) [14:43:27] (03CR) 10Jbond: [C: 03+2] add several .mailmap entries [puppet] - 10https://gerrit.wikimedia.org/r/792216 (owner: 10Majavah) [14:43:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:44:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:13] !log ladsgroup@deploy1002 scap failed: average error rate on 3/8 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org for details) [14:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:47] weird [14:49:08] lots of notices for “undefined index: indexes” it seems [14:49:16] I guess you are missing the 'if ( 
!empty( $settings['indexes'] ) ) {' now [14:50:08] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.10/includes/api/ApiQueryBacklinksprop.php: Backport: Revert: [[gerrit:792136|ApiQueryBacklinksprop: Force the correct templatelinks index on read new (T306673)]] (duration: 00m 50s) [14:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:14] T306673: Turn on read new for templatelinks on beta and production - https://phabricator.wikimedia.org/T306673 [14:50:43] (03CR) 10David Caro: [C: 03+1] nova_fullstack_test: abuse the cloud.instance.name field to hold the test VM [puppet] - 10https://gerrit.wikimedia.org/r/791668 (owner: 10Andrew Bogott) [14:52:35] (03PS1) 10Ladsgroup: Revert "ApiQueryBacklinksprop: Force the correct templatelinks index on read new" [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/792218 [14:53:06] (03CR) 10David Caro: [C: 03+2] toolsdb: add gtid_domain_id [puppet] - 10https://gerrit.wikimedia.org/r/789588 (https://phabricator.wikimedia.org/T301993) (owner: 10Majavah) [14:53:07] 10SRE, 10serviceops: Service Ops SRE support for iOS notifications update - https://phabricator.wikimedia.org/T306397 (10akosiaris) >>! In T306397#7927205, @Tsevener wrote: > @akosiaris cool, thanks! My instinct is that it feels a bit low - I wonder if pushes are getting dropped somewhere. It would be cool if... [14:53:40] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 4 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [14:56:19] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10jbond) >>! In T308013#7930591, @jcrespo wrote: > Could you expand on why Apache 2 specifically (e.g. vs MIT or BSD?)- is it because trademarks? See T67270 for discussion,... 
[14:56:34] (03CR) 10Ladsgroup: [C: 03+2] Revert "ApiQueryBacklinksprop: Force the correct templatelinks index on read new" [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/792218 (owner: 10Ladsgroup) [14:59:04] (03CR) 10Ahmon Dancy: [C: 03+1] C:helm: make the group permissions on helm_cache configurable [puppet] - 10https://gerrit.wikimedia.org/r/791565 (https://phabricator.wikimedia.org/T305729) (owner: 10Jbond) [14:59:52] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10jcrespo) >>! In T308013#7931124, @jbond wrote: >>>! In T308013#7930591, @jcrespo wrote: >> Could you expand on why Apache 2 specifically (e.g. vs MIT or BSD?)- is it becau... [15:01:37] (03PS1) 10Ayounsi: Remove all mentions of netbox-dev2001 [puppet] - 10https://gerrit.wikimedia.org/r/792220 (https://phabricator.wikimedia.org/T296452) [15:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:03:34] zabe: yup, made a patch for it: https://gerrit.wikimedia.org/r/792221 [15:05:10] (03PS1) 10Andrew Bogott: nova_fullstack: place VM hostnames in lables['test_hostname'] [puppet] - 10https://gerrit.wikimedia.org/r/792223 [15:05:54] Thanks! 
[15:06:02] (03CR) 10Jcrespo: [C: 03+2] prometheus: Avoid warnings on the mysqld exporter config generator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791580 (owner: 10Jcrespo) [15:06:06] yw [15:07:00] (03CR) 10David Caro: [C: 03+1] nova_fullstack: place VM hostnames in lables['test_hostname'] [puppet] - 10https://gerrit.wikimedia.org/r/792223 (owner: 10Andrew Bogott) [15:07:02] (03CR) 10Jcrespo: [C: 03+1] proxysql: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792177 (https://phabricator.wikimedia.org/T308013) (owner: 10Ladsgroup) [15:09:02] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:09:13] (03CR) 10Jcrespo: [C: 03+1] "For context, this is a very basic module I started from scratch doing for mw multi-dc work, and was used for testing, but I didn't finish " [puppet] - 10https://gerrit.wikimedia.org/r/792177 (https://phabricator.wikimedia.org/T308013) (owner: 10Ladsgroup) [15:10:01] (03CR) 10Jcrespo: [C: 03+2] "I will merge this now." [puppet] - 10https://gerrit.wikimedia.org/r/792171 (https://phabricator.wikimedia.org/T308013) (owner: 10Ladsgroup) [15:10:34] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] wmf-netbox: only return the MTU if the interface is enabled [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792181 (owner: 10Ayounsi) [15:10:48] (03Abandoned) 10Stang: commonswiki: Add *.toolforge.org to wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791059 (https://phabricator.wikimedia.org/T78167) (owner: 10Stang) [15:12:09] (03CR) 10Ayounsi: "This one and parents were tested locally with Ie157d2687c7d202cff53b7a599de56a9dbd25e90 and childs." 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792193 (owner: 10Ayounsi) [15:12:19] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] wmf-netbox: create _get_junos_interfaces [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792186 (owner: 10Ayounsi) [15:12:32] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] wmf-netbox: prefix disabled interfaces with DISABLED [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792193 (owner: 10Ayounsi) [15:13:40] (NodeTextfileStale) resolved: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:15:06] (03CR) 10David Caro: [C: 03+2] P:openstack::puppetmaster: add 8143 to ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/788761 (owner: 10Majavah) [15:16:09] (03Merged) 10jenkins-bot: Revert "ApiQueryBacklinksprop: Force the correct templatelinks index on read new" [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/792218 (owner: 10Ladsgroup) [15:16:13] (03CR) 10David Caro: [C: 03+2] P:openstack::encapi: add keystone token verification [puppet] - 10https://gerrit.wikimedia.org/r/785134 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [15:16:32] (03PS5) 10David Caro: P:openstack::encapi: add keystone token verification [puppet] - 10https://gerrit.wikimedia.org/r/785134 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [15:16:40] (NodeTextfileStale) resolved: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:17:11] (03CR) 10Razzi: [C: 03+2] dhcpd: downgrade an-tool1005 to stretch, upgrade an-tool1007 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/791705 
(https://phabricator.wikimedia.org/T301990) (owner: 10Razzi) [15:18:33] !log rebooting pfw3[a-b]-eqiad for Junos upgrade [15:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:46] (03CR) 10jerkins-bot: [V: 04-1] P:openstack::encapi: add keystone token verification [puppet] - 10https://gerrit.wikimedia.org/r/785134 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [15:20:12] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64700/IPv4: Connect - frack-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:21:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:32] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64700/IPv4: Connect - frack-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:22:21] !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: update homer wmf-netbox plugin - ayounsi@cumin1001 [15:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:51] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/792220 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [15:23:36] (03PS6) 10Majavah: P:openstack::encapi: add keystone token verification [puppet] - 10https://gerrit.wikimedia.org/r/785134 (https://phabricator.wikimedia.org/T274666) [15:23:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:23:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:59] !log ayounsi@cumin1001 END (PASS) - Cookbook 
sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: update homer wmf-netbox plugin - ayounsi@cumin1001 [15:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:24:24] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35289/console" [puppet] - 10https://gerrit.wikimedia.org/r/791615 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [15:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:44] (03CR) 10David Caro: [C: 03+2] P:openstack::encapi: add keystone token verification [puppet] - 10https://gerrit.wikimedia.org/r/785134 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [15:25:54] PROBLEM - BGP status on pfw3-codfw is CRITICAL: BGP CRITICAL - AS64700/IPv4: Idle - frack-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:27:52] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:28:16] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:28:49] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] Export exim queue length from mx and lists [puppet] - 10https://gerrit.wikimedia.org/r/791615 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [15:29:05] (03PS2) 10Filippo Giunchedi: Export exim queue length from mx and lists [puppet] - 10https://gerrit.wikimedia.org/r/791615 (https://phabricator.wikimedia.org/T305847) [15:29:12] (03CR) 
10Ayounsi: [C: 03+2] Remove all mentions of netbox-dev2001 [puppet] - 10https://gerrit.wikimedia.org/r/792220 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [15:29:14] (03PS1) 10Andrew Bogott: nova_fullstack: annotate leaked VMs with the failure reason [puppet] - 10https://gerrit.wikimedia.org/r/792226 [15:29:26] (03PS2) 10Ayounsi: Remove all mentions of netbox-dev2001 [puppet] - 10https://gerrit.wikimedia.org/r/792220 (https://phabricator.wikimedia.org/T296452) [15:30:05] jan_drewniak: That opportune time is upon us again. Time for a Wikimedia Portals Update deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220516T1530). [15:30:19] (03PS1) 10Jbond: P:swift::storage: Use disk_type to identify swift disks [puppet] - 10https://gerrit.wikimedia.org/r/792227 (https://phabricator.wikimedia.org/T300057) [15:31:24] (03CR) 10jerkins-bot: [V: 04-1] P:swift::storage: Use disk_type to identify swift disks [puppet] - 10https://gerrit.wikimedia.org/r/792227 (https://phabricator.wikimedia.org/T300057) (owner: 10Jbond) [15:35:27] (03PS2) 10Jbond: P:swift::storage: Use disk_type to identify swift disks [puppet] - 10https://gerrit.wikimedia.org/r/792227 (https://phabricator.wikimedia.org/T300057) [15:38:41] (03PS3) 10Jbond: P:swift::storage: Use disk_type to identify swift disks [puppet] - 10https://gerrit.wikimedia.org/r/792227 (https://phabricator.wikimedia.org/T300057) [15:38:48] (03PS1) 10David Caro: Revert "P:openstack::encapi: add keystone token verification" [puppet] - 10https://gerrit.wikimedia.org/r/792228 (https://phabricator.wikimedia.org/T274666) [15:39:05] (03CR) 10Majavah: [C: 03+1] Revert "P:openstack::encapi: add keystone token verification" [puppet] - 10https://gerrit.wikimedia.org/r/792228 (https://phabricator.wikimedia.org/T274666) (owner: 10David Caro) [15:39:06] !log ayounsi@cumin1001 START - Cookbook sre.hosts.decommission for hosts netbox2001-dev.wikimedia.org [15:39:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:27] (03CR) 10David Caro: [C: 03+2] Revert "P:openstack::encapi: add keystone token verification" [puppet] - 10https://gerrit.wikimedia.org/r/792228 (https://phabricator.wikimedia.org/T274666) (owner: 10David Caro) [15:39:35] (03CR) 10Jdlrobson: [C: 03+1] bnwikivoyage: Set $wgRelatedArticlesUseCirrusSearch to true on bnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791734 (https://phabricator.wikimedia.org/T307904) (owner: 10Yahya) [15:39:38] (03PS5) 10Jdlrobson: bnwikivoyage: Set $wgRelatedArticlesUseCirrusSearch to true on bnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791734 (https://phabricator.wikimedia.org/T307904) (owner: 10Yahya) [15:39:59] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792229 (https://phabricator.wikimedia.org/T128546) [15:40:59] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792229 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:41:06] (03PS4) 10Jbond: P:swift::storage: Use disk_type to identify swift disks [puppet] - 10https://gerrit.wikimedia.org/r/792227 (https://phabricator.wikimedia.org/T300057) [15:41:53] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35295/console" [puppet] - 10https://gerrit.wikimedia.org/r/792227 (https://phabricator.wikimedia.org/T300057) (owner: 10Jbond) [15:42:03] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792229 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:42:11] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [15:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:29] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:44:29] !log 
ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts netbox2001-dev.wikimedia.org [15:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:00] PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-exim-queue.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:07] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:792229| Bumping portals to master (T128546)]] (duration: 00m 50s) [15:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:13] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:46:59] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:792229| Bumping portals to master (T128546)]] (duration: 00m 51s) [15:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:10] !log ayounsi@cumin1001 START - Cookbook sre.hosts.decommission for hosts netbox-dev2001.wikimedia.org [15:47:10] I'll take a look on mx2001 [15:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:25] (03CR) 10David Caro: [C: 04-1] "We might have a fix, things are temporarily working for now, will merge if we can't keep them working or abandon if we find a permanent fi" [puppet] - 10https://gerrit.wikimedia.org/r/792228 (https://phabricator.wikimedia.org/T274666) (owner: 10David Caro) [15:49:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:50:44] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:56] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [15:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:01] 10SRE, 10SRE-Access-Requests, 10Scap: Add new user identity to Keyholder for scap - https://phabricator.wikimedia.org/T307351 (10jnuche) 05Open→03Resolved It's working now. The Keyholder proxy needed to be restarted on the deployment servers :) Thank you @Volans and @RLazarus [15:52:33] dcaro: I knew I missed sth in the exim-queue refactor :( fails with no frozen messages, fixing shortly [15:53:13] oh, okok [15:53:24] thanks for the heads up [15:53:46] (03PS1) 10Elukey: Reduce the scope of Calico's global BGP Peers for ml-serve-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/792232 (https://phabricator.wikimedia.org/T308418) [15:54:30] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:54:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:02] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:56:24] RECOVERY - BGP status on pfw3-codfw is OK: BGP OK - up: 5, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:56:46] (03PS2) 10Filippo Giunchedi: mediawiki: remove idle php-fpm workers alert, moved to prometheus/alertmanager [puppet] - 
10https://gerrit.wikimedia.org/r/791360 (https://phabricator.wikimedia.org/T305847) [15:56:48] (03PS1) 10Filippo Giunchedi: prometheus: fix no matches for node-exim-queue [puppet] - 10https://gerrit.wikimedia.org/r/792233 (https://phabricator.wikimedia.org/T305847) [15:58:29] (03Abandoned) 10David Caro: Revert "P:openstack::encapi: add keystone token verification" [puppet] - 10https://gerrit.wikimedia.org/r/792228 (https://phabricator.wikimedia.org/T274666) (owner: 10David Caro) [15:59:39] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:59:40] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netbox-dev2001.wikimedia.org [15:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:51] (03CR) 10David Caro: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/792233 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [15:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:36] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: fix no matches for node-exim-queue [puppet] - 10https://gerrit.wikimedia.org/r/792233 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [16:00:42] (03PS2) 10Filippo Giunchedi: prometheus: fix no matches for node-exim-queue [puppet] - 10https://gerrit.wikimedia.org/r/792233 (https://phabricator.wikimedia.org/T305847) [16:01:04] (03CR) 10Filippo Giunchedi: [V: 03+2] prometheus: fix no matches for node-exim-queue [puppet] - 10https://gerrit.wikimedia.org/r/792233 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [16:06:50] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (10Papaul) Junos upgrade complete in Eqiad. 
[16:08:33] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (10Papaul) 05Open→03Resolved [16:16:34] (03PS1) 10Jbond: dhcp: DHCPConfOpt82 media_type parameter [software/spicerack] - 10https://gerrit.wikimedia.org/r/792238 [16:17:57] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10bcampbell) Hey @Dzahn I'm just following up to confirm that Advancement approved the plan, so let's p... [16:20:46] RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:02] PROBLEM - Host db2083 is DOWN: PING CRITICAL - Packet loss = 100% [16:23:23] jynus, Amir1: expected in any way? ^^^ [16:23:35] S8 replica AFAICT [16:23:46] RECOVERY - Host db2083 is UP: PING OK - Packet loss = 0%, RTA = 35.28 ms [16:23:46] let me check [16:24:38] it didn't reboot at least :D [16:24:48] it's probably nic [16:24:52] let me check [16:24:52] network, switch maybe? [16:27:36] Created T308454 [16:27:37] T308454: db2083 network issue - https://phabricator.wikimedia.org/T308454 [16:28:22] my guess is loose cable [16:28:40] could be- sending a check for papaul maybe? [16:29:08] there was maintenance on some switches, but I think it was on eqiad [16:31:28] 10ops-codfw, 10DBA: db2083 network issue - https://phabricator.wikimedia.org/T308454 (10Ladsgroup) It might be a loose cable. Can you check please? [16:33:18] Amir1: is it back up? [16:33:32] papaul: yes but I'm worried it might flap [16:33:37] 10ops-codfw, 10DBA: db2083 network issue - https://phabricator.wikimedia.org/T308454 (10Marostegui) @Papaul can you review its cable? 
Thanks [16:33:51] papaul: yes, it was down only for 5 minutes [16:33:52] Amir1: will check cable [16:34:11] the worry is if it will happen again- trying to find a cause? [16:34:12] if it's known (e.g. maint ongoing), then ignore it ^^ [16:34:14] if down for 5 minutes it is cable will check [16:35:02] (03PS2) 10Andrew Bogott: nova_fullstack: annotate leaked VMs with the failure reason [puppet] - 10https://gerrit.wikimedia.org/r/792226 [16:35:02] Thanks papaul ! [16:39:00] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:39:03] (03CR) 104nn1l2: [C: 03+1] votewiki: Change wgLanguageCode to zh for May 2022 zhwiki admin election [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791797 (https://phabricator.wikimedia.org/T308397) (owner: 10Stang) [16:39:05] (03PS1) 10Jbond: sre.host.pxe: Cookbookk to configure dhcp option82 and reboot into pxe [cookbooks] - 10https://gerrit.wikimedia.org/r/792251 [16:39:28] (03PS2) 10Jbond: sre.host.pxe: Cookbook to configure dhcp option82 and reboot into pxe [cookbooks] - 10https://gerrit.wikimedia.org/r/792251 [16:40:18] (03CR) 10Vgutierrez: [C: 03+1] prometheus: remove http availability pages, moved to prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790671 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [16:40:44] (03CR) 10Andrew Bogott: [C: 03+2] nova_fullstack: annotate leaked VMs with the failure reason [puppet] - 10https://gerrit.wikimedia.org/r/792226 (owner: 10Andrew Bogott) [16:45:10] (03PS1) 10Zabe: libraryupgrader: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792253 (https://phabricator.wikimedia.org/T308013) [16:51:32] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, and 3 others: Access to trusted gitlab 
runners for gitlab-roots (or appropriate similar group) - https://phabricator.wikimedia.org/T308350 (10thcipriani) >>! In T308350#7930458, @jbond wrote: >>>! In T308350#7928087, @thcipriani wrot... [16:52:59] jouncebot: nowandnext [16:52:59] No deployments scheduled for the next 0 hour(s) and 7 minute(s) [16:52:59] In 0 hour(s) and 7 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220516T1700) [16:58:14] 10ops-codfw, 10DBA: db2083 network issue - https://phabricator.wikimedia.org/T308454 (10Papaul) 05Open→03Resolved a:03Papaul @Marostegui @Ladsgroup I reseat the cable. If this happen again I will change the cable. Thanks [17:00:04] ryankemper: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220516T1700). [17:00:10] (03CR) 10jerkins-bot: [V: 04-1] sre.host.pxe: Cookbook to configure dhcp option82 and reboot into pxe [cookbooks] - 10https://gerrit.wikimedia.org/r/792251 (owner: 10Jbond) [17:00:38] 10SRE, 10LDAP-Access-Requests: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (10Dmantena) > Before I resolve this ticket, though -- I notice you said you currently have shell access, but I wasn't able to locate your account. > > If you do have SSH access on WMF production se... 
[17:05:58] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2022:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:10:15] 10ops-codfw, 10Traffic: codfw: cp2038 Correctable memory error on DIMM A3 - https://phabricator.wikimedia.org/T308459 (10Papaul) [17:10:19] (03CR) 10Volans: "question inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/792238 (owner: 10Jbond) [17:10:27] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, and 3 others: Access to trusted gitlab runners for gitlab-roots (or appropriate similar group) - https://phabricator.wikimedia.org/T308350 (10jbond) >>! In T308350#7931549, @thcipriani wrote: >>> @lmata/@MoritzMuehlenhoff Can you ad... [17:10:49] 10ops-codfw, 10Traffic: codfw: cp2038 Correctable memory error on DIMM A3 - https://phabricator.wikimedia.org/T308459 (10Papaul) p:05Triage→03Medium [17:11:00] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:22:02] ACKNOWLEDGEMENT - Host ms-be1059 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T307667 [17:24:08] ACKNOWLEDGEMENT - Host an-tool1007 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T301990 [17:25:30] !log ACKIng again all unhandled CRIT alerts on hosts with "dev" in their name - (imho dev hosts should not have prod CRIT alerts?) 
[17:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:05] (03CR) 10Dzahn: [C: 04-1] "looks to me like it's used in modules/scap/manifests/master.pp" [puppet] - 10https://gerrit.wikimedia.org/r/792121 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [17:31:46] (03CR) 10Dzahn: [C: 04-1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/35296/" [puppet] - 10https://gerrit.wikimedia.org/r/792121 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [17:38:53] (03PS4) 10JMeybohm: Replace kubeyaml with kubeconform (if available) [deployment-charts] - 10https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165) [17:43:33] jouncebot: nowandenxt [17:43:39] jouncebot: nowandnext [17:43:40] No deployments scheduled for the next 2 hour(s) and 16 minute(s) [17:43:40] In 2 hour(s) and 16 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220516T2000) [17:47:16] (03PS1) 10Ladsgroup: Revert "Revert "ApiQueryBacklinksprop: Force the correct templatelinks index on read new"" [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/792138 [17:47:28] (03CR) 10Ladsgroup: [C: 03+2] Revert "Revert "ApiQueryBacklinksprop: Force the correct templatelinks index on read new"" [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/792138 (owner: 10Ladsgroup) [17:55:00] (03PS4) 10BCornwall: cli: Add support for XDG Base Directory spec [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 [17:59:52] (03CR) 10jerkins-bot: [V: 04-1] cli: Add support for XDG Base Directory spec [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 (owner: 10BCornwall) [18:02:39] (03PS5) 10BCornwall: cli: Add support for XDG Base Directory spec [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 [18:05:49] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "ApiQueryBacklinksprop: Force the correct templatelinks index on 
read new"" [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/792138 (owner: 10Ladsgroup) [18:06:19] (03Merged) 10jenkins-bot: Revert "Revert "ApiQueryBacklinksprop: Force the correct templatelinks index on read new"" [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/792138 (owner: 10Ladsgroup) [18:10:17] (03CR) 10BCornwall: "This is just a QoL improvement and not a particularly important change." [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 (owner: 10BCornwall) [18:10:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:11:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:41] (03PS1) 10Ssingh: cescout: remove obsolete class [puppet] - 10https://gerrit.wikimedia.org/r/792265 [18:14:22] (03PS1) 10Ladsgroup: ApiQueryBacklinksprop: Make sure the index setting exists [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/792140 (https://phabricator.wikimedia.org/T306673) [18:14:31] (03CR) 10Ladsgroup: [C: 03+2] ApiQueryBacklinksprop: Make sure the index setting exists [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/792140 (https://phabricator.wikimedia.org/T306673) (owner: 10Ladsgroup) [18:15:40] (03PS1) 10Ebernhardson: rdf query service: Apply WARN log level only to com.bigdata [puppet] - 10https://gerrit.wikimedia.org/r/792266 (https://phabricator.wikimedia.org/T306899) 
[18:16:03] (03PS5) 10Ahmon Dancy: mediawiki 0.2.0: Add mw.localmemcached.enabled value [deployment-charts] - 10https://gerrit.wikimedia.org/r/764919 [18:19:10] (03PS1) 10Ebernhardson: Revert "cirrus: Turn on AB test of wbsearchentities profiles" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792141 [18:19:16] (03CR) 10Ahmon Dancy: mediawiki 0.2.0: Add mw.localmemcached.enabled value (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764919 (owner: 10Ahmon Dancy) [18:22:47] (03PS5) 10JMeybohm: Replace kubeyaml with kubeconform (if available) [deployment-charts] - 10https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165) [18:22:49] (03PS1) 10JMeybohm: Remove null creationTimestamp from CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/792267 (https://phabricator.wikimedia.org/T306165) [18:23:22] (03PS2) 10Ebernhardson: Revert "cirrus: Turn on AB test of wbsearchentities profiles" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792141 (https://phabricator.wikimedia.org/T306644) [18:26:42] (03PS2) 10Jbond: dhcp: DHCPConfOpt82 media_type parameter [software/spicerack] - 10https://gerrit.wikimedia.org/r/792238 [18:33:24] (03CR) 10Jbond: dhcp: DHCPConfOpt82 media_type parameter (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/792238 (owner: 10Jbond) [18:34:19] (03CR) 10jerkins-bot: [V: 04-1] ApiQueryBacklinksprop: Make sure the index setting exists [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/792140 (https://phabricator.wikimedia.org/T306673) (owner: 10Ladsgroup) [18:37:40] (03Merged) 10jenkins-bot: ApiQueryBacklinksprop: Make sure the index setting exists [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/792140 (https://phabricator.wikimedia.org/T306673) (owner: 10Ladsgroup) [18:37:57] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:42:08] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.10/includes/api/ApiQueryBacklinksprop.php: Backport: [[gerrit:792140|ApiQueryBacklinksprop: Make sure the index setting exists (T306673)]] (duration: 00m 50s) [18:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:15] T306673: Turn on read new for templatelinks on beta and production - https://phabricator.wikimedia.org/T306673 [18:42:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:43:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:36] 10SRE, 10SRE-Access-Requests, 10Phabricator, 10Release-Engineering-Team: Add Antoine Musso to Phabricator hosts - https://phabricator.wikimedia.org/T308478 (10hashar) [18:50:18] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:53:42] (03PS1) 10Hashar: admin: add Antoine Musso to Phabricator hosts [puppet] - 10https://gerrit.wikimedia.org/r/792270 (https://phabricator.wikimedia.org/T308478) [18:55:01] 10SRE, 
10SRE-Access-Requests, 10Phabricator, 10Release-Engineering-Team, 10Patch-For-Review: Add Antoine Musso to Phabricator hosts - https://phabricator.wikimedia.org/T308478 (10hashar) [18:55:18] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:55:37] (03CR) 10Hashar: "Approval / process is on the linked task T308478" [puppet] - 10https://gerrit.wikimedia.org/r/792270 (https://phabricator.wikimedia.org/T308478) (owner: 10Hashar) [18:56:32] 10SRE, 10SRE-Access-Requests, 10Phabricator, 10Release-Engineering-Team, 10Patch-For-Review: Add Antoine Musso to Phabricator hosts - https://phabricator.wikimedia.org/T308478 (10thcipriani) > Team manager is @thcipriani Approved from my side! [19:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:18:38] (03PS3) 10Jbond: sre.host.pxe: Cookbook to configure dhcp option82 and reboot into pxe [cookbooks] - 10https://gerrit.wikimedia.org/r/792251 [19:22:12] (03PS1) 10Clare Ming: Deploy TOC A/B test to pilot wikis except frwiki, ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792272 (https://phabricator.wikimedia.org/T306607) [19:22:14] (03CR) 10jerkins-bot: [V: 04-1] sre.host.pxe: Cookbook to configure dhcp option82 and reboot into pxe [cookbooks] - 10https://gerrit.wikimedia.org/r/792251 (owner: 10Jbond) [19:23:00] (03CR) 10jerkins-bot: [V: 04-1] Deploy TOC A/B test to pilot wikis except frwiki, ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792272 (https://phabricator.wikimedia.org/T306607) (owner: 10Clare Ming) 
[19:23:30] (03PS2) 10Clare Ming: Deploy TOC A/B test to pilot wikis except frwiki, ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792272 (https://phabricator.wikimedia.org/T306607) [19:24:45] (03CR) 10Jdlrobson: Deploy TOC A/B test to pilot wikis except frwiki, ptwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792272 (https://phabricator.wikimedia.org/T306607) (owner: 10Clare Ming) [19:26:59] (03PS3) 10Clare Ming: Deploy TOC A/B test to pilot wikis except frwiki, ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792272 (https://phabricator.wikimedia.org/T306607) [19:29:26] (03CR) 10Volans: "LGTM, couple of nits/improvements/question inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/792238 (owner: 10Jbond) [19:34:17] (03CR) 10Dzahn: [C: 03+1] "looks good! "No hosts found matching `C:cescout`"" [puppet] - 10https://gerrit.wikimedia.org/r/792265 (owner: 10Ssingh) [19:38:44] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:39:24] (03CR) 10Ssingh: [C: 03+2] cescout: remove obsolete class [puppet] - 10https://gerrit.wikimedia.org/r/792265 (owner: 10Ssingh) [19:39:57] (03CR) 10Clare Ming: Deploy TOC A/B test to pilot wikis except frwiki, ptwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792272 (https://phabricator.wikimedia.org/T306607) (owner: 10Clare Ming) [19:42:42] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:45:21] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 
(10Dzahn) Thanks @bcampbell sounds good to me! I'll expect to remove the alias tomorrow on our side, aft... [19:47:42] (NodeTextfileStale) resolved: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:57:27] (03PS3) 10Sergio Gimeno: GrowthExperiments: Update campaigns benefit list config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792149 (https://phabricator.wikimedia.org/T305659) [19:57:38] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:58:16] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:00:04] RoanKattouw, Urbanecm, and cjming: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220516T2000). Please do the needful. [20:00:04] koi, sergi0, and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:39] Hi! 
[20:00:52] Hello [20:01:58] Hello, I can deploy [20:02:52] (03CR) 10Catrope: [C: 03+2] GrowthExperiments: Update campaigns benefit list config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792149 (https://phabricator.wikimedia.org/T305659) (owner: 10Sergio Gimeno) [20:04:07] (03Merged) 10jenkins-bot: GrowthExperiments: Update campaigns benefit list config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792149 (https://phabricator.wikimedia.org/T305659) (owner: 10Sergio Gimeno) [20:04:39] sergi0: Your patch is on mwdebug1002, please test [20:04:59] great, testing now [20:05:31] (03PS3) 10Catrope: thwikibooks: Enable import [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791725 (https://phabricator.wikimedia.org/T308374) (owner: 10Stang) [20:05:44] (03CR) 10Catrope: [C: 03+2] thwikibooks: Enable import [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791725 (https://phabricator.wikimedia.org/T308374) (owner: 10Stang) [20:06:40] (03Merged) 10jenkins-bot: thwikibooks: Enable import [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791725 (https://phabricator.wikimedia.org/T308374) (owner: 10Stang) [20:07:10] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/35299/" [puppet] - 10https://gerrit.wikimedia.org/r/788437 (https://phabricator.wikimedia.org/T307382) (owner: 10Dzahn) [20:07:14] I could not test this as lack of permission to import [20:07:18] \o [20:07:41] koi: No problem, I can see if the import sources come up [20:08:07] Your patch isn't ready for testing yet though, I'm waiting for Sergio to finish testing his patch first [20:08:28] ok [20:08:41] Ok, no luck, the deploy didn't fix what I was expecting :( [20:09:00] I think it's not due to the config change per-se but code related but you can revert [20:10:26] OK so you want this reverted? [20:10:31] ok, wait, cache issues!! 
[20:10:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:41] * RoanKattouw waits [20:10:51] Give me 1 more min [20:11:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:11:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:22] Ok, it's fine from my end. I didn't see any errors logged neither [20:12:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:44] Alright, syncing [20:14:52] !log catrope@deploy1002 Synchronized wmf-config: Config: [[gerrit:792149|GrowthExperiments: Update campaigns benefit list config (T305659)]] (duration: 00m 51s) [20:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:57] T305659: Account creation: Thank You Page landing pages - https://phabricator.wikimedia.org/T305659 [20:17:24] koi: Looking at yours now. I also can't test because I also don't have import rights on thwikibooks [20:17:55] hmm, maybe poke some sysadmin or steward? 
[20:18:21] I gave myself rights temporarily [20:19:14] OK looks like it works [20:20:20] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:791725|thwikibooks: Enable import (T308374)]] (duration: 00m 51s) [20:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:26] T308374: Enable import on thwikibooks - https://phabricator.wikimedia.org/T308374 [20:21:40] (03PS2) 10Catrope: yiwiktionary: Update desktop logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792192 (https://phabricator.wikimedia.org/T308411) (owner: 10Stang) [20:21:44] (03CR) 10Catrope: [C: 03+2] yiwiktionary: Update desktop logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792192 (https://phabricator.wikimedia.org/T308411) (owner: 10Stang) [20:22:06] (03PS3) 10Catrope: hewiktionary: Add localized mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792196 (https://phabricator.wikimedia.org/T308411) (owner: 10Stang) [20:22:33] (03Merged) 10jenkins-bot: yiwiktionary: Update desktop logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792192 (https://phabricator.wikimedia.org/T308411) (owner: 10Stang) [20:23:11] (03CR) 10Catrope: [C: 03+2] hewiktionary: Add localized mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792196 (https://phabricator.wikimedia.org/T308411) (owner: 10Stang) [20:23:18] (03PS2) 10Catrope: yiwiktionary: Add localized mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792197 (https://phabricator.wikimedia.org/T308411) (owner: 10Stang) [20:23:30] (03CR) 10Catrope: [C: 03+2] yiwiktionary: Add localized mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792197 (https://phabricator.wikimedia.org/T308411) (owner: 10Stang) [20:24:03] (03Merged) 10jenkins-bot: hewiktionary: Add localized mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792196 (https://phabricator.wikimedia.org/T308411) (owner: 
10Stang) [20:24:15] (03Merged) 10jenkins-bot: yiwiktionary: Add localized mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792197 (https://phabricator.wikimedia.org/T308411) (owner: 10Stang) [20:27:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:28:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:28] !log catrope@deploy1002 Synchronized static/images/project-logos/: Config: [[gerrit:792192|yiwiktionary: Update desktop logo (T308411)]] (duration: 00m 51s) [20:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:32] T308411: Add localized wordmarks to hewiktionary and yiwiktionary mobile frontend & update the logo of yiwiktionary - https://phabricator.wikimedia.org/T308411 [20:29:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:30] !log catrope@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:792192|yiwiktionary: Update desktop logo (T308411)]] (duration: 00m 51s) [20:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:33] koi: The yiwiktionary logo should be updated now, please test [20:33:34] tested and looks fine (desktop one) [20:33:36] !log catrope@deploy1002 Synchronized static/images/mobile/copyright/wiktionary-wordmark-he.svg: Config: [[gerrit:792196|hewiktionary: Add localized mobile wordmark (T308411)]] (duration: 00m 50s) [20:33:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:33] !log catrope@deploy1002 Synchronized static/images/mobile/copyright/wiktionary-wordmark-yi.svg: Config: [[gerrit:792197|yiwiktionary: Add localized mobile wordmark (T308411)]] (duration: 00m 49s) [20:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:38] T308411: Add localized wordmarks to hewiktionary and yiwiktionary mobile frontend & update the logo of yiwiktionary - https://phabricator.wikimedia.org/T308411 [20:36:29] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:792197|yiwiktionary: Add localized mobile wordmark (T308411)]] and [[gerrit:792196|hewiktionary: Add localized mobile wordmark (T308411)]] (duration: 00m 50s) [20:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:18] koi: And the he/yi wordmarks should be updated now too [20:37:38] (03PS3) 10Catrope: Revert "cirrus: Turn on AB test of wbsearchentities profiles" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792141 (https://phabricator.wikimedia.org/T306644) (owner: 10Ebernhardson) [20:37:46] (03CR) 10Catrope: [C: 03+2] Revert "cirrus: Turn on AB test of wbsearchentities profiles" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792141 (https://phabricator.wikimedia.org/T306644) (owner: 10Ebernhardson) [20:38:04] (03CR) 10Hashar: [C: 03+1] ":)" [puppet] - 10https://gerrit.wikimedia.org/r/791565 (https://phabricator.wikimedia.org/T305729) (owner: 10Jbond) [20:38:32] (03Merged) 10jenkins-bot: Revert "cirrus: Turn on AB test of wbsearchentities profiles" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792141 (https://phabricator.wikimedia.org/T306644) (owner: 10Ebernhardson) [20:39:00] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:39:05] ebernhardson: Your patch is on mwdebug1002, please test (if possible) [20:39:56] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:40:27] RoanKattouw: looks good [20:41:31] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:792141|Revert "cirrus: Turn on AB test of wbsearchentities profiles" (T306644)]] (duration: 00m 51s) [20:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:36] T306644: re-run wbsearchentities optimization process - https://phabricator.wikimedia.org/T306644 [20:42:52] All done! [20:44:37] (03Abandoned) 10Hashar: Switch from extension to plugin API [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791647 (owner: 10Hashar) [20:44:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:08] (03PS2) 10Hashar: Json schema from Gerrit Java event classes [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791642 (https://phabricator.wikimedia.org/T304947) [20:45:12] (03PS2) 10Hashar: Add SonarQube scanner [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791692 [20:45:31] (03CR) 10jerkins-bot: [V: 04-1] Json schema from Gerrit Java event classes [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791642 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar) [20:45:35] (03CR) 10jerkins-bot: [V: 04-1] Add SonarQube scanner [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791692 (owner: 10Hashar) [20:45:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:45:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START 
helmfile.d/services/mwdebug: apply [20:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:41] 10SRE, 10LDAP-Access-Requests: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (10RLazarus) 05Open→03Resolved a:03RLazarus Great, thanks! [20:58:15] 10SRE, 10LDAP-Access-Requests: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (10RLazarus) [21:00:05] Reedy, sbassett, Maryum, and manfredi: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220516T2100). [21:00:19] ^ no sec patches to deploy today that I'm aware of... 
[21:05:34] !log gerrit2002 (in setup) - rebooting [21:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:13] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2022:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:08:08] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [21:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:36] (03PS1) 10Dzahn: add gerrit-replica-new secondary/service IPs [dns] - 10https://gerrit.wikimedia.org/r/792281 (https://phabricator.wikimedia.org/T243027) [21:14:47] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:10] (03CR) 10jerkins-bot: [V: 04-1] add gerrit-replica-new secondary/service IPs [dns] - 10https://gerrit.wikimedia.org/r/792281 (https://phabricator.wikimedia.org/T243027) (owner: 10Dzahn) [21:16:10] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:18:36] (03PS2) 10Dzahn: add gerrit-replica-new secondary/service IPs [dns] - 10https://gerrit.wikimedia.org/r/792281 (https://phabricator.wikimedia.org/T243027) [21:26:23] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [21:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:02] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:32] (03PS1) 10Zabe: zookeeper: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792282 (https://phabricator.wikimedia.org/T308013) [21:39:35] (03PS1) 10Zabe: codesearch: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792284 
(https://phabricator.wikimedia.org/T308013) [21:41:14] PROBLEM - puppet last run on wdqs1009 is CRITICAL: CRITICAL: Puppet last ran 2 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:44:12] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [21:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:57] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:50] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:52:50] RECOVERY - puppet last run on wdqs1009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:53:03] (03CR) 10Dzahn: "Did you use the automatic method?" [puppet] - 10https://gerrit.wikimedia.org/r/792284 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [21:53:35] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: ganeti4002 dimm error - https://phabricator.wikimedia.org/T303318 (10RobH) Shipped and arrived via 559967799450, opened 00781129 for the shipment and will go down this week to swap it out. [21:59:31] 10SRE, 10Gerrit, 10serviceops, 10Patch-For-Review: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10Dzahn) >>! In T243027#7732585, @hashar wrote: > To replace the server, we can add the new one as a 2nd replica and have the repositories replicated there (should take a few hours at... 
[22:01:14] (03Abandoned) 10Dzahn: add gerrit-replica-new secondary/service IPs [dns] - 10https://gerrit.wikimedia.org/r/792281 (https://phabricator.wikimedia.org/T243027) (owner: 10Dzahn) [22:04:27] (03CR) 10Zabe: codesearch: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792284 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [22:04:29] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10RobH) Mortiz, I flashed updates to ganeti4002, but it reminded me I need to go onsite this Thursday for T303318 to swap the defective memory (existing open case before the warranty expired) [22:06:00] 10SRE, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (10Dzahn) 05Resolved→03Open Hi @Papaul gerrit2001 is in D5 and the new server gerrit2002 is in B5. Due to the way we want to migrate (T243027#7732585) we need a DNS name... [22:07:49] (03Abandoned) 10Dzahn: gitlab: license module files with SPDX-License-Identifier: Apache-2.0 [puppet] - 10https://gerrit.wikimedia.org/r/790743 (https://phabricator.wikimedia.org/T308013) (owner: 10Dzahn) [22:09:58] PROBLEM - Exim SMTP on mx2001 is CRITICAL: connect to address 208.80.153.45 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [22:13:02] jhathaway: ^ you seem to be on it? 
[22:13:24] mutante: yup, on it thanks, sorry will downtime [22:13:37] ok, thanks [22:14:05] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx2001.wikimedia.org with reason: exim debugging [22:14:06] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx2001.wikimedia.org with reason: exim debugging [22:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:34] RECOVERY - Exim SMTP on mx2001 is OK: OK - Certificate mx1001.wikimedia.org will expire on Sun 03 Jul 2022 10:03:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [22:17:20] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:26:37] 10SRE, 10Gerrit, 10serviceops, 10Patch-For-Review: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10Dzahn) reopened T299575 to move the host around [22:38:50] 10SRE, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (10Papaul) @Dzahn any reason this information was not provided to us during the initial creation of this task and during the installation process? [22:39:32] 10SRE, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (10Dzahn) @Papaul Yes, it was unknown.
[23:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:32:46] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10bcampbell) Sounds good, thanks @Dzahn. I'll follow up here tomorrow when the ITS tasks are done (arou...