[00:12:39] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:20:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [00:25:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [00:26:53] RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:30:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:35:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [00:38:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:40:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [01:14:27] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:30:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [01:35:41] RECOVERY - Disk space on dumpsdata1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops [01:37:45] (JobUnavailable) firing: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [01:42:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [01:46:57] PROBLEM - Check systemd state on dumpsdata1003 is CRITICAL: CRITICAL - degraded: The following units failed: cleanup_xmldumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220320T0700) [02:00:05] Deploy window Automatic 🚂🧪Trainsperiment Week branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220321T0200) [02:00:35] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:07:39] (PS1) TrainBranchBot: Branch commit for wmf/1.39.0-wmf.1 [core] (wmf/1.39.0-wmf.1) - https://gerrit.wikimedia.org/r/772054 [02:07:41] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.39.0-wmf.1 [core] (wmf/1.39.0-wmf.1) - https://gerrit.wikimedia.org/r/772054 (owner: TrainBranchBot) [02:21:58] (CR) jerkins-bot: [V: -1] Branch commit for wmf/1.39.0-wmf.1 [core] (wmf/1.39.0-wmf.1) - https://gerrit.wikimedia.org/r/772054 (owner: TrainBranchBot) [03:35:43] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:54:49] PROBLEM - SSH on thumbor2003.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:15:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [04:30:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service 
- https://alerts.wikimedia.org [04:35:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [04:36:31] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [04:55:09] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:55:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [04:56:35] RECOVERY - SSH on thumbor2003.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:00:16] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [05:03:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2004:9193 is burning free allocators at a very high rate - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [05:06:11] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:08:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [05:12:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [05:13:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [05:17:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [05:25:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [05:30:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [05:39:19] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:40:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [05:43:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [05:43:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance [05:43:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance [05:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T297189)', diff saved to 
https://phabricator.wikimedia.org/P22858 and previous config saved to /var/cache/conftool/dbconfig/20220321-054358-marostegui.json [05:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:44:02] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [05:48:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [05:48:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [05:48:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [05:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:48:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T298557)', diff saved to https://phabricator.wikimedia.org/P22859 and previous config saved to /var/cache/conftool/dbconfig/20220321-054838-marostegui.json [05:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:48:42] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [05:50:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [05:52:03] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Depool db1175 reimage T300600', diff saved to https://phabricator.wikimedia.org/P22860 and previous config saved to /var/cache/conftool/dbconfig/20220321-055202-marostegui.json [05:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:52:07] T300600: Upgrade s3 to Bullseye - https://phabricator.wikimedia.org/T300600 [05:52:19] !log dbmaint s5@eqiad T300600 [05:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:52:57] (PS1) Marostegui: db1175: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/772058 (https://phabricator.wikimedia.org/T300600) [05:53:31] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [05:53:49] (CR) Marostegui: [C: +2] db1175: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/772058 (https://phabricator.wikimedia.org/T300600) (owner: Marostegui) [05:54:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1175.eqiad.wmnet with OS bullseye [05:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [05:55:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - 
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [06:00:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [06:05:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [06:07:51] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:19:01] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1175.eqiad.wmnet with OS bullseye [06:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1175.eqiad.wmnet with OS bullseye [06:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [06:24:30] ops-eqiad: db1175 not booting up - https://phabricator.wikimedia.org/T304280 (Marostegui) [06:24:36] ops-eqiad: db1175 not booting up - https://phabricator.wikimedia.org/T304280 (Marostegui) p:Triage→High [06:43:39] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298557)', diff saved to https://phabricator.wikimedia.org/P22861 and previous config saved to /var/cache/conftool/dbconfig/20220321-064339-marostegui.json [06:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:44] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [06:43:45] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1175.eqiad.wmnet with OS bullseye [06:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [06:50:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [06:58:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P22862 and previous config saved to /var/cache/conftool/dbconfig/20220321-065844-marostegui.json [06:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:05] Amir1, awight, Urbanecm, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220321T0700). 
[07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:15] * kart_ is here [07:00:31] hey! [07:00:33] i can deploy today [07:00:45] urbanecm: cool. Thanks! I'll do some testing. [07:00:59] urbanecm: Please log command like last week. [07:01:00] kart_: I'm not sure how you wish to test tables? [07:01:27] urbanecm: by checking if everything works in Production wrt CX. [07:01:33] okay [07:04:07] kart_: so, I'm going to create `cx_significant_edits` and `cx_section_translation` in the `wikishared` x1 DB, is that correct? [07:04:51] Yes. Correct [07:04:54] doing [07:07:50] kart_: tables created [07:08:09] Thanks! [07:08:29] !log Create `wikishared.cx_significant_edits` and `wikishared.cx_section_translation` at x1 (T302371; `mwscript sql.php --wiki=aawiki --wikidb=wikishared --cluster=extension1 /srv/mediawiki-staging/php-1.38.0-wmf.26/extensions/ContentTranslation/sql/{section-translations,significant-edits}.sql)`) [07:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:33] T302371: Create new tables: cx_significant_edits and cx_section_translation - https://phabricator.wikimedia.org/T302371 [07:09:14] kart_: anything else? [07:09:31] Do we need to give --wiki argument for wikishared? [07:10:06] kart_: yes, because the mwscript wrapper expects one. In this case, it doesn't actually matter much [07:10:28] urbanecm: right. Thanks! [07:10:48] no problem :) [07:10:49] urbanecm: We are good. 
[07:12:23] great :) [07:12:25] !log UTC morning B&C done [07:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P22863 and previous config saved to /var/cache/conftool/dbconfig/20220321-071349-marostegui.json [07:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [07:20:46] (PS1) Urbanecm: Revert "ptwiki: Disable Growth's image recommendation" [mediawiki-config] - https://gerrit.wikimedia.org/r/771923 (https://phabricator.wikimedia.org/T304095) [07:21:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [07:28:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298557)', diff saved to https://phabricator.wikimedia.org/P22864 and previous config saved to /var/cache/conftool/dbconfig/20220321-072854-marostegui.json [07:28:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [07:28:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [07:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[07:28:58] (CR) Elukey: [V: +1 C: +2] profile::rsyslog: add new cabundle paths for omkafka [puppet] - https://gerrit.wikimedia.org/r/771905 (https://phabricator.wikimedia.org/T300130) (owner: Elukey) [07:29:00] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [07:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T298557)', diff saved to https://phabricator.wikimedia.org/P22865 and previous config saved to /var/cache/conftool/dbconfig/20220321-072902-marostegui.json [07:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P22866 and previous config saved to /var/cache/conftool/dbconfig/20220321-073033-root.json [07:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:59] (Abandoned) Elukey: istio: add the install-cni docker file [docker-images/production-images] - https://gerrit.wikimedia.org/r/767924 (https://phabricator.wikimedia.org/T297612) (owner: Elukey) [07:40:43] (Abandoned) Elukey: profile::logstash::beta: move to profile::base::certificate's truststore [puppet] - https://gerrit.wikimedia.org/r/763113 (https://phabricator.wikimedia.org/T300130) (owner: Elukey) [07:40:55] (PS4) Elukey: profile::logstash::production: use base truststore [puppet] - https://gerrit.wikimedia.org/r/763172 (https://phabricator.wikimedia.org/T300130) [07:43:45] SRE, observability, Patch-For-Review: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (elukey) >>! In T300130#7789376, @colewhite wrote: >>>! 
In T300130#7788811, @elukey wrote: >> @colewhite I was able to move the deployment-prep's kafka logging host to PKI,... [07:45:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P22867 and previous config saved to /var/cache/conftool/dbconfig/20220321-074538-root.json [07:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:04] (PS3) Elukey: Set overlay settings for kubernetes1006 [puppet] - https://gerrit.wikimedia.org/r/771602 (https://phabricator.wikimedia.org/T300744) [07:58:31] (CR) Kosta Harlan: [C: +1] Revert "ptwiki: Disable Growth's image recommendation" [mediawiki-config] - https://gerrit.wikimedia.org/r/771923 (https://phabricator.wikimedia.org/T304095) (owner: Urbanecm) [07:59:04] (CR) Elukey: [C: +2] Set overlay settings for kubernetes1006 [puppet] - https://gerrit.wikimedia.org/r/771602 (https://phabricator.wikimedia.org/T300744) (owner: Elukey) [08:00:05] dancy, hashar, brennen, dduvall, jeena, and jnuche: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) 🚂🧪Trainsperiment Week Deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220321T0800). [08:00:30] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:00:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P22868 and previous config saved to /var/cache/conftool/dbconfig/20220321-080042-root.json [08:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:21] train time! 
[08:03:26] (KubernetesCalicoDown) firing: kubernetes1006.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [08:04:07] this is me --^ [08:05:33] (PS2) Elukey: WIP - initial debianization [debs/istio] - https://gerrit.wikimedia.org/r/771670 [08:05:43] hmm. I see a wmf/1.39.0-wmf.1 on mediawiki/core, but it doesn't have the branch commit with submodules added [08:06:17] hashar: jnuche: ^^ did the branching process fail somehow? I think that's in releases-jenkins where I don't have access [08:09:12] (CR) Muehlenhoff: [C: +2] Enable profile::auto_restarts::service for ssh-gitlab [puppet] - https://gerrit.wikimedia.org/r/771362 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [08:09:57] !log restarting blazegraph on wdqs2004 and wdqs2002 (BlazegraphFreeAllocatorsDecreasingRapidly) [08:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:42] taavi: just checked the job and it seems it ran fine early today [08:14:34] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:15:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P22869 and previous config saved to /var/cache/conftool/dbconfig/20220321-081546-root.json [08:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:06] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:16:28] could you post the output log somewhere for me to take a look? 
[08:16:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [08:17:29] also, https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220321T0800 says we're deploying things at 8 UTC but https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Trainsperiment_week says 9 UTC [08:17:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298557)', diff saved to https://phabricator.wikimedia.org/P22870 and previous config saved to /var/cache/conftool/dbconfig/20220321-081735-marostegui.json [08:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:39] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [08:19:07] good morning [08:19:45] morning hashar! 
[08:19:47] !log installing openssl security updates [08:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:03] (CR) Filippo Giunchedi: [C: +2] alertmanager: use alert-specific link to dashboard [puppet] - https://gerrit.wikimedia.org/r/771892 (owner: Filippo Giunchedi) [08:20:16] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [08:21:39] (CR) Filippo Giunchedi: [C: +2] sre: add dashboard to network probes alerts [alerts] - https://gerrit.wikimedia.org/r/771883 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [08:21:46] hashar: the train branches seem to have been created correctly but we're missing the branch commit on mediawiki/core.git [08:23:39] !log restarting blazegraph on wdqs2003 (stuck for 16 hours) [08:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:44] taavi: yeah one of the CI job fails due to a npm cache corruption [08:30:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P22871 and previous config saved to /var/cache/conftool/dbconfig/20220321-083050-root.json [08:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:48] ahh now I see that too [08:32:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P22872 and previous config saved to /var/cache/conftool/dbconfig/20220321-083240-marostegui.json [08:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:59] PROBLEM - Query Service HTTP Port 
on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [08:40:44] dcausse: --^ o/ [08:41:22] elukey: just restarted it a couple minutes ago, this alert should resolve itself "soon" [08:41:49] nice thanks! [08:41:57] sorry just seen the restart in sal [08:43:28] np! :f [08:43:30] :) [08:43:42] !log Train blocked due to a npm checksum mismatch preventing CI from merging in the mediawiki/core 1.39.0-wmf.1 change which create the branch. T304286 [08:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:47] T304286: npm checksum mismatch for ProofreadPage npm dependency: openseadragon - https://phabricator.wikimedia.org/T304286 [08:43:57] RECOVERY - Query Service HTTP Port on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [08:44:34] (PS1) Muehlenhoff: Add Cumin alias for search-loader [puppet] - https://gerrit.wikimedia.org/r/772326 [08:46:42] SRE, Infrastructure-Foundations, serviceops: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (fgiunchedi) I wholeheartedly agree with the points made here, I'll add that as part of this quarter's work on the `monitoring` section of `service::catalo... 
[08:47:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P22873 and previous config saved to /var/cache/conftool/dbconfig/20220321-084745-marostegui.json [08:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:04] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for apache/orchestrator [puppet] - 10https://gerrit.wikimedia.org/r/772327 (https://phabricator.wikimedia.org/T135991) [08:59:20] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (10jcrespo) [09:00:25] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (10jcrespo) We will limit the scope for now of the task so we can have it done now- I think it also makes sense to semi-automatically add new wikis to backups... 
[09:02:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298557)', diff saved to https://phabricator.wikimedia.org/P22874 and previous config saved to /var/cache/conftool/dbconfig/20220321-090250-marostegui.json [09:02:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [09:02:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [09:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:55] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [09:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:11] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:06:32] sigh..., looking ^ [09:07:29] !log restarting FPM [09:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:31] hashar: indeed https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ProofreadPage/+/772346 seems to fix the issue [09:08:48] !log restarting blazegraph on wdqs2001 (stuck) [09:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:57] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:10:38] taavi: that is a revert isn't it? 
[09:10:47] yes [09:10:50] \o/ [09:12:15] (03PS3) 10Elukey: Set overlay settings for kubernetes1015 [puppet] - 10https://gerrit.wikimedia.org/r/771603 (https://phabricator.wikimedia.org/T300744) [09:13:27] can you +2 that? [09:18:29] (03CR) 10Marostegui: [C: 03+2] Enable profile::auto_restarts::service for apache/orchestrator [puppet] - 10https://gerrit.wikimedia.org/r/772327 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:19:04] (03PS1) 10Majavah: Revert "build: Update devDependencies" [extensions/ProofreadPage] (wmf/1.39.0-wmf.1) - 10https://gerrit.wikimedia.org/r/772347 [09:19:32] (03CR) 10Majavah: [C: 03+2] Revert "build: Update devDependencies" [extensions/ProofreadPage] (wmf/1.39.0-wmf.1) - 10https://gerrit.wikimedia.org/r/772347 (owner: 10Majavah) [09:20:40] dcausse: wdqs was alerting yesterday [09:21:48] (03CR) 10Elukey: [C: 03+2] Set overlay settings for kubernetes1015 [puppet] - 10https://gerrit.wikimedia.org/r/771603 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [09:22:14] RhinosF1: yes saw this, thanks, BlazegraphFreeAllocatorsDecreasingRapidly apparently, restarted 2 instances to fix this and seems to have fixed it since [09:23:39] dcausse: good [09:28:09] (03Abandoned) 10Cathal Mooney: Add ACL filter to Spine switch interface connecting CR routers Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/771461 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [09:32:13] PROBLEM - Puppet CA expired certs on puppetmaster1001 is CRITICAL: CRITICAL: 1 puppet certs need to be renewed: https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate [09:32:57] !log 1.39.0-wmf.1 train is delayed due to a CI / npm build failure which is being resolved T300203 [09:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:01] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - 
https://phabricator.wikimedia.org/T300203 [09:34:34] (03Merged) 10jenkins-bot: Revert "build: Update devDependencies" [extensions/ProofreadPage] (wmf/1.39.0-wmf.1) - 10https://gerrit.wikimedia.org/r/772347 (owner: 10Majavah) [09:35:32] (03PS2) 10Majavah: Branch commit for wmf/1.39.0-wmf.1 [core] (wmf/1.39.0-wmf.1) - 10https://gerrit.wikimedia.org/r/772054 (owner: 10TrainBranchBot) [09:35:48] (03CR) 10Majavah: [C: 03+2] Branch commit for wmf/1.39.0-wmf.1 [core] (wmf/1.39.0-wmf.1) - 10https://gerrit.wikimedia.org/r/772054 (owner: 10TrainBranchBot) [09:37:17] (03PS3) 10Hashar: Branch commit for wmf/1.39.0-wmf.1 [core] (wmf/1.39.0-wmf.1) - 10https://gerrit.wikimedia.org/r/772054 (https://phabricator.wikimedia.org/T300203) (owner: 10TrainBranchBot) [09:37:29] hashar: i already updated the branch commit :/ [09:40:14] (03CR) 10Majavah: [C: 03+2] Branch commit for wmf/1.39.0-wmf.1 [core] (wmf/1.39.0-wmf.1) - 10https://gerrit.wikimedia.org/r/772054 (https://phabricator.wikimedia.org/T300203) (owner: 10TrainBranchBot) [09:41:23] taavi: oh [09:41:44] taavi: jnuche and I are pairing on the train deployment this morning. 
He is new to releng and this is his onboarding to do train deployment [09:43:27] 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations: Management interface SSH icinga alerts - https://phabricator.wikimedia.org/T304289 (10Volans) p:05Triage→03Medium [09:46:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:46:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T298557)', diff saved to https://phabricator.wikimedia.org/P22875 and previous config saved to /var/cache/conftool/dbconfig/20220321-094614-marostegui.json [09:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:18] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [09:55:25] (03Merged) 10jenkins-bot: Branch commit for wmf/1.39.0-wmf.1 [core] (wmf/1.39.0-wmf.1) - 10https://gerrit.wikimedia.org/r/772054 (https://phabricator.wikimedia.org/T300203) (owner: 10TrainBranchBot) [09:56:39] finally :] [10:03:51] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for nginx/htmldumps [puppet] - 10https://gerrit.wikimedia.org/r/772335 (https://phabricator.wikimedia.org/T135991) [10:03:59] train is running! 
:] [10:07:41] (03Abandoned) 10DCausse: [wdqs] disable fetching constraints [puppet] - 10https://gerrit.wikimedia.org/r/664782 (https://phabricator.wikimedia.org/T274982) (owner: 10DCausse) [10:13:49] !log jnuche@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.1 refs T300203 [10:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:53] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [10:15:57] (KubernetesCalicoDown) resolved: kubernetes1015.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:21:56] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:22:50] downtime expired :) --^ [10:24:46] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:26:38] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:28:13] (03PS1) 10Giuseppe Lavagetto: Introduce requestctl [software/conftool] - 10https://gerrit.wikimedia.org/r/772341 [10:28:15] (03PS1) 10Giuseppe Lavagetto: [WiP] Introduce requestctl [software/conftool] - 10https://gerrit.wikimedia.org/r/772342 [10:29:06] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:30:00] (03PS2) 10Giuseppe Lavagetto: [WiP] Introduce requestctl [software/conftool] - 10https://gerrit.wikimedia.org/r/772342 [10:30:28] (03CR) 10jerkins-bot: [V: 04-1] [WiP] Introduce requestctl [software/conftool] - 10https://gerrit.wikimedia.org/r/772342 (owner: 10Giuseppe Lavagetto) [10:30:30] (03CR) 10jerkins-bot: [V: 04-1] Introduce requestctl [software/conftool] - 10https://gerrit.wikimedia.org/r/772341 (owner: 10Giuseppe Lavagetto) [10:31:26] (03PS1) 10Jcrespo: Initial commit: README, code of conduct and .gitreview [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772343 [10:32:00] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:32:38] (03Abandoned) 10Giuseppe Lavagetto: Introduce requestctl [software/conftool] - 10https://gerrit.wikimedia.org/r/772341 (owner: 10Giuseppe Lavagetto) [10:32:42] (03CR) 10jerkins-bot: [V: 04-1] [WiP] Introduce requestctl [software/conftool] - 10https://gerrit.wikimedia.org/r/772342 (owner: 10Giuseppe Lavagetto) [10:32:46] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Initial commit: README, code of conduct and .gitreview [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772343 (owner: 10Jcrespo) [10:33:22] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:35:12] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:36:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298557)', diff saved to https://phabricator.wikimedia.org/P22876 and previous config saved to /var/cache/conftool/dbconfig/20220321-103654-marostegui.json [10:36:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:01] (03PS1) 10Jcrespo: test [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772344 [10:37:01] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [10:37:17] (03Abandoned) 10Jcrespo: test [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772344 (owner: 10Jcrespo) [10:37:34] PROBLEM - Ensure local MW versions match expected deployment on mw1447 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:37:48] PROBLEM - Ensure local MW versions match expected deployment on mw1450 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:37:58] (03PS3) 10Elukey: Set overlay settings for kubernetes1016 [puppet] - 10https://gerrit.wikimedia.org/r/771604 (https://phabricator.wikimedia.org/T300744) [10:38:04] (03PS3) 10Giuseppe Lavagetto: [WiP] Introduce requestctl [software/conftool] - 10https://gerrit.wikimedia.org/r/772342 [10:38:48] PROBLEM - Ensure local MW versions match expected deployment on mw1418 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:38:58] PROBLEM - Ensure local MW versions match expected deployment on mw1416 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:38:58] <_joe_> what is going on with wikiversions? 
[10:38:58] PROBLEM - Ensure local MW versions match expected deployment on mw1449 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:39:10] PROBLEM - Ensure local MW versions match expected deployment on mw1448 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:39:14] _joe_: expected [10:39:20] It's happened the last few trains [10:39:48] PROBLEM - Ensure local MW versions match expected deployment on mw1417 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:39:56] (03CR) 10jerkins-bot: [V: 04-1] [WiP] Introduce requestctl [software/conftool] - 10https://gerrit.wikimedia.org/r/772342 (owner: 10Giuseppe Lavagetto) [10:40:10] (03CR) 10Cparle: [C: 03+1] Remove unused WikibaseMediaInfo & MediaSearch config vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737379 (owner: 10Matthias Mullie) [10:40:28] (03CR) 10Elukey: [C: 03+2] Set overlay settings for kubernetes1016 [puppet] - 10https://gerrit.wikimedia.org/r/771604 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [10:40:36] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:40:48] PROBLEM - Ensure local MW versions match expected deployment on deploy2002 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:41:14] PROBLEM - Ensure local MW versions match expected deployment on mw1415 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:41:40] PROBLEM - Ensure local MW versions match expected deployment on mw1313 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:41:42] PROBLEM - Ensure local MW versions match expected deployment on mw1414 is CRITICAL: CRITICAL: 
3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:42:32] PROBLEM - Ensure local MW versions match expected deployment on mw2289 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:42:34] RhinosF1: doesn't seem expected for how the alerting is designed. So either something has changed and the alerting needs tweaking or there is something unexpected IMHO [10:43:00] PROBLEM - Ensure local MW versions match expected deployment on mw1367 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:43:04] volans: yes I fully agree. I've had the same chat with mutante. [10:43:08] PROBLEM - Ensure local MW versions match expected deployment on mw1420 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:43:08] PROBLEM - Ensure local MW versions match expected deployment on mw1412 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:43:14] I think there's a task [10:43:23] <_joe_> there is also a PS [10:43:26] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:43:44] (03PS1) 10MMandere: site: Reimage cp1075 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/772366 (https://phabricator.wikimedia.org/T290005) [10:43:50] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:43:54] PROBLEM - Ensure local MW versions match expected deployment on mw1328 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:44:10] <_joe_> volans: yes something changed in scap [10:44:17] (03CR) 10jerkins-bot: [V: 04-1] site: Reimage cp1075 as cache::text_haproxy [puppet] - 
10https://gerrit.wikimedia.org/r/772366 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [10:44:40] PROBLEM - Ensure local MW versions match expected deployment on mw1347 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:44:40] PROBLEM - Ensure local MW versions match expected deployment on mw1318 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:44:43] ack [10:44:48] PROBLEM - Ensure local MW versions match expected deployment on mw2264 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:44:48] PROBLEM - Ensure local MW versions match expected deployment on mw2265 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:44:56] PROBLEM - Ensure local MW versions match expected deployment on mw1332 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:44:58] PROBLEM - Ensure local MW versions match expected deployment on mw2276 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:45:00] PROBLEM - Ensure local MW versions match expected deployment on mw2283 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:45:06] PROBLEM - Ensure local MW versions match expected deployment on mw1362 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:45:10] PROBLEM - Ensure local MW versions match expected deployment on mw2409 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:45:10] PROBLEM - Ensure local MW versions match expected deployment on mw2367 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:45:16] 
PROBLEM - Ensure local MW versions match expected deployment on parse2002 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:45:16] PROBLEM - Ensure local MW versions match expected deployment on mw2255 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:45:28] PROBLEM - Ensure local MW versions match expected deployment on mw1331 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:45:34] _joe_: can you link to it as I've lost it [10:45:36] PROBLEM - Ensure local MW versions match expected deployment on parse2004 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:45:46] PROBLEM - Ensure local MW versions match expected deployment on mw1316 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:45:46] PROBLEM - Ensure local MW versions match expected deployment on mw1346 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:45:46] PROBLEM - Ensure local MW versions match expected deployment on wtp1043 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:45:48] PROBLEM - Ensure local MW versions match expected deployment on mw2315 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:45:50] PROBLEM - Ensure local MW versions match expected deployment on mw1366 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:45:52] <_joe_> RhinosF1: are you going to merge it? 
[10:45:54] PROBLEM - Ensure local MW versions match expected deployment on mw2254 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:45:56] PROBLEM - Ensure local MW versions match expected deployment on mw2300 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:46:00] PROBLEM - Ensure local MW versions match expected deployment on mw2260 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:46:04] PROBLEM - Ensure local MW versions match expected deployment on snapshot1012 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:46:08] PROBLEM - Ensure local MW versions match expected deployment on mw1411 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:46:20] PROBLEM - Ensure local MW versions match expected deployment on mw2262 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:46:20] PROBLEM - Ensure local MW versions match expected deployment on mw2271 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:46:28] PROBLEM - Ensure local MW versions match expected deployment on mw1413 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:46:28] PROBLEM - Ensure local MW versions match expected deployment on wtp1047 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:46:30] PROBLEM - Ensure local MW versions match expected deployment on wtp1030 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:46:32] PROBLEM - Ensure local MW versions match expected deployment on mw1335 is CRITICAL: CRITICAL: 3 mismatched 
wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:46:32] PROBLEM - Ensure local MW versions match expected deployment on wtp1029 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:46:40] PROBLEM - Ensure local MW versions match expected deployment on mw2333 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:46:40] PROBLEM - Ensure local MW versions match expected deployment on mw2309 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:46:42] PROBLEM - Ensure local MW versions match expected deployment on mw1424 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:46:42] PROBLEM - Ensure local MW versions match expected deployment on mw1431 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:46:42] PROBLEM - Ensure local MW versions match expected deployment on mw2373 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:46:44] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:46:44] PROBLEM - Ensure local MW versions match expected deployment on mw1319 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:46:52] PROBLEM - Ensure local MW versions match expected deployment on mw1382 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:46:54] PROBLEM - Ensure local MW versions match expected deployment on mw1429 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:46:54] PROBLEM - Ensure local MW versions match expected deployment on mw2337 is 
CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:46:54] PROBLEM - Ensure local MW versions match expected deployment on mw2267 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:47:10] PROBLEM - Ensure local MW versions match expected deployment on mw1433 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:47:10] PROBLEM - Ensure local MW versions match expected deployment on mw2266 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:47:12] PROBLEM - Ensure local MW versions match expected deployment on mw2405 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:47:12] PROBLEM - Ensure local MW versions match expected deployment on mw2294 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:47:18] PROBLEM - Ensure local MW versions match expected deployment on mw1374 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:47:18] PROBLEM - Ensure local MW versions match expected deployment on wtp1044 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:47:22] PROBLEM - Ensure local MW versions match expected deployment on wtp1046 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:47:22] PROBLEM - Ensure local MW versions match expected deployment on mw2323 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:47:30] PROBLEM - Ensure local MW versions match expected deployment on mw2379 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:47:31] _joe_: no but 
for when someone asks next time [10:47:42] PROBLEM - Ensure local MW versions match expected deployment on mw2368 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:47:48] PROBLEM - Ensure local MW versions match expected deployment on parse2001 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:47:52] PROBLEM - Ensure local MW versions match expected deployment on mw1392 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:47:56] PROBLEM - Ensure local MW versions match expected deployment on mw1435 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:47:56] PROBLEM - Ensure local MW versions match expected deployment on mw1439 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:48:32] PROBLEM - Ensure local MW versions match expected deployment on mw1394 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:48:34] PROBLEM - Ensure local MW versions match expected deployment on mw2307 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:48:44] PROBLEM - Ensure local MW versions match expected deployment on mw1339 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:48:46] PROBLEM - Ensure local MW versions match expected deployment on mw1350 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:48:46] PROBLEM - Ensure local MW versions match expected deployment on mw1320 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:48:46] PROBLEM - Ensure local MW versions match expected deployment on mw1337 is CRITICAL: 
CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:48:47] (03PS2) 10MMandere: site: Reimage cp1075 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/772366 (https://phabricator.wikimedia.org/T290005) [10:48:54] PROBLEM - Ensure local MW versions match expected deployment on mw2326 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:48:56] PROBLEM - Ensure local MW versions match expected deployment on snapshot1010 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:49:06] PROBLEM - Ensure local MW versions match expected deployment on mw1441 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:49:06] PROBLEM - Ensure local MW versions match expected deployment on mw1422 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:49:06] PROBLEM - Ensure local MW versions match expected deployment on mw1426 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:49:16] PROBLEM - Ensure local MW versions match expected deployment on mw1443 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:49:26] PROBLEM - Ensure local MW versions match expected deployment on mw2392 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:49:36] PROBLEM - Ensure local MW versions match expected deployment on wtp1025 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:49:48] PROBLEM - Ensure local MW versions match expected deployment on mw1396 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:49:50] PROBLEM - Ensure local MW versions 
match expected deployment on mw1342 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:49:52] PROBLEM - Ensure local MW versions match expected deployment on mw1390 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:49:54] (03PS5) 10Jbond: P:environment: Add ablilty to inject environment variables [puppet] - 10https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) [10:49:56] PROBLEM - Ensure local MW versions match expected deployment on mw1409 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:50:02] PROBLEM - Ensure local MW versions match expected deployment on mw2391 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:50:02] PROBLEM - Ensure local MW versions match expected deployment on mw2335 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:50:16] PROBLEM - Ensure local MW versions match expected deployment on mw1454 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:50:46] PROBLEM - Ensure local MW versions match expected deployment on mw1351 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:50:58] PROBLEM - Ensure local MW versions match expected deployment on mw2311 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:50:58] PROBLEM - Ensure local MW versions match expected deployment on mw2286 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:51:18] PROBLEM - Ensure local MW versions match expected deployment on mw1403 is CRITICAL: CRITICAL: 3 mismatched wikiversions 
https://wikitech.wikimedia.org/wiki/Application_servers [10:51:20] PROBLEM - Ensure local MW versions match expected deployment on mw1345 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:51:20] (CR) jerkins-bot: [V: -1] P:environment: Add ablilty to inject environment variables [puppet] - https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) (owner: Jbond) [10:51:22] (PS5) Jbond: P:java: update profile::java to use systemd::environment [puppet] - https://gerrit.wikimedia.org/r/771415 [10:51:26] PROBLEM - Ensure local MW versions match expected deployment on mw1371 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:51:26] PROBLEM - Ensure local MW versions match expected deployment on mw1338 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:51:28] PROBLEM - Ensure local MW versions match expected deployment on mw1377 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:51:32] PROBLEM - Ensure local MW versions match expected deployment on mw1445 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:51:38] PROBLEM - Ensure local MW versions match expected deployment on mw1311 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:51:40] PROBLEM - Ensure local MW versions match expected deployment on mw2384 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:51:40] PROBLEM - Ensure local MW versions match expected deployment on mw2310 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:51:40] PROBLEM - Ensure local MW versions match expected deployment on mw2314 is CRITICAL:
CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:51:40] PROBLEM - Ensure local MW versions match expected deployment on mw2402 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:51:40] PROBLEM - Ensure local MW versions match expected deployment on mw2407 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:51:44] PROBLEM - Ensure local MW versions match expected deployment on mw2400 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:51:44] PROBLEM - Ensure local MW versions match expected deployment on mw2370 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:51:46] PROBLEM - Ensure local MW versions match expected deployment on snapshot1013 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:51:48] PROBLEM - Ensure local MW versions match expected deployment on mw1387 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:51:48] PROBLEM - Ensure local MW versions match expected deployment on mw1408 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:51:48] PROBLEM - Ensure local MW versions match expected deployment on mw1344 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:51:48] PROBLEM - Ensure local MW versions match expected deployment on mw1379 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:51:50] PROBLEM - Ensure local MW versions match expected deployment on mw1425 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:51:52] PROBLEM - Ensure 
local MW versions match expected deployment on mw2279 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:52:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P22877 and previous config saved to /var/cache/conftool/dbconfig/20220321-105159-marostegui.json [10:52:02] PROBLEM - Ensure local MW versions match expected deployment on mw2327 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:52:02] PROBLEM - Ensure local MW versions match expected deployment on parse2013 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:52:02] PROBLEM - Ensure local MW versions match expected deployment on mw2352 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:52:02] PROBLEM - Ensure local MW versions match expected deployment on parse2012 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:10] PROBLEM - Ensure local MW versions match expected deployment on mw2398 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:52:10] PROBLEM - Ensure local MW versions match expected deployment on mw2399 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:52:28] PROBLEM - Ensure local MW versions match expected deployment on mw2355 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:52:30] PROBLEM - Ensure local MW versions match expected deployment on mw2339 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers 
[10:52:35] (PS6) Jbond: P:environment: Add ablilty to inject environment variables [puppet] - https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) [10:52:40] PROBLEM - Ensure local MW versions match expected deployment on mw1397 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:52:40] PROBLEM - Ensure local MW versions match expected deployment on wtp1028 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:52:40] PROBLEM - Ensure local MW versions match expected deployment on mw2253 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:52:46] PROBLEM - Ensure local MW versions match expected deployment on mw1384 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:52:46] PROBLEM - Ensure local MW versions match expected deployment on mw2382 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:52:52] PROBLEM - Ensure local MW versions match expected deployment on mw2363 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:52:52] PROBLEM - Ensure local MW versions match expected deployment on mw2263 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:52:52] PROBLEM - Ensure local MW versions match expected deployment on mw2282 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:52:55] (CR) Jbond: [C: +2] P:java: update profile::java to use systemd::environment [puppet] - https://gerrit.wikimedia.org/r/771415 (owner: Jbond) [10:52:56] PROBLEM - Ensure local MW versions match expected deployment on mw2321 is CRITICAL: CRITICAL: 3 mismatched wikiversions
https://wikitech.wikimedia.org/wiki/Application_servers [10:52:56] PROBLEM - Ensure local MW versions match expected deployment on mw2316 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:52:56] PROBLEM - Ensure local MW versions match expected deployment on parse2011 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:53:04] PROBLEM - Ensure local MW versions match expected deployment on labweb1001 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:53:04] PROBLEM - Ensure local MW versions match expected deployment on mw1368 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:53:06] PROBLEM - Ensure local MW versions match expected deployment on mw2328 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:53:06] PROBLEM - Ensure local MW versions match expected deployment on mw2324 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:53:06] PROBLEM - Ensure local MW versions match expected deployment on mw2257 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:53:10] PROBLEM - Ensure local MW versions match expected deployment on mw2397 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:53:12] PROBLEM - Ensure local MW versions match expected deployment on mw1437 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:53:16] PROBLEM - Ensure local MW versions match expected deployment on mw2378 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:53:30] PROBLEM - Ensure local MW versions match expected 
deployment on mw1430 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:53:32] PROBLEM - Ensure local MW versions match expected deployment on mw2356 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:53:36] PROBLEM - Ensure local MW versions match expected deployment on mw2374 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:53:38] PROBLEM - Ensure local MW versions match expected deployment on mw2275 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:53:38] PROBLEM - Ensure local MW versions match expected deployment on mw1375 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:53:45] (CR) Giuseppe Lavagetto: [C: +2] check_mw_versions.py: Fix problem induced by recent scap changes [puppet] - https://gerrit.wikimedia.org/r/767242 (https://phabricator.wikimedia.org/T302832) (owner: Ahmon Dancy) [10:53:48] PROBLEM - Ensure local MW versions match expected deployment on mw2312 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:53:48] PROBLEM - Ensure local MW versions match expected deployment on mw2325 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:54:02] PROBLEM - Ensure local MW versions match expected deployment on mw2269 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:54:06] PROBLEM - Ensure local MW versions match expected deployment on mw2299 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:54:08] PROBLEM - Ensure local MW versions match expected deployment on mw2336 is CRITICAL: CRITICAL: 3 mismatched wikiversions
https://wikitech.wikimedia.org/wiki/Application_servers [10:54:16] PROBLEM - Ensure local MW versions match expected deployment on mw2329 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:54:16] PROBLEM - Ensure local MW versions match expected deployment on mw2284 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:54:24] PROBLEM - Ensure local MW versions match expected deployment on parse2008 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:54:30] PROBLEM - Ensure local MW versions match expected deployment on mw2380 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:54:40] PROBLEM - Ensure local MW versions match expected deployment on mwmaint2002 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:54:48] <_joe_> jbond: can you merge your change and mine please? 
[10:55:00] PROBLEM - Ensure local MW versions match expected deployment on mw2287 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:55:10] <_joe_> ah nevermind lock releases [10:55:12] PROBLEM - Ensure local MW versions match expected deployment on wtp1033 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:55:16] PROBLEM - Ensure local MW versions match expected deployment on mw2360 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:55:18] PROBLEM - Ensure local MW versions match expected deployment on mw2406 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:55:28] PROBLEM - Ensure local MW versions match expected deployment on mw2396 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:55:30] yes i think i was midway through :) [10:55:38] PROBLEM - Ensure local MW versions match expected deployment on mw1419 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:55:42] PROBLEM - Ensure local MW versions match expected deployment on mw2353 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:55:56] PROBLEM - Ensure local MW versions match expected deployment on mw1399 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:56:06] PROBLEM - Ensure local MW versions match expected deployment on mw1373 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:56:10] PROBLEM - Ensure local MW versions match expected deployment on mw2365 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:56:10] PROBLEM - Ensure local
MW versions match expected deployment on mw2288 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:56:52] PROBLEM - Ensure local MW versions match expected deployment on mw2411 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:56:54] PROBLEM - Ensure local MW versions match expected deployment on mw2272 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:57:06] PROBLEM - Ensure local MW versions match expected deployment on mw1421 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:57:06] PROBLEM - Ensure local MW versions match expected deployment on mw2372 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:58:28] PROBLEM - Ensure local MW versions match expected deployment on mw1348 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:58:32] RECOVERY - Ensure local MW versions match expected deployment on mw1311 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [10:58:34] RECOVERY - Ensure local MW versions match expected deployment on mw2402 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [10:58:42] RECOVERY - Ensure local MW versions match expected deployment on mw1425 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [10:58:46] PROBLEM - Ensure local MW versions match expected deployment on wtp1048 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:58:52] PROBLEM - Ensure local MW versions match expected deployment on mw1336 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:59:10] 
RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:59:28] RECOVERY - Ensure local MW versions match expected deployment on mw1316 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [10:59:28] PROBLEM - Ensure local MW versions match expected deployment on mw2350 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:59:34] RECOVERY - Ensure local MW versions match expected deployment on mw1384 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [10:59:36] RECOVERY - Ensure local MW versions match expected deployment on mw1416 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [10:59:40] RECOVERY - Ensure local MW versions match expected deployment on mw2363 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [10:59:52] RECOVERY - Ensure local MW versions match expected deployment on mw1368 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [10:59:52] PROBLEM - Ensure local MW versions match expected deployment on mw1343 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [10:59:58] PROBLEM - Ensure local MW versions match expected deployment on mw1329 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [11:00:04] RECOVERY - Ensure local MW versions match expected deployment on mw2378 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:00:12] RECOVERY - Ensure local MW versions match expected deployment on mw1413 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:00:22] RECOVERY - Ensure local MW versions match expected deployment on mw2356 
is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:00:24] RECOVERY - Ensure local MW versions match expected deployment on mw2373 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:00:26] RECOVERY - Ensure local MW versions match expected deployment on mw1375 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:00:26] RECOVERY - Ensure local MW versions match expected deployment on mw1319 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:00:32] PROBLEM - Ensure local MW versions match expected deployment on mw1380 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [11:00:36] RECOVERY - Ensure local MW versions match expected deployment on mw1429 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:00:42] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:00:52] RECOVERY - Ensure local MW versions match expected deployment on mw2266 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:00:54] RECOVERY - Ensure local MW versions match expected deployment on mw2336 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:01:04] RECOVERY - Ensure local MW versions match expected deployment on mw2329 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:01:22] RECOVERY - Ensure local MW versions match expected deployment on deploy2002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:01:34] RECOVERY - Ensure local MW versions match expected deployment on mw1392 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:01:38] RECOVERY - Ensure local MW versions match expected deployment on mw1439 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:01:48] PROBLEM - Ensure local MW versions match expected deployment on mw2295 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [11:01:48] RECOVERY - Ensure local MW versions match expected deployment on mw2287 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:01:50] no idea why `wikiversions in sync` is failing [11:01:54] we are in the middle of running the train [11:02:06] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:02:08] RECOVERY - Ensure local MW versions match expected deployment on mw2406 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:02:12] RECOVERY - Ensure local MW versions match expected deployment on mw1394 is 
OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:02:12] RECOVERY - Ensure local MW versions match expected deployment on mw1414 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:02:16] RECOVERY - Ensure local MW versions match expected deployment on mw2307 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:02:20] RECOVERY - Ensure local MW versions match expected deployment on mw2396 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:02:22] RECOVERY - Ensure local MW versions match expected deployment on mw1339 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:02:26] RECOVERY - Ensure local MW versions match expected deployment on mw1350 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:02:26] RECOVERY - Ensure local MW versions match expected deployment on mw1320 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:02:26] RECOVERY - Ensure local MW versions match expected deployment on mw1337 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:02:28] RECOVERY - Ensure local MW versions match expected deployment on mw1419 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:02:34] RECOVERY - Ensure local MW versions match expected deployment on mw2326 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:02:34] RECOVERY - Ensure local MW versions match expected deployment on mw2353 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:02:34] RECOVERY - Ensure local MW versions match expected deployment on snapshot1010 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers 
[11:02:46] RECOVERY - Ensure local MW versions match expected deployment on mw1441 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:02:46] RECOVERY - Ensure local MW versions match expected deployment on mw1426 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:02:46] RECOVERY - Ensure local MW versions match expected deployment on mw1399 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:02:46] RECOVERY - Ensure local MW versions match expected deployment on mw1422 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:02:54] RECOVERY - Ensure local MW versions match expected deployment on mw1443 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:02:56] RECOVERY - Ensure local MW versions match expected deployment on mw1373 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:03:00] RECOVERY - Ensure local MW versions match expected deployment on mw2365 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:03:00] RECOVERY - Ensure local MW versions match expected deployment on mw2289 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:03:00] RECOVERY - Ensure local MW versions match expected deployment on mw2288 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:03:06] RECOVERY - Ensure local MW versions match expected deployment on mw2392 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:03:14] RECOVERY - Ensure local MW versions match expected deployment on wtp1025 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:03:28] RECOVERY - Ensure local MW versions match expected deployment on mw1396 is OK: OKAY: 
wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:03:28] RECOVERY - Ensure local MW versions match expected deployment on mw1367 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:03:28] RECOVERY - Ensure local MW versions match expected deployment on mw1342 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:03:32] RECOVERY - Ensure local MW versions match expected deployment on mw1390 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:03:36] RECOVERY - Ensure local MW versions match expected deployment on mw1409 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:03:36] RECOVERY - Ensure local MW versions match expected deployment on mw1412 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:03:36] RECOVERY - Ensure local MW versions match expected deployment on mw1420 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:03:44] RECOVERY - Ensure local MW versions match expected deployment on mw2391 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:03:44] RECOVERY - Ensure local MW versions match expected deployment on mw2411 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:03:44] RECOVERY - Ensure local MW versions match expected deployment on mw2335 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:03:44] RECOVERY - Ensure local MW versions match expected deployment on mw2272 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:03:56] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:03:58] RECOVERY - Ensure local MW versions match 
expected deployment on mw1421 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:03:58] RECOVERY - Ensure local MW versions match expected deployment on mw1454 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:03:58] RECOVERY - Ensure local MW versions match expected deployment on mw2372 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:04:28] RECOVERY - Ensure local MW versions match expected deployment on mw1328 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:04:30] RECOVERY - Ensure local MW versions match expected deployment on mw1351 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:04:42] RECOVERY - Ensure local MW versions match expected deployment on mw2311 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:04:42] RECOVERY - Ensure local MW versions match expected deployment on mw2286 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:02] RECOVERY - Ensure local MW versions match expected deployment on mw1403 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:04] RECOVERY - Ensure local MW versions match expected deployment on mw1447 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:06] RECOVERY - Ensure local MW versions match expected deployment on mw1345 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:12] RECOVERY - Ensure local MW versions match expected deployment on mw1371 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:12] RECOVERY - Ensure local MW versions match expected deployment on mw1377 is OK: OKAY: wikiversions in sync 
https://wikitech.wikimedia.org/wiki/Application_servers [11:05:14] RECOVERY - Ensure local MW versions match expected deployment on mw1338 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:18] RECOVERY - Ensure local MW versions match expected deployment on mw1347 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:18] RECOVERY - Ensure local MW versions match expected deployment on mw1445 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:18] RECOVERY - Ensure local MW versions match expected deployment on mw1318 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:20] RECOVERY - Ensure local MW versions match expected deployment on mw1450 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:22] RECOVERY - Ensure local MW versions match expected deployment on mw1348 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:26] RECOVERY - Ensure local MW versions match expected deployment on mw2384 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:26] RECOVERY - Ensure local MW versions match expected deployment on mw2310 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:26] RECOVERY - Ensure local MW versions match expected deployment on mw2314 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:26] RECOVERY - Ensure local MW versions match expected deployment on mw2407 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:28] RECOVERY - Ensure local MW versions match expected deployment on mw2265 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:28] RECOVERY - Ensure local MW 
versions match expected deployment on mw2264 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:30] RECOVERY - Ensure local MW versions match expected deployment on mw2400 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:30] RECOVERY - Ensure local MW versions match expected deployment on mw2370 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:30] RECOVERY - Ensure local MW versions match expected deployment on snapshot1013 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:34] RECOVERY - Ensure local MW versions match expected deployment on mw1387 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:34] RECOVERY - Ensure local MW versions match expected deployment on mw1408 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:34] RECOVERY - Ensure local MW versions match expected deployment on mw1344 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:34] RECOVERY - Ensure local MW versions match expected deployment on mw1379 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:34] RECOVERY - Ensure local MW versions match expected deployment on mw1332 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:36] RECOVERY - Ensure local MW versions match expected deployment on mw2276 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:36] RECOVERY - Ensure local MW versions match expected deployment on mw2279 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:38] RECOVERY - Ensure local MW versions match expected deployment on wtp1048 is OK: OKAY: wikiversions in sync 
https://wikitech.wikimedia.org/wiki/Application_servers [11:05:38] RECOVERY - Ensure local MW versions match expected deployment on mw2283 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:42] RECOVERY - Ensure local MW versions match expected deployment on mw1362 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:44] RECOVERY - Ensure local MW versions match expected deployment on mw1336 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:48] RECOVERY - Ensure local MW versions match expected deployment on mw2327 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:48] RECOVERY - Ensure local MW versions match expected deployment on mw2409 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:48] RECOVERY - Ensure local MW versions match expected deployment on mw2367 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:48] RECOVERY - Ensure local MW versions match expected deployment on parse2012 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:48] RECOVERY - Ensure local MW versions match expected deployment on parse2013 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:49] RECOVERY - Ensure local MW versions match expected deployment on mw2352 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:52] RECOVERY - Ensure local MW versions match expected deployment on mw2399 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:52] RECOVERY - Ensure local MW versions match expected deployment on mw2398 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:52] RECOVERY - Ensure local 
MW versions match expected deployment on parse2002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:05:52] RECOVERY - Ensure local MW versions match expected deployment on mw2255 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:04] RECOVERY - Ensure local MW versions match expected deployment on mw1331 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:10] RECOVERY - Ensure local MW versions match expected deployment on mw2355 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:10] RECOVERY - Ensure local MW versions match expected deployment on parse2004 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:10] RECOVERY - Ensure local MW versions match expected deployment on mw2339 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:16] RECOVERY - Ensure local MW versions match expected deployment on mw1418 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:18] RECOVERY - Ensure local MW versions match expected deployment on mw2350 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:20] RECOVERY - Ensure local MW versions match expected deployment on mw1397 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:20] RECOVERY - Ensure local MW versions match expected deployment on mw1346 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:20] RECOVERY - Ensure local MW versions match expected deployment on wtp1028 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:20] RECOVERY - Ensure local MW versions match expected deployment on wtp1043 is OK: OKAY: wikiversions in sync 
https://wikitech.wikimedia.org/wiki/Application_servers [11:06:20] RECOVERY - Ensure local MW versions match expected deployment on mw2315 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:22] RECOVERY - Ensure local MW versions match expected deployment on mw2253 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:26] RECOVERY - Ensure local MW versions match expected deployment on mw1366 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:26] RECOVERY - Ensure local MW versions match expected deployment on mw2382 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:28] RECOVERY - Ensure local MW versions match expected deployment on mw1449 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:28] RECOVERY - Ensure local MW versions match expected deployment on mw2254 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:32] RECOVERY - Ensure local MW versions match expected deployment on mw2300 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:34] RECOVERY - Ensure local MW versions match expected deployment on mw2263 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:34] RECOVERY - Ensure local MW versions match expected deployment on mw2260 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:34] RECOVERY - Ensure local MW versions match expected deployment on mw2282 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:36] RECOVERY - Ensure local MW versions match expected deployment on mw2316 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:36] RECOVERY - Ensure local MW 
versions match expected deployment on mw2321 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:36] RECOVERY - Ensure local MW versions match expected deployment on parse2011 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:40] RECOVERY - Ensure local MW versions match expected deployment on snapshot1012 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:40] RECOVERY - Ensure local MW versions match expected deployment on mw1448 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:44] RECOVERY - Ensure local MW versions match expected deployment on mw1411 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:46] RECOVERY - Ensure local MW versions match expected deployment on labweb1001 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:46] RECOVERY - Ensure local MW versions match expected deployment on mw1343 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:46] RECOVERY - Ensure local MW versions match expected deployment on mw2328 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:46] RECOVERY - Ensure local MW versions match expected deployment on mw2324 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:46] RECOVERY - Ensure local MW versions match expected deployment on mw2257 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:50] RECOVERY - Ensure local MW versions match expected deployment on mw2397 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:50] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:06:52] RECOVERY - Ensure local MW versions match expected deployment on mw1329 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:54] RECOVERY - Ensure local MW versions match expected deployment on mw1437 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:56] RECOVERY - Ensure local MW versions match expected deployment on mw2262 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:06:56] RECOVERY - Ensure local MW versions match expected deployment on mw2271 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:07:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P22878 and previous config saved to /var/cache/conftool/dbconfig/20220321-110705-marostegui.json [11:07:06] RECOVERY - Ensure local MW versions match expected deployment on wtp1047 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:08] RECOVERY - Ensure local MW versions match expected deployment on wtp1030 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:07:10] RECOVERY - Ensure local MW versions match expected deployment on mw1335 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:07:10] RECOVERY - Ensure local MW versions match expected deployment on wtp1029 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:07:14] RECOVERY - Ensure local MW versions match expected deployment on mw1430 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:07:16] RECOVERY - Ensure local MW 
versions match expected deployment on mw2333 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:07:16] RECOVERY - Ensure local MW versions match expected deployment on mw2309 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:07:18] RECOVERY - Ensure local MW versions match expected deployment on mw1417 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:07:18] RECOVERY - Ensure local MW versions match expected deployment on mw1424 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:07:18] RECOVERY - Ensure local MW versions match expected deployment on mw1431 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:07:18] RECOVERY - Ensure local MW versions match expected deployment on mw2374 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:07:20] RECOVERY - Ensure local MW versions match expected deployment on mw2275 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:07:26] RECOVERY - Ensure local MW versions match expected deployment on mw1380 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:07:30] RECOVERY - Ensure local MW versions match expected deployment on mw1382 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:07:32] RECOVERY - Ensure local MW versions match expected deployment on mw2312 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:07:32] RECOVERY - Ensure local MW versions match expected deployment on mw2325 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:07:32] RECOVERY - Ensure local MW versions match expected deployment on mw2337 is OK: OKAY: wikiversions in sync 
https://wikitech.wikimedia.org/wiki/Application_servers [11:07:32] RECOVERY - Ensure local MW versions match expected deployment on mw2267 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:07:44] RECOVERY - Ensure local MW versions match expected deployment on mw2269 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:07:48] RECOVERY - Ensure local MW versions match expected deployment on mw1433 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:07:48] RECOVERY - Ensure local MW versions match expected deployment on mw2299 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:07:48] RECOVERY - Ensure local MW versions match expected deployment on mw2405 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:07:48] RECOVERY - Ensure local MW versions match expected deployment on mw2294 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:07:56] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:07:56] RECOVERY - Ensure local MW versions match expected deployment on wtp1044 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:07:56] RECOVERY - Ensure local MW versions match expected deployment on mw1374 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:08:00] RECOVERY - Ensure local MW versions match expected deployment on wtp1046 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:08:00] RECOVERY - Ensure local MW versions match expected deployment on mw2323 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:08:00] RECOVERY - Ensure local MW versions match 
expected deployment on mw2284 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:08:08] RECOVERY - Ensure local MW versions match expected deployment on mw2379 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:08:08] RECOVERY - Ensure local MW versions match expected deployment on parse2008 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:08:16] RECOVERY - Ensure local MW versions match expected deployment on mw2380 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:08:20] RECOVERY - Ensure local MW versions match expected deployment on mw2368 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:08:26] RECOVERY - Ensure local MW versions match expected deployment on parse2001 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:08:26] RECOVERY - Ensure local MW versions match expected deployment on mwmaint2002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:08:34] RECOVERY - Ensure local MW versions match expected deployment on mw1435 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:08:46] RECOVERY - Ensure local MW versions match expected deployment on mw1415 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:08:46] RECOVERY - Ensure local MW versions match expected deployment on mw2295 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:08:58] RECOVERY - Ensure local MW versions match expected deployment on wtp1033 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:09:02] RECOVERY - Ensure local MW versions match expected deployment on mw2360 is OK: OKAY: wikiversions in sync 
https://wikitech.wikimedia.org/wiki/Application_servers [11:09:06] RECOVERY - Ensure local MW versions match expected deployment on mw1313 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [11:12:24] so for some reason scap / rsync is still running [11:12:26] SRE, SRE-Access-Requests: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (jbond) p:Triage→Medium [11:12:34] we are letting it run and having lunch [11:12:46] that should get us testwikis promoted to 1.39.0-wmf.1 when done [11:14:21] SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (dcaro) FYI. I got a few emails like this one regarding cloudvirt1016 this weekend: ` Date: Sat, 19 Mar 2022 04:07:39 +0000 From: r... [11:15:04] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:16:26] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:18:07] (CR) Filippo Giunchedi: [C: +1] Introduce cert-manager alerts [alerts] - https://gerrit.wikimedia.org/r/771687 (https://phabricator.wikimedia.org/T304092) (owner: JMeybohm) [11:19:20] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:20:33] SRE, Data-Catalog, Data-Engineering, serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (BTullis) I'm sorry to be a pain, but I'm under some pressure to implement this new service as soon as it's practicable, for which I really need help from #servi...
[11:22:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298557)', diff saved to https://phabricator.wikimedia.org/P22879 and previous config saved to /var/cache/conftool/dbconfig/20220321-112210-marostegui.json [11:22:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [11:22:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [11:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:16] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [11:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T298557)', diff saved to https://phabricator.wikimedia.org/P22880 and previous config saved to /var/cache/conftool/dbconfig/20220321-112217-marostegui.json [11:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:26] SRE, SRE-Access-Requests: Requesting access to releaser for MarkAHershberger - https://phabricator.wikimedia.org/T302287 (jbond) [11:23:23] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:25:45] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:28:11] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:29:33] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:30:45] SRE, SRE-Access-Requests: Requesting access to releaser for MarkAHershberger - https://phabricator.wikimedia.org/T302287 (jbond) [11:31:03] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:31:23] the eqiad/codfw BFD alerts should be related to the Telia transport [11:32:56] there seems to be planned work afaics [11:33:19] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@cbc85d3] (codfw): Update kartotherian to 2ef5c2d [11:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:39] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:33:45] SRE, SRE-Access-Requests: Requesting access to releaser for MarkAHershberger - https://phabricator.wikimedia.org/T302287 (jbond) p:Triage→Medium @KFrancis are you able to confirm NDA status for Mark? I don't see them in the spreadsheet, thanks @thcipriani would you be the correct person to act as... 
[11:34:55] SRE, observability, serviceops, Patch-For-Review: aggregate mismatched wikiversions alert - https://phabricator.wikimedia.org/T302832 (jbond) p:Triage→Medium [11:35:04] !log jnuche@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.1 refs T300203 (duration: 81m 15s) [11:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:09] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [11:35:36] SRE, SRE-OnFire (FY2021/2022-Q3), Data-Engineering, Event-Platform, and 2 others: Banner sampling leading to a relatively wide site outage (mostly esams) - https://phabricator.wikimedia.org/T303036 (jbond) p:Triage→Medium [11:36:10] SRE, DNS, Traffic, Traffic-Icebox, Sustainability (Incident Followup): Automate DNS depools such that manual commits are not required - https://phabricator.wikimedia.org/T303219 (jbond) p:Triage→Medium [11:36:10] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@cbc85d3] (codfw): Update kartotherian to 2ef5c2d (duration: 02m 51s) [11:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:28] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@cbc85d3] (eqiad): Update kartotherian to 2ef5c2d [11:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:10] SRE, ops-esams, DC-Ops: ripe-atlas-esams down - https://phabricator.wikimedia.org/T303242 (jbond) p:Triage→Medium [11:38:08] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@cbc85d3] (eqiad): Update kartotherian to 2ef5c2d (duration: 01m 40s) [11:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:07] (PS1) Vgutierrez: check_ssl: Set a 46 hours warning threshold for OCSP responses [puppet] - https://gerrit.wikimedia.org/r/772373 
(https://phabricator.wikimedia.org/T304047) [11:39:35] SRE, Znuny, serviceops, Patch-For-Review: Move VTRS db passwords to a different hiera location - https://phabricator.wikimedia.org/T303272 (jbond) p:Triage→Medium [11:41:13] !log jnuche@deploy1002 Pruned MediaWiki: 1.38.0-wmf.25 (duration: 01m 32s) [11:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:26] SRE, Traffic, serviceops: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T303305 (jbond) p:Triage→Medium [11:41:50] SRE, Traffic, Patch-For-Review: OCSP staple validity alerts/warnings misfire - https://phabricator.wikimedia.org/T304047 (Vgutierrez) p:Triage→Medium [11:42:03] SRE, ConfirmEdit (CAPTCHA extension), Platform Engineering, Wikimedia-Site-requests, and 3 others: Allow Stewards to enable 'emergency CAPTCHAs' for anonymous IP edits - https://phabricator.wikimedia.org/T303433 (jbond) p:Triage→Medium [11:42:55] SRE, Data-Engineering, Traffic, Trust-and-Safety, serviceops: Disable GeoIP Legacy Download - https://phabricator.wikimedia.org/T303464 (jbond) p:Triage→Medium [11:45:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298557)', diff saved to https://phabricator.wikimedia.org/P22881 and previous config saved to /var/cache/conftool/dbconfig/20220321-114527-marostegui.json [11:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:31] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [11:46:30] (PS3) Phuedx: Remove unused wgWMESearchRelevancePages config variable [mediawiki-config] - https://gerrit.wikimedia.org/r/769748 [11:46:33] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:50:54] SRE, 
Performance-Team, Traffic, serviceops: Potential navtiming_responseStart regression as of 13 Mar 2022 - https://phabricator.wikimedia.org/T303782 (jbond) p:Triage→Medium [11:51:27] SRE, ops-codfw, DC-Ops, cloud-services-team (Kanban): Is it possible to put more RAM in cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet? - https://phabricator.wikimedia.org/T303840 (jbond) p:Triage→Medium [11:51:27] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:53:41] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:54:17] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:54:45] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:56:10] (CR) Vgutierrez: [C: +1] site: Reimage cp1075 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/772366 (https://phabricator.wikimedia.org/T290005) (owner: MMandere) [11:56:35] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:57:17] SRE, Wikimedia-Etherpad, serviceops: Etherpads corrupted - https://phabricator.wikimedia.org/T304005 (jbond) p:Triage→Medium [11:57:31] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:58:30] SRE, Infrastructure-Foundations, Traffic, netops, Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (jbond) p:Triage→Medium [11:58:35] PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:58:51] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:59:02] SRE, Traffic, serviceops: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T303305 (Joe) Open→Resolved a:Joe This happened during an outage. That is the tls terminator of the application servers (envoy) circuit-breaking... [12:00:04] Deploy window Automatic 🚂🧪Trainsperiment Week branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220321T1200) [12:00:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P22882 and previous config saved to /var/cache/conftool/dbconfig/20220321-120032-marostegui.json [12:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:15] SRE, Gerrit, GitLab, Horizon, and 2 others: Take some pointers from GitHub security updates - https://phabricator.wikimedia.org/T304231 (jbond) p:Triage→Medium @Reedy is the suggestion to implement this for production ssh or just for gitlab? FTR i think both are useful but it will dictat... 
[12:02:55] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:07:16] (PS1) TrainBranchBot: Branch commit for wmf/1.39.0-wmf.2 [core] (wmf/1.39.0-wmf.2) - https://gerrit.wikimedia.org/r/772375 [12:07:18] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.39.0-wmf.2 [core] (wmf/1.39.0-wmf.2) - https://gerrit.wikimedia.org/r/772375 (owner: TrainBranchBot) [12:07:29] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:14:49] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:15:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P22883 and previous config saved to /var/cache/conftool/dbconfig/20220321-121537-marostegui.json [12:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:33] We are resuming the train after lunch break. Now going for group 0 [12:16:43] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:17:17] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:18:59] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:20:27] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:22:57] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.1 refs T300203 [12:22:57] (Merged) jenkins-bot: Branch commit for wmf/1.39.0-wmf.2 [core] (wmf/1.39.0-wmf.2) - https://gerrit.wikimedia.org/r/772375 (owner: TrainBranchBot) [12:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:02] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [12:25:17] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:25:49] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:28:45] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:29:03] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:11] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:30:10] SRE, LDAP-Access-Requests: Grant Access to Superset for 
OTichonova - https://phabricator.wikimedia.org/T303376 (10jbond) [12:30:20] 10SRE, 10LDAP-Access-Requests: Grant Access to Superset for OTichonova - https://phabricator.wikimedia.org/T303376 (10jbond) p:05Triage→03Medium [12:30:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298557)', diff saved to https://phabricator.wikimedia.org/P22884 and previous config saved to /var/cache/conftool/dbconfig/20220321-123042-marostegui.json [12:30:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [12:30:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [12:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:30:48] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [12:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T298557)', diff saved to https://phabricator.wikimedia.org/P22885 and previous config saved to /var/cache/conftool/dbconfig/20220321-123055-marostegui.json [12:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[12:39:25] (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/771998 (https://phabricator.wikimedia.org/T303752) (owner: Stang) [12:41:59] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:42:29] SRE, LDAP-Access-Requests: Grant Access to ldap/nda for Dale_Zhou - https://phabricator.wikimedia.org/T303702 (jbond) Open→Resolved a:jbond @MGerlach i have you to the nda ldap group and you should be able to access the requested sites, please re-open if you have any issues [12:42:38] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: Revert "group0 wikis to 1.38.0-wmf.26" [12:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:51] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:45:07] SRE, Data-Catalog, Data-Engineering, serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (Ottomata) > I haven't created TLS certificates for datahub.wikimedia.org I don't believe you will need a cert for this, IIUC it should use the wikimedia.org wil...
[12:45:17] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for ShubhankarP - https://phabricator.wikimedia.org/T303703 (10jbond) 05Open→03Resolved a:03jbond i have now added Shubhankar to the NDA group, please re-open if any issues [12:45:57] PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:46:25] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:46:44] (03PS1) 10Jaime Nuche: testwikis wikis to 1.39.0-wmf.1 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772390 [12:46:46] (03CR) 10Jaime Nuche: [C: 03+2] testwikis wikis to 1.39.0-wmf.1 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772390 (owner: 10Jaime Nuche) [12:46:48] (03PS1) 10Jaime Nuche: group0 wikis to 1.39.0-wmf.1 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772391 [12:46:50] (03CR) 10Jaime Nuche: [C: 03+2] group0 wikis to 1.39.0-wmf.1 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772391 (owner: 10Jaime Nuche) [12:46:52] (03PS1) 10Jaime Nuche: Revert "group0 wikis to 1.39.0-wmf.1 refs T300203" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772392 [12:46:54] (03CR) 10Jaime Nuche: [C: 03+2] Revert "group0 wikis to 1.39.0-wmf.1 refs T300203" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772392 (owner: 10Jaime Nuche) [12:47:01] 10SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for User:TheDJ - https://phabricator.wikimedia.org/T304120 (10jbond) p:05Triage→03Medium [12:47:08] 10SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for User:TheDJ - https://phabricator.wikimedia.org/T304120 (10jbond) @KFrancis are you able to confirm or arrange NDA status for @TheDJ [12:48:17] (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.1 refs T300203 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/772390 (owner: 10Jaime Nuche) [12:48:24] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.1 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772391 (owner: 10Jaime Nuche) [12:48:35] (03Merged) 10jenkins-bot: Revert "group0 wikis to 1.39.0-wmf.1 refs T300203" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772392 (owner: 10Jaime Nuche) [12:48:47] !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-test-eqiad cluster: Roll restart of jvm daemons. [12:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:17] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:53:58] 10SRE, 10LDAP-Access-Requests: Grant Access to Superset for OTichonova - https://phabricator.wikimedia.org/T303376 (10jbond) @OTichonova you are already part of the WMF ldap group which should give you access to all the services you are requesting. are you seeing an error? [12:54:12] 10SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for User:TheDJ - https://phabricator.wikimedia.org/T304120 (10TheDJ) Related older NDA ticket: {T127430} [12:55:03] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 75, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:56:19] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:56:19] (03CR) 10Ottomata: [C: 03+1] P:java: update profile::java to use systemd::environment [puppet] - 10https://gerrit.wikimedia.org/r/771415 (owner: 10Jbond) [12:58:57] !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-test-eqiad cluster: Roll restart of jvm daemons. 
[12:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:43] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 4/5 UP : 4 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: (Dis)respected human, time to deploy UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220321T1300). Please do the needful. [13:00:05] Lucas_WMDE, koi, phuedx, and Urbanecm: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] * urbanecm waves [13:00:19] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:00:24] o/ [13:00:30] 0 0 [13:00:32] o/ [13:00:36] Lucas_WMDE: can you deploy please? 
[13:00:39] sure [13:00:42] (I can test my patch, but I'd prefer not to deploy today) [13:00:47] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:01:23] my patch is volunteer stuff but I’ll test it as Lucas_WMDE and spare everyone the charade of logging into IRC twice ^^ [13:01:28] alright, let’s look at the calendar [13:02:05] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:02:20] (PS3) Lucas Werkmeister (WMDE): Remove changetags right from users on wikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/771691 (https://phabricator.wikimedia.org/T303682) (owner: Lucas Werkmeister) [13:02:57] (CR) Lucas Werkmeister (WMDE): [C: +2] "I’ll check on mwdebug that this has only the intended effect on the effective groups and no other change (the diffConfig job isn’t as usef" [mediawiki-config] - https://gerrit.wikimedia.org/r/771691 (https://phabricator.wikimedia.org/T303682) (owner: Lucas Werkmeister) [13:03:13] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:03:44] (Merged) jenkins-bot: Remove changetags right from users on wikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/771691 (https://phabricator.wikimedia.org/T303682) (owner: Lucas Werkmeister) [13:04:13] there’s some “Duplicate entry '…' for key 'PRIMARY'” stuff in logspam-watch, known? [13:04:19] SRE, LDAP-Access-Requests: Grant Access to Superset for OTichonova - https://phabricator.wikimedia.org/T303376 (Urbanecm) >>! In T303376#7792402, @jbond wrote: > @OTichonova you are already part of the WMF ldap group which should give you access to all the services you are requesting. are you seeing an...
[13:04:24] seems to be the same db host [13:05:08] testing my config change on mwdebug1001 [13:05:35] not known to me [13:06:43] hashar, jnuche: do you know about those errors? [13:07:34] hi [13:07:41] yeah [13:07:46] I reopened a task about it [13:08:05] ok thanks [13:08:08] or more precisely : Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-rollback-callbacks') [13:08:09] just checking [13:08:19] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:08:20] ah, that thing [13:08:26] joy [13:08:33] which seems to hide the actual error : Duplicate entry '17604325-1515299' for key 'PRIMARY' (db1157) INSERT INTO echo_notification (notification_event,notification_user,notification_timestamp,notification_read_timestamp,notification_bundle_hash) VALUES (1515299,17604325,'20220321124005',NULL,'') db1157 [13:08:56] yup, all the logspam-watch messages are for db1157 too [13:08:58] and jnuche has file a task about EchoNotifications having an incorrect timestamp [13:09:01] I guess it’s one of the primaries [13:09:01] so maybe they are related [13:09:06] yup [13:09:11] for mediawiki.org I guess [13:09:27] anyways, my config change seems to have the desired effect in https://www.wikidata.org/w/api.php?action=query&format=json&meta=siteinfo&formatversion=2&siprop=usergroups, and no other changes, so I’ll go ahead and sync that [13:09:46] (03CR) 10Filippo Giunchedi: "I like it overall!" [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [13:10:41] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:11:20] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:771691|Remove changetags right from users on wikidatawiki and testwikidatawiki (T303682)]] (while keeping applychangetags right) (duration: 00m 49s) [13:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:24] T303682: Requested wiki configuration change: remove changetags right from users on wikidatawiki - https://phabricator.wikimedia.org/T303682 [13:12:17] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:13:08] (03PS2) 10Lucas Werkmeister (WMDE): Create "pagemover" group at azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771998 (https://phabricator.wikimedia.org/T303752) (owner: 10Stang) [13:14:32] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Create "pagemover" group at azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771998 (https://phabricator.wikimedia.org/T303752) (owner: 10Stang) [13:15:23] (03Merged) 10jenkins-bot: Create "pagemover" group at azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771998 (https://phabricator.wikimedia.org/T303752) (owner: 10Stang) [13:16:02] koi: your change is on mwdebug1001, please test it [13:17:15] lgtm, thanks [13:17:20] ok thanks [13:18:32] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:771998|Create "pagemover" group at azwiki (T303752)]] (duration: 00m 50s) [13:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:37] T303752: Add page mover user group to az.wiki and disable "move/move-categorypages" permissions for autoconfirmed - https://phabricator.wikimedia.org/T303752 [13:18:59] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Remove unused wgWMESearchRelevancePages 
config variable (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769748 (owner: 10Phuedx) [13:19:03] (03PS4) 10Lucas Werkmeister (WMDE): Remove unused wgWMESearchRelevancePages config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769748 (owner: 10Phuedx) [13:20:00] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove unused wgWMESearchRelevancePages config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769748 (owner: 10Phuedx) [13:20:03] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [13:20:42] !log otto@deploy1002 Started deploy [analytics/refinery@11909fa] (hadoop-test): gobblin-wmf-core-1.0.1 - T297939 [13:20:43] (03Merged) 10jenkins-bot: Remove unused wgWMESearchRelevancePages config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769748 (owner: 10Phuedx) [13:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:46] T297939: Update refinery gobblin jars to use new gobblin-wmf jars and update puppet gobblin jobs - https://phabricator.wikimedia.org/T297939 [13:21:24] phuedx: your change is on mwdebug1001, can you test it? [13:21:28] (just check that nothing breaks, I guess) [13:21:42] Lucas_WMDE: I'll check that nothing breaks (ext.wikimediaEvents RL module is sent etc) [13:21:47] ok thanks [13:23:21] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:24:35] Lucas_WMDE: LGTM. Confirmed that WikimediaEvents-related ResourceLoader module is delivered without error and that some instruments are working as expected [13:24:41] great, thanks! 
[13:25:05] syncing [13:25:54] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:769748|Remove unused wgWMESearchRelevancePages config variable]] (duration: 00m 50s) [13:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:05] (PS2) Lucas Werkmeister (WMDE): Revert "ptwiki: Disable Growth's image recommendation" [mediawiki-config] - https://gerrit.wikimedia.org/r/771923 (https://phabricator.wikimedia.org/T304095) (owner: Urbanecm) [13:27:42] Lucas_WMDE: Thanks :) [13:27:47] np :) [13:28:07] (CR) Lucas Werkmeister (WMDE): [C: +2] Revert "ptwiki: Disable Growth's image recommendation" [mediawiki-config] - https://gerrit.wikimedia.org/r/771923 (https://phabricator.wikimedia.org/T304095) (owner: Urbanecm) [13:28:30] urbanecm: I’m intrigued by this /* growthexperiments-addimage-summary-summary: 1 */ summary in https://phabricator.wikimedia.org/T304095 [13:28:44] is it implemented in the same way as in Wikibase? some format autocomment/autosummary hook?
[13:28:49] (03Merged) 10jenkins-bot: Revert "ptwiki: Disable Growth's image recommendation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771923 (https://phabricator.wikimedia.org/T304095) (owner: 10Urbanecm) [13:29:16] urbanecm: change is on mwdebug1001 now [13:29:35] !log otto@deploy1002 Finished deploy [analytics/refinery@11909fa] (hadoop-test): gobblin-wmf-core-1.0.1 - T297939 (duration: 08m 53s) [13:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:39] T297939: Update refinery gobblin jars to use new gobblin-wmf jars and update puppet gobblin jobs - https://phabricator.wikimedia.org/T297939 [13:30:31] Lucas_WMDE: yes [13:30:41] nice, thanks [13:30:42] and testing [13:31:54] Lucas_WMDE: works fine on my end (https://pt.wikipedia.org/w/index.php?title=Quadratura_do_quadrado&diff=63232608&oldid=49170747) [13:31:57] please sync [13:32:03] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:32:22] thanks [13:32:51] syncing [13:33:03] (03PS15) 10MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) [13:33:23] Lucas_WMDE: if you're wondering, edit summary is parsed in https://github.com/wikimedia/mediawiki-extensions-GrowthExperiments/blob/master/includes/HomepageHooks.php#L1284 [13:33:26] 10SRE, 10Traffic-Icebox: backport ipvsadm>=1.30 to buster-wikimedia or buster-backports - https://phabricator.wikimedia.org/T263788 (10CDanis) a:05CDanis→03None [13:33:35] yup, I’d found the same code already :) [13:33:36] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:771923|Revert "ptwiki: Disable Growth's image recommendation" (T304095)]] (duration: 00m 49s) [13:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:40] good :) [13:33:40] T304095: ptwiki: 
Re-deploy add an image - https://phabricator.wikimedia.org/T304095 [13:33:43] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:33:58] alright, backport window done I think [13:33:59] it's the solution we went for to avoid issues with user's language vs wiki's language (sometimes, edit summaries ended up in a different language than what we want them in) [13:34:02] unless anyone wants to squeeze in at the last second [13:34:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298557)', diff saved to https://phabricator.wikimedia.org/P22887 and previous config saved to /var/cache/conftool/dbconfig/20220321-133407-marostegui.json [13:34:08] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [13:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:11] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [13:34:12] 10SRE, 10Traffic: Wikimedia domains unreachable (16 Mar 2022) - https://phabricator.wikimedia.org/T303903 (10jbond) 05Open→03Resolved a:03jbond @AlexisJazz thanks for the report it appears that there was a small bip in traffic when we turned on our new DRMRS PoP. it seems the issue lasted only a few sec... [13:34:50] !log UTC afternoon backport window done [13:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:50] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/769943 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [13:37:05] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:37:16] (CR) Filippo Giunchedi: "LGTM overall, see inline" [puppet] - https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: MVernon) [13:37:47] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:39:12] (CR) Filippo Giunchedi: [C: +1] "LGTM! Thanks for taking a look" [puppet] - https://gerrit.wikimedia.org/r/772373 (https://phabricator.wikimedia.org/T304047) (owner: Vgutierrez) [13:40:29] (CR) MVernon: "Honestly not sure about this - the puppetmaster code always uses Array[String], which is why I use it thus?" [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon) [13:40:41] SRE, Wikimedia-Mailing-lists: Mailman3: 550-Support for list subscription via email has been disabled. - https://phabricator.wikimedia.org/T303888 (jbond) p:Triage→Medium @Ladsgroup wonder if you would know if its preferable to update the message or enable email subscriptions? [13:41:07] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:42:55] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:44:57] SRE, ops-codfw: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T303585 (MoritzMuehlenhoff) Open→Resolved Thanks, I've readded the new disk to the RAIDs (md0/md1 done, md2 will still be rebuilding for another 10 hours).
[13:45:53] (03PS2) 10Giuseppe Lavagetto: varnish/tests: improve UX, refactor run.py [puppet] - 10https://gerrit.wikimedia.org/r/771863 [13:46:33] (03CR) 10jerkins-bot: [V: 04-1] varnish/tests: improve UX, refactor run.py [puppet] - 10https://gerrit.wikimedia.org/r/771863 (owner: 10Giuseppe Lavagetto) [13:47:02] (03CR) 10JMeybohm: [C: 03+2] Introduce cert-manager alerts [alerts] - 10https://gerrit.wikimedia.org/r/771687 (https://phabricator.wikimedia.org/T304092) (owner: 10JMeybohm) [13:47:09] (03PS3) 10Giuseppe Lavagetto: varnish/tests: improve UX, refactor run.py [puppet] - 10https://gerrit.wikimedia.org/r/771863 [13:47:52] (03CR) 10Filippo Giunchedi: swift: deploy swift_ring_manager to one node per cluster (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [13:48:55] (03CR) 10Vgutierrez: [C: 03+2] check_ssl: Set a 46 hours warning threshold for OCSP responses [puppet] - 10https://gerrit.wikimedia.org/r/772373 (https://phabricator.wikimedia.org/T304047) (owner: 10Vgutierrez) [13:49:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P22888 and previous config saved to /var/cache/conftool/dbconfig/20220321-134912-marostegui.json [13:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:47] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:50:43] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/772398 [13:51:00] (03CR) 10Kosta Harlan: [C: 03+2] "Let's try again" [deployment-charts] - 10https://gerrit.wikimedia.org/r/772398 (owner: 10Kosta Harlan) [13:51:07] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10JMeybohm) I'm trying to 
get back to this today/tomorrow. You don't need to create any TLS certificates and we can use Ingress for both, frontend and gms. [13:52:16] (03PS7) 10JMeybohm: Introduce cert-manager alerts [alerts] - 10https://gerrit.wikimedia.org/r/771687 (https://phabricator.wikimedia.org/T304092) [13:55:06] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/772398 (owner: 10Kosta Harlan) [13:55:33] !log otto@deploy1002 Started deploy [analytics/refinery@11909fa] (hadoop-test): gobblin-wmf-core-1.0.1 - T297939 [13:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:38] T297939: Update refinery gobblin jars to use new gobblin-wmf jars and update puppet gobblin jobs - https://phabricator.wikimedia.org/T297939 [13:55:40] !log otto@deploy1002 Finished deploy [analytics/refinery@11909fa] (hadoop-test): gobblin-wmf-core-1.0.1 - T297939 (duration: 00m 06s) [13:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:16] (03PS1) 10Hashar: Revert "Call IDatabase::timestamp before inserting rows" [extensions/Echo] (wmf/1.39.0-wmf.1) - 10https://gerrit.wikimedia.org/r/772350 (https://phabricator.wikimedia.org/T304307) [13:59:36] (03CR) 10Hashar: [C: 03+2] Revert "Call IDatabase::timestamp before inserting rows" [extensions/Echo] (wmf/1.39.0-wmf.1) - 10https://gerrit.wikimedia.org/r/772350 (https://phabricator.wikimedia.org/T304307) (owner: 10Hashar) [13:59:48] (03PS1) 10Hashar: Revert "Call IDatabase::timestamp before inserting rows" [extensions/Echo] (wmf/1.39.0-wmf.2) - 10https://gerrit.wikimedia.org/r/772351 (https://phabricator.wikimedia.org/T304307) [14:00:31] (03CR) 10Hashar: [C: 03+2] Revert "Call IDatabase::timestamp before inserting rows" [extensions/Echo] (wmf/1.39.0-wmf.2) - 10https://gerrit.wikimedia.org/r/772351 (https://phabricator.wikimedia.org/T304307) (owner: 10Hashar) [14:00:47] RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 
(protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:01:15] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:55] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [14:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:11] 10SRE, 10Performance-Team, 10Traffic, 10serviceops: Potential navtiming_responseStart regression as of 13 Mar 2022 - https://phabricator.wikimedia.org/T303782 (10Vgutierrez) latest round of HAProxy reimages were performed between March 7th and March 8th: ` * 4d58564f87 - site: Reimage cp1083 as cache::text... [14:02:45] (03PS3) 10JMeybohm: Remove LVS for miscweb [puppet] - 10https://gerrit.wikimedia.org/r/770504 (https://phabricator.wikimedia.org/T290966) [14:02:54] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [14:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:59] (03PS2) 10JMeybohm: Move miscweb from it's own LVS VIP to k8s-ingress-wikikube [dns] - 10https://gerrit.wikimedia.org/r/770506 (https://phabricator.wikimedia.org/T290966) [14:03:37] !log kharlan@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [14:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:41] !log kharlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [14:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:04] !log kharlan@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [14:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to 
https://phabricator.wikimedia.org/P22889 and previous config saved to /var/cache/conftool/dbconfig/20220321-140417-marostegui.json [14:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:25] !log depool cp1074 - T290005 [14:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:29] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [14:06:58] mmandere: I think that you mean cp1075 [14:07:06] !log kharlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [14:07:08] !log depool cp1075 - T290005 [14:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:41] vgutierrez: correct, that was a typo [14:09:07] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:09:31] !log kharlan@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [14:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:10] !log kharlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [14:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:57] 10ops-eqiad: Eqiad: Remove lnk between fpc5 and fcp6 in row b - https://phabricator.wikimedia.org/T304316 (10Papaul) [14:14:59] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:17:27] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:18:46] (03CR) 10MMandere: [C: 03+2] site: Reimage cp1075 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/772366 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [14:19:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298557)', diff saved to https://phabricator.wikimedia.org/P22890 and previous config saved to /var/cache/conftool/dbconfig/20220321-141922-marostegui.json [14:19:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [14:19:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [14:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:33] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [14:19:34] (03Merged) 10jenkins-bot: Revert "Call IDatabase::timestamp before inserting rows" [extensions/Echo] (wmf/1.39.0-wmf.1) - 10https://gerrit.wikimedia.org/r/772350 (https://phabricator.wikimedia.org/T304307) (owner: 10Hashar) [14:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:53] (03PS1) 10Giuseppe Lavagetto: varnish::frontend: install dynamic actions everywhere [puppet] - 10https://gerrit.wikimedia.org/r/772401 [14:19:56] jnuche: change merged :] [14:19:59] (03Merged) 10jenkins-bot: Revert "Call IDatabase::timestamp before inserting rows" [extensions/Echo] (wmf/1.39.0-wmf.2) - 10https://gerrit.wikimedia.org/r/772351 (https://phabricator.wikimedia.org/T304307) 
(owner: 10Hashar) [14:20:26] (03CR) 10jerkins-bot: [V: 04-1] varnish::frontend: install dynamic actions everywhere [puppet] - 10https://gerrit.wikimedia.org/r/772401 (owner: 10Giuseppe Lavagetto) [14:20:45] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:23:27] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:23:28] (03PS5) 10MVernon: puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - 10https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) [14:23:41] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:24:00] (03PS2) 10Giuseppe Lavagetto: varnish::frontend: install dynamic actions everywhere [puppet] - 10https://gerrit.wikimedia.org/r/772401 [14:24:04] (03CR) 10MVernon: puppetmaster: rsync swift rings from each cluster's ring manager (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [14:24:17] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:26:13] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34453/console" [puppet] - 10https://gerrit.wikimedia.org/r/772401 (owner: 10Giuseppe Lavagetto) [14:26:35] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:28:38] !log hashar@deploy1002 Synchronized php-1.39.0-wmf.1/extensions/Echo: Revert "Call IDatabase::timestamp before inserting rows" - T304307 (duration: 00m 52s) [14:28:41] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:43] T304307: Wikimedia\Timestamp\TimestampException: Wikimedia\Timestamp\ConvertibleTimestamp::getTimestamp: The timestamp cannot be represented in the specified format - https://phabricator.wikimedia.org/T304307 [14:29:16] (03PS1) 10Hashar: group0 wikis to 1.39.0-wmf.1 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772404 [14:29:20] (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.39.0-wmf.1 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772404 (owner: 10Hashar) [14:30:04] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.1 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772404 (owner: 10Hashar) [14:30:09] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp1075.eqiad.wmnet with OS buster [14:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:22] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp1075.eqiad.wmnet with OS buster [14:31:20] !log restarting Apache on gerrit2001 and gerrit1001 [14:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:23] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.1 refs T300203 [14:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:28] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [14:31:48] (03CR) 10Filippo Giunchedi: puppetmaster: rsync swift rings from each cluster's ring manager (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: 
10MVernon) [14:32:01] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:32:18] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] varnish::frontend: install dynamic actions everywhere [puppet] - 10https://gerrit.wikimedia.org/r/772401 (owner: 10Giuseppe Lavagetto) [14:34:17] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:34:34] I am checking logstash [14:35:45] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:35:57] !log Restarting CI Zuul server [14:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:57] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:41:33] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:42:18] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10BTullis) @RobH - Sorry to hijack this thread. Do you happen to know whether the PERC H750 will support JBOD mode? In the past we've had to use single-drive RAID0 volumes to get the same effect. It works,... 
[14:43:48] !log oblivian@puppetmaster1001 conftool action : set/enabled=false; selector: name=parameter_q,cluster=cache-text [14:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:11] (03CR) 10Giuseppe Lavagetto: [C: 03+2] cache: turn on dynamic bans on all of eqsin [puppet] - 10https://gerrit.wikimedia.org/r/769389 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [14:45:18] (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:45:21] (03PS1) 10Tchanders: Allow autoconfirmed users to view basic IP information [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772408 (https://phabricator.wikimedia.org/T303858) [14:47:00] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1075.eqiad.wmnet with reason: host reimage [14:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:30] (03PS1) 10SBassett: admin: replace existing ssh key for sbassett [puppet] - 10https://gerrit.wikimedia.org/r/772410 (https://phabricator.wikimedia.org/T304319) [14:49:15] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:49:27] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:49:41] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1075.eqiad.wmnet with reason: host reimage [14:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:39] (03CR) 10MVernon: swift::ring: deploy by tarball not individual files (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/769943 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [14:50:49] (03PS5) 10MVernon: swift::ring: deploy by tarball not individual files [puppet] - 
10https://gerrit.wikimedia.org/r/769943 (https://phabricator.wikimedia.org/T265117) [14:51:58] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:52:17] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [14:52:23] 10SRE, 10ops-eqiad: Eqiad: Remove link between fpc5 and fcp6 in row b - https://phabricator.wikimedia.org/T304316 (10ayounsi) [14:53:16] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:54:22] !log otto@deploy1002 Started deploy [analytics/refinery@33f66db] (hadoop-test): gobblin-wmf-core-1.0.1 - T297939 [14:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:26] T297939: Update refinery gobblin jars to use new gobblin-wmf jars and update puppet gobblin jobs - https://phabricator.wikimedia.org/T297939 [14:55:28] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:55:58] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/769943 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [14:56:46] 10SRE, 10ops-eqiad: Eqiad: Remove link between fpc5 and fcp6 in row b - https://phabricator.wikimedia.org/T304316 (10ayounsi) Good catch! ` asw2-b-eqiad> show virtual-chassis vc-port member 5 fpc5: -------------------------------------------------------------------------- Interface Type Trunk... 
[14:56:58] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:58:03] !log asw2-b-eqiad> request virtual-chassis vc-port delete pic-slot 1 port 2 member 5 - T304316 [14:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:07] T304316: Eqiad: Remove link between fpc5 and fcp6 in row b - https://phabricator.wikimedia.org/T304316 [14:58:13] (03CR) 10JMeybohm: Add helm charts and a helmfile configuration for datahub (0314 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [14:58:40] RECOVERY - Confd vcl based reload on cp5006 is OK: reload-vcl successfully ran 0h, 3 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [14:59:24] 10SRE, 10ops-eqiad: Eqiad: Remove link between fpc5 and fcp6 in row b - https://phabricator.wikimedia.org/T304316 (10ayounsi) ` asw2-b-eqiad> request virtual-chassis vc-port delete pic-slot 1 port 1 member 6 fpc6: -------------------------------------------------------------------------- vc-port successful... [14:59:42] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:00:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T300775)', diff saved to https://phabricator.wikimedia.org/P22891 and previous config saved to /var/cache/conftool/dbconfig/20220321-150044-marostegui.json [15:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:48] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [15:01:18] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:01:31] !log otto@deploy1002 Finished deploy [analytics/refinery@33f66db] (hadoop-test): gobblin-wmf-core-1.0.1 - T297939 (duration: 07m 10s) [15:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:36] T297939: Update refinery gobblin jars to use new gobblin-wmf jars and update puppet gobblin jobs - https://phabricator.wikimedia.org/T297939 [15:01:52] (03PS1) 10Ottomata: gobblin - use gobblin-wmf-core-1.0.1 in hadoop-test [puppet] - 10https://gerrit.wikimedia.org/r/772416 (https://phabricator.wikimedia.org/T297939) [15:03:02] (03PS1) 10Klausman: hiera: Add ML staging k8s ctrl node config [puppet] - 10https://gerrit.wikimedia.org/r/772417 (https://phabricator.wikimedia.org/T302195) [15:04:07] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34454/console" [puppet] - 10https://gerrit.wikimedia.org/r/772417 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [15:04:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [15:04:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [15:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298557)', diff saved to https://phabricator.wikimedia.org/P22893 and previous config saved to /var/cache/conftool/dbconfig/20220321-150417-marostegui.json [15:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:22] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [15:04:54] 
RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:05:28] jouncebot: now [15:05:29] No deployments scheduled for the next 0 hour(s) and 24 minute(s) [15:05:32] (03PS2) 10Klausman: hiera: Add ML staging k8s ctrl node config [puppet] - 10https://gerrit.wikimedia.org/r/772417 (https://phabricator.wikimedia.org/T302195) [15:06:30] PROBLEM - Disk space on deneb is CRITICAL: DISK CRITICAL - free space: / 13319 MB (5% inode=64%): /tmp 13319 MB (5% inode=64%): /var/tmp 13319 MB (5% inode=64%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deneb&var-datasource=codfw+prometheus/ops [15:07:12] 10SRE, 10ops-eqiad: Eqiad: Remove link between fpc5 and fcp6 in row b - https://phabricator.wikimedia.org/T304316 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson Removed the cable from both b5 and b6 [15:07:58] (03PS3) 10Klausman: hiera: Add ML staging k8s ctrl node config [puppet] - 10https://gerrit.wikimedia.org/r/772417 (https://phabricator.wikimedia.org/T302195) [15:08:08] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:08:41] going to promote group 1 wikis to 1.39.0-wmf.1 [15:08:44] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:09:13] (03PS5) 10Elukey: profile::logstash::production: use base truststore [puppet] - 10https://gerrit.wikimedia.org/r/763172 (https://phabricator.wikimedia.org/T300130) [15:09:46] (03PS4) 10Klausman: hiera: Add ML staging k8s ctrl node config [puppet] - 10https://gerrit.wikimedia.org/r/772417 (https://phabricator.wikimedia.org/T302195) [15:10:18] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:10:45] 
(03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34457/console" [puppet] - 10https://gerrit.wikimedia.org/r/763172 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [15:11:08] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:13:24] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:13:51] 10SRE, 10SRE Observability, 10observability, 10Sustainability (Incident Followup): Most Icinga http checks ignore the URL parameter - https://phabricator.wikimedia.org/T304321 (10Volans) p:05Triage→03High [15:14:35] jouncebot: nowandnext [15:14:35] No deployments scheduled for the next 0 hour(s) and 15 minute(s) [15:14:35] In 0 hour(s) and 15 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220321T1530) [15:14:48] 10SRE, 10Infrastructure-Foundations, 10serviceops: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (10Volans) [15:15:06] 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Management interface SSH icinga alerts - https://phabricator.wikimedia.org/T304289 (10ayounsi) [15:15:15] Reedy: hashar was running the train now I think, in case you want to deploy something [15:15:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P22894 and previous config saved to /var/cache/conftool/dbconfig/20220321-151549-marostegui.json [15:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:15] (03PS5) 10Klausman: hiera: Add ML staging k8s ctrl node config [puppet] - 10https://gerrit.wikimedia.org/r/772417 
(https://phabricator.wikimedia.org/T302195) [15:17:47] (03CR) 10jerkins-bot: [V: 04-1] hiera: Add ML staging k8s ctrl node config [puppet] - 10https://gerrit.wikimedia.org/r/772417 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [15:18:13] (03PS1) 10ArielGlenn: Revert "Normalise maintenance requires" [extensions/Flow] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/772354 [15:18:30] 10SRE, 10SRE Observability, 10observability, 10Sustainability (Incident Followup): Unquoted URL parameter - https://phabricator.wikimedia.org/T304323 (10Volans) p:05Triage→03High [15:19:08] (03PS1) 10ArielGlenn: Revert "Normalise maintenance requires" [extensions/Flow] (wmf/1.39.0-wmf.1) - 10https://gerrit.wikimedia.org/r/772355 [15:19:14] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1075.eqiad.wmnet with OS buster [15:19:16] (03PS1) 10Jaime Nuche: group1 wikis to 1.39.0-wmf.1 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772420 [15:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:18] (03CR) 10Jaime Nuche: [C: 03+2] group1 wikis to 1.39.0-wmf.1 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772420 (owner: 10Jaime Nuche) [15:19:21] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp1075.eqiad.wmnet with OS buster com... 
[15:19:35] (03CR) 10Reedy: [C: 03+2] Revert "Normalise maintenance requires" [extensions/Flow] (wmf/1.39.0-wmf.1) - 10https://gerrit.wikimedia.org/r/772355 (owner: 10ArielGlenn) [15:19:43] (03CR) 10Reedy: [C: 03+2] Revert "Normalise maintenance requires" [extensions/Flow] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/772354 (owner: 10ArielGlenn) [15:19:47] 10SRE, 10Infrastructure-Foundations, 10serviceops: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (10Volans) [15:19:51] 10SRE, 10Infrastructure-Foundations, 10serviceops: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (10Volans) [15:20:17] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.1 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772420 (owner: 10Jaime Nuche) [15:20:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298557)', diff saved to https://phabricator.wikimedia.org/P22895 and previous config saved to /var/cache/conftool/dbconfig/20220321-152041-marostegui.json [15:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:46] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [15:21:16] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:21:33] (03PS1) 10ArielGlenn: Revert "Normalise maintenance requires" [extensions/Flow] (wmf/1.39.0-wmf.2) - 10https://gerrit.wikimedia.org/r/772356 [15:21:34] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772421 (https://phabricator.wikimedia.org/T128546) [15:21:38] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.1 refs T300203 [15:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:43] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [15:21:52] (03CR) 10Reedy: [C: 03+2] Revert "Normalise maintenance requires" [extensions/Flow] (wmf/1.39.0-wmf.2) - 10https://gerrit.wikimedia.org/r/772356 (owner: 10ArielGlenn) [15:22:56] (03CR) 10Cwhite: [C: 03+2] profile::logstash::production: use base truststore [puppet] - 10https://gerrit.wikimedia.org/r/763172 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [15:23:04] (03PS7) 10Razzi: karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562) [15:23:32] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:23:32] !log jnuche@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.1 refs T300203 (duration: 01m 54s) [15:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:53] (03CR) 10jerkins-bot: [V: 04-1] karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562) (owner: 10Razzi) [15:24:32] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:24:33] (03PS8) 10Razzi: karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562) [15:25:18] (03PS6) 10Klausman: hiera: Add ML staging k8s ctrl node config [puppet] - 10https://gerrit.wikimedia.org/r/772417 (https://phabricator.wikimedia.org/T302195) [15:26:28] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34461/console" [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562) (owner: 10Razzi) [15:27:02] (03CR) 10Razzi: [V: 03+1] "Good points @Btullis; updated." 
[puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562) (owner: 10Razzi) [15:27:02] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:27:03] !log pool cp1075 with HAProxy as TLS termination layer - T290005 [15:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:07] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [15:28:21] (03CR) 10Jbond: "took a pass lgtm" [software/conftool] - 10https://gerrit.wikimedia.org/r/772342 (owner: 10Giuseppe Lavagetto) [15:28:52] (03PS7) 10Klausman: hiera: Add ML staging k8s ctrl node config [puppet] - 10https://gerrit.wikimedia.org/r/772417 (https://phabricator.wikimedia.org/T302195) [15:29:38] (03CR) 10Jbond: [C: 03+2] puppet_compiler: fix facts processing script [puppet] - 10https://gerrit.wikimedia.org/r/771453 (owner: 10Jbond) [15:30:05] jan_drewniak: May I have your attention please! Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220321T1530) [15:30:54] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:30:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P22896 and previous config saved to /var/cache/conftool/dbconfig/20220321-153054-marostegui.json [15:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:53] (03PS8) 10Klausman: hiera: Add ML staging k8s ctrl node config [puppet] - 10https://gerrit.wikimedia.org/r/772417 (https://phabricator.wikimedia.org/T302195) [15:33:53] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772421 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:34:12] (03Merged) 10jenkins-bot: Revert "Normalise maintenance requires" [extensions/Flow] (wmf/1.39.0-wmf.1) - 10https://gerrit.wikimedia.org/r/772355 (owner: 10ArielGlenn) [15:34:41] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772421 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:35:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P22897 and previous config saved to /var/cache/conftool/dbconfig/20220321-153547-marostegui.json [15:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:56] (03Merged) 10jenkins-bot: Revert "Normalise maintenance requires" [extensions/Flow] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/772354 (owner: 10ArielGlenn) [15:37:05] (03Merged) 10jenkins-bot: Revert "Normalise maintenance requires" [extensions/Flow] (wmf/1.39.0-wmf.2) - 10https://gerrit.wikimedia.org/r/772356 (owner: 10ArielGlenn) [15:37:58] (03CR) 10Ottomata: [C: 03+2] gobblin - use gobblin-wmf-core-1.0.1 in hadoop-test [puppet] - 10https://gerrit.wikimedia.org/r/772416 
(https://phabricator.wikimedia.org/T297939) (owner: 10Ottomata) [15:38:40] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:38:43] Reedy: you can deploy the Flow backports :) [15:39:21] 10SRE, 10Gerrit, 10GitLab, 10Horizon, and 2 others: Take some pointers from GitHub security updates - https://phabricator.wikimedia.org/T304231 (10Reedy) I think we should try and implement it everywhere (eventually). Might be worth forking into subtasks for services/areas. And keep this as the overall tra... [15:39:28] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:772421| Bumping portals to master (T128546)]] (duration: 00m 53s) [15:39:30] jbond: ok to merge your pcc change? [15:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:32] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:39:45] ottomata: ahh yes please sorry forgot about that one [15:39:47] (03PS9) 10Klausman: hiera: Add ML staging k8s ctrl node config [puppet] - 10https://gerrit.wikimedia.org/r/772417 (https://phabricator.wikimedia.org/T302195) [15:39:53] done ty [15:39:55] ty [15:40:12] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:40:22] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:772421| Bumping portals to master (T128546)]] (duration: 00m 53s) [15:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:26] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:43:32] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:43:48] !log reedy@deploy1002 Synchronized php-1.39.0-wmf.1/extensions/Flow/maintenance: T304318 (duration: 00m 51s) [15:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:52] T304318: Flow dumps are broken on all wikis due to MediaWIki update - https://phabricator.wikimedia.org/T304318 [15:44:14] !log razzi@cumin1001 START - Cookbook sre.wikireplicas.update-views [15:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:49] !log reedy@deploy1002 Synchronized php-1.38.0-wmf.26/extensions/Flow/maintenance: T304318 (duration: 00m 49s) [15:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T300775)', diff saved to https://phabricator.wikimedia.org/P22898 and previous config saved to /var/cache/conftool/dbconfig/20220321-154559-marostegui.json [15:46:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance [15:46:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance [15:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:03] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [15:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T300775)', diff saved to https://phabricator.wikimedia.org/P22899 and previous config saved to /var/cache/conftool/dbconfig/20220321-154607-marostegui.json [15:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:37] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/769943 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [15:46:52] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:47:01] (03CR) 10Filippo Giunchedi: P:icinga: add profile for performance tweaking (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771610 (owner: 10Ssingh) [15:47:12] thanks for those deploys, Ree dy [15:47:17] (03PS10) 10Klausman: hiera: Add ML staging k8s ctrl node config [puppet] - 10https://gerrit.wikimedia.org/r/772417 (https://phabricator.wikimedia.org/T302195) [15:47:46] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:48:33] 10SRE: WMF-NDA vs ldap NDA - https://phabricator.wikimedia.org/T304329 (10jbond) [15:49:24] 10SRE, 10Infrastructure-Foundations, 10serviceops: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (10JMeybohm) [15:49:36] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:49:46] (03PS1) 10Jgiannelos: Remove unused wgKartographerDfltStyle after tegola roll out [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772428 (https://phabricator.wikimedia.org/T280767) [15:50:24] !log otto@deploy1002 Started deploy [analytics/refinery@33f66db] (hadoop-test): fix prometheus pushgateway url - T294420 [15:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:28] !log otto@deploy1002 Finished deploy [analytics/refinery@33f66db] (hadoop-test): fix prometheus pushgateway url - T294420 (duration: 00m 03s) [15:50:29] T294420: Send some existing Gobblin metrics to prometheus - 
https://phabricator.wikimedia.org/T294420 [15:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:37] !log otto@deploy1002 Started deploy [analytics/refinery@cd7bf7a] (hadoop-test): fix prometheus pushgateway url - T294420 [15:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:46] SRE, Traffic: OCSP staple validity alerts/warnings misfire - https://phabricator.wikimedia.org/T304047 (Vgutierrez) Open→Resolved a:Vgutierrez Fix deployed, I'm closing the task now, feel free to reopen if the issue happens again. Thanks! [15:50:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P22900 and previous config saved to /var/cache/conftool/dbconfig/20220321-155052-marostegui.json [15:50:54] (PS1) Klausman: hiera: add dummy tokens for ML staging k8s setup [labs/private] - https://gerrit.wikimedia.org/r/772430 [15:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:58] (CR) Jgiannelos: [C: -1] "Blocking until deployment" [mediawiki-config] - https://gerrit.wikimedia.org/r/772428 (https://phabricator.wikimedia.org/T280767) (owner: Jgiannelos) [15:51:58] (PS2) Jgiannelos: Remove unused wgKartographerDfltStyle after tegola roll out [mediawiki-config] - https://gerrit.wikimedia.org/r/772428 (https://phabricator.wikimedia.org/T298249) [15:52:10] (CR) SBassett: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/771696 (https://phabricator.wikimedia.org/T304111) (owner: SBassett) [15:52:43] SRE, ops-eqiad, DC-Ops, serviceops: Q2: (Need By: TBD) rack/setup/install kubestage100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T290894 (akosiaris) [15:53:59] jouncebot: nowandnext [15:53:59] For the next 0 hour(s) and 6 minute(s): Wikimedia Portals Update 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220321T1530) [15:53:59] In 1 hour(s) and 6 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220321T1700) [15:54:17] running the envoy update in codfw, it'll be noisy in here but no other impact expected [15:54:25] (PS2) Reedy: Set StopForumSpam to enforce on the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/771696 (https://phabricator.wikimedia.org/T304111) (owner: SBassett) [15:54:29] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/apertium: apply [15:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:32] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/apertium: apply [15:54:33] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply [15:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:35] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply [15:54:36] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [15:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:37] (CR) Reedy: [C: +2] Set StopForumSpam to enforce on the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/771696 (https://phabricator.wikimedia.org/T304111) (owner: SBassett) [15:54:39] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [15:54:40] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [15:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:43] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [15:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:43] !log rzl@deploy1002 helmfile 
[staging] START helmfile.d/services/eventgate-analytics: apply [15:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:46] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [15:54:46] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [15:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:10] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:55:22] SRE, LDAP-Access-Requests: Grant Access to Superset for OTichonova - https://phabricator.wikimedia.org/T303376 (jbond) @Urbanecm thanks for the update @Ottomata @odimitrijevic could one of you please approve access to the analytics-privatedata-users, group for OTichonova [15:55:26] (Merged) jenkins-bot: Set StopForumSpam to enforce on the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/771696 (https://phabricator.wikimedia.org/T304111) (owner: SBassett) [15:55:50] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:55:53] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/apertium: apply [15:55:53] (CR) Ssingh: P:icinga: add profile for performance tweaking (1 comment) [puppet] - https://gerrit.wikimedia.org/r/771610 (owner: Ssingh) [15:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:16] SRE-OnFire (FY2021/2022-Q2): 2021-11-05 TOC language converter - https://phabricator.wikimedia.org/T299966 (herron) a:lmata→herron [15:56:24] SRE-OnFire (FY2021/2022-Q2): 2021-11-04 large file upload timeouts - https://phabricator.wikimedia.org/T299965 (herron) a:lmata→herron [15:56:28] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/apertium: apply [15:56:29] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/blubberoid: apply [15:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:35] !log reedy@deploy1002 Synchronized wmf-config/CommonSettings-labs.php: T304111 (duration: 00m 50s) [15:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:39] T304111: Set StopForumSpam to enforce on beta cluster - https://phabricator.wikimedia.org/T304111 [15:57:03] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: apply [15:57:04] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [15:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:20] PROBLEM - Check systemd state on archiva1002 is CRITICAL: CRITICAL - degraded: The following units failed: archiva.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:50] !log rzl@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/cxserver: apply [15:57:51] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [15:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:55] !log otto@deploy1002 Finished deploy [analytics/refinery@cd7bf7a] (hadoop-test): fix prometheus pushgateway url - T294420 (duration: 07m 18s) [15:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:59] T294420: Send some existing Gobblin metrics to prometheus - https://phabricator.wikimedia.org/T294420 [15:58:04] (PS2) Klausman: hiera: add dummy tokens for ML staging k8s setup [labs/private] - https://gerrit.wikimedia.org/r/772430 [15:58:25] (CR) Ssingh: P:icinga: add profile for performance tweaking (1 comment) [puppet] - https://gerrit.wikimedia.org/r/771610 (owner: Ssingh) [15:58:39] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [15:58:41] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [15:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:36] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [15:59:37] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [15:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:49] (PS1) MMandere: site: Reimage cp1077 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/772431 (https://phabricator.wikimedia.org/T290005) [16:00:15] (PS3) Klausman: hiera: add dummy tokens for ML staging k8s setup [labs/private] - 
https://gerrit.wikimedia.org/r/772430 [16:00:27] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [16:00:29] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [16:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:47] (PS1) Jbond: admin: update ssh key for sbassett [puppet] - https://gerrit.wikimedia.org/r/772432 (https://phabricator.wikimedia.org/T304319) [16:02:12] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [16:02:14] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [16:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:06] RECOVERY - Check systemd state on archiva1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:03:28] SRE, LDAP-Access-Requests: Grant Access to Superset for OTichonova - https://phabricator.wikimedia.org/T303376 (Ottomata) Approved. FYI from the looks of it this is ssh-less access, so no ssh-key needed. 
[16:03:40] (CR) Jbond: [C: +2] admin: update ssh key for sbassett [puppet] - https://gerrit.wikimedia.org/r/772432 (https://phabricator.wikimedia.org/T304319) (owner: Jbond) [16:03:48] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:04:04] jouncebot: nowandnext [16:04:04] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye [16:04:04] No deployments scheduled for the next 0 hour(s) and 55 minute(s) [16:04:04] In 0 hour(s) and 55 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220321T1700) [16:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:17] SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with O... 
[16:04:21] assuming the master gate-and-submit passes, I’d like to do a small backport to wmf.1 and also wmf.2, if that’s okay [16:04:28] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:04:31] sbassett: change is merged now it should be rolled out in ~30 mins [16:04:33] (backporting this change: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/772411) [16:05:41] (PS1) Eigyan: [wmf-config]: Deploy Safety Survey to EN, ES, FR, & PT wikis on BETA [mediawiki-config] - https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) [16:05:43] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [16:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298557)', diff saved to https://phabricator.wikimedia.org/P22901 and previous config saved to /var/cache/conftool/dbconfig/20220321-160557-marostegui.json [16:05:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [16:06:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [16:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:01] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [16:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:56] (CR) Vgutierrez: site: Reimage cp1077 as cache::text_haproxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/772431 (https://phabricator.wikimedia.org/T290005) (owner: MMandere) [16:08:26] 
jbond: thanks! [16:09:10] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:09:20] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (cmooney) @MatthewVernon I'm still working through this with Juniper. No breakthrough yet unfortunately (it's been escalated up through their support tiers).... [16:09:27] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mathoid: apply [16:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:07] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply [16:10:08] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [16:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:33] (PS2) Eigyan: [wmf-config]: Deploy Safety Survey to EN, ES, FR, & PT wikis on BETA [mediawiki-config] - https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) [16:10:49] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [16:10:50] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/proton: apply [16:10:50] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (MatthewVernon) Thanks for the update :) I'm content to wait for now; I'll get back to you if I start to need some of these servers more pressingly. [16:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:56] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:11:02] !log pt1979@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1024.eqiad.wmnet with OS bullseye [16:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:11] SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bu... [16:11:21] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/proton: apply [16:11:23] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [16:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:44] SRE, Gerrit, GitLab, Horizon, and 2 others: Take some pointers from GitHub security updates - https://phabricator.wikimedia.org/T304231 (Paladox) we don't have git:// enabled for gerrit, we have had ECDSA and Ed25519 for years now and we can disable the ssh algorithms we don't want using the gerr... 
[16:11:56] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [16:11:57] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [16:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:04] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:12:23] (Abandoned) Tchanders: Autopromote-once users to the 'ipinfo-viewer' group after one edit [mediawiki-config] - https://gerrit.wikimedia.org/r/767845 (https://phabricator.wikimedia.org/T296184) (owner: Tchanders) [16:12:37] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [16:12:38] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply [16:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:35] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [16:13:37] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply [16:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:39] (PS2) Elukey: Set bullseye + overlayfs for kubernetes1007 [puppet] - https://gerrit.wikimedia.org/r/770440 (https://phabricator.wikimedia.org/T300744) [16:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:41] (PS1) DCausse: elasticsearch: fix var reference [cookbooks] - https://gerrit.wikimedia.org/r/772434 [16:13:43] (PS1) DCausse: elasticsearch: drop support for pausing writes [cookbooks] - https://gerrit.wikimedia.org/r/772435 [16:14:07] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [16:14:09] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:14] PROBLEM - SSH on thumbor2003.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:14:58] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:15:32] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1025.eqiad.wmnet with OS bullseye [16:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:44] SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1025.eqiad.wmnet with O... [16:15:50] (PS1) Jbond: admin: Add otich to analytics-private-data [puppet] - https://gerrit.wikimedia.org/r/772436 (https://phabricator.wikimedia.org/T303376) [16:15:56] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:15:58] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:16:25] (Abandoned) DCausse: elasticsearch: fix var reference [cookbooks] - https://gerrit.wikimedia.org/r/772434 (owner: DCausse) [16:16:40] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:17:24] (Abandoned) DCausse: elasticsearch: drop support for pausing writes [cookbooks] - https://gerrit.wikimedia.org/r/772435 (owner: DCausse) [16:18:10] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:18:14] (PS3) Eigyan: [wmf-config]: Deploy Safety Survey to EN, ES, FR, & PT wikis on BETA [mediawiki-config] - https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) [16:19:15] SRE, SRE-Access-Requests, Patch-For-Review, SecTeam-Processed, Security: New production ssh key for sbassett - https://phabricator.wikimedia.org/T304319 (sbassett) @jbond - Just tested on deployment.eqiad; works. Thanks! I'll make this public now. 
[16:19:23] SRE, SRE-Access-Requests, Patch-For-Review, SecTeam-Processed, Security: New production ssh key for sbassett - https://phabricator.wikimedia.org/T304319 (sbassett) p:Triage→Lowest [16:20:19] (PS1) Lucas Werkmeister (WMDE): Add display to wbsearchentities response even if empty [extensions/Wikibase] (wmf/1.39.0-wmf.1) - https://gerrit.wikimedia.org/r/772361 (https://phabricator.wikimedia.org/T104344) [16:20:33] (PS1) Lucas Werkmeister (WMDE): Add display to wbsearchentities response even if empty [extensions/Wikibase] (wmf/1.39.0-wmf.2) - https://gerrit.wikimedia.org/r/772362 (https://phabricator.wikimedia.org/T104344) [16:20:40] jouncebot: nowandnext [16:20:40] No deployments scheduled for the next 0 hour(s) and 39 minute(s) [16:20:41] In 0 hour(s) and 39 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220321T1700) [16:20:52] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-releng-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:20:56] I think that should be enough time. 
I’ll do those backports, they should be very safe [16:21:04] (and you still have ~20mins to yell stop at me during the gate-and-submit) [16:21:14] (CR) Lucas Werkmeister (WMDE): [C: +2] "backport" [extensions/Wikibase] (wmf/1.39.0-wmf.1) - https://gerrit.wikimedia.org/r/772361 (https://phabricator.wikimedia.org/T104344) (owner: Lucas Werkmeister (WMDE)) [16:21:23] (CR) Lucas Werkmeister (WMDE): [C: +2] "backport" [extensions/Wikibase] (wmf/1.39.0-wmf.2) - https://gerrit.wikimedia.org/r/772362 (https://phabricator.wikimedia.org/T104344) (owner: Lucas Werkmeister (WMDE)) [16:21:36] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:21:48] (CR) DCausse: elasticsearch: remove custom restart handling (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: Bking) [16:23:23] (CR) AGueyte: Update Event Stream for IPInfo events (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte) [16:25:32] (CR) DCausse: "we might want to delete cookbooks/sre/elasticsearch/force-unfreeze.py and the corresponding spicerack code as well" [cookbooks] - https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: Bking) [16:29:04] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:29:04] (PS2) Jbond: admin: Add otich to analytics-private-data [puppet] - https://gerrit.wikimedia.org/r/772436 (https://phabricator.wikimedia.org/T303376) [16:29:46] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:29:58] (PS2) MMandere: site: Reimage cp1077 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/772431 (https://phabricator.wikimedia.org/T290005) [16:30:28] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:31:34] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:31:55] (CR) Jbond: [C: +2] admin: Add otich to analytics-private-data [puppet] - https://gerrit.wikimedia.org/r/772436 (https://phabricator.wikimedia.org/T303376) (owner: Jbond) [16:34:48] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:35:42] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to Superset for OTichonova - https://phabricator.wikimedia.org/T303376 (jbond) Open→Resolved a:jbond >>! In T303376#7793393, @Ottomata wrote: > Approved. FYI from the looks of it this is ssh-less access, so no ssh-key needed. tha... 
[16:37:20] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:37:39] (CR) Vgutierrez: [C: +1] site: Reimage cp1077 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/772431 (https://phabricator.wikimedia.org/T290005) (owner: MMandere) [16:39:02] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:39:46] (PS1) Majavah: PageSplitter: check for OutputPage::getTitle() returning null [extensions/WikimediaEvents] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/772363 (https://phabricator.wikimedia.org/T304331) [16:40:01] (PS1) Majavah: PageSplitter: check for OutputPage::getTitle() returning null [extensions/WikimediaEvents] (wmf/1.39.0-wmf.1) - https://gerrit.wikimedia.org/r/772364 (https://phabricator.wikimedia.org/T304331) [16:40:15] (PS1) Majavah: PageSplitter: check for OutputPage::getTitle() returning null [extensions/WikimediaEvents] (wmf/1.39.0-wmf.2) - https://gerrit.wikimedia.org/r/772365 (https://phabricator.wikimedia.org/T304331) [16:40:53] (Merged) jenkins-bot: Add display to wbsearchentities response even if empty [extensions/Wikibase] (wmf/1.39.0-wmf.1) - https://gerrit.wikimedia.org/r/772361 (https://phabricator.wikimedia.org/T104344) (owner: Lucas Werkmeister (WMDE)) [16:41:00] ^ I’ll deploy that backport to wmf.1 and wmf.2 (after testing it on mwdebug) [16:41:16] cc hashar jnuche brennen taavi just fyi :) [16:41:24] (Merged) jenkins-bot: Add display to wbsearchentities response even if empty [extensions/Wikibase] (wmf/1.39.0-wmf.2) - https://gerrit.wikimedia.org/r/772362 (https://phabricator.wikimedia.org/T104344) (owner: Lucas Werkmeister (WMDE)) [16:41:27] Thanks! [16:41:29] ack! 
please ping me when done [16:41:35] ah, wmf.2 doesn’t exist on deploy1002 yet it seems, so not much to do there d) [16:41:37] ok [16:41:39] * :) [16:43:22] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:44:19] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.39.0-wmf.1/extensions/Wikibase/repo/: Backport: [[gerrit:772361|Add display to wbsearchentities response even if empty (T104344)]] (duration: 00m 53s) [16:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:25] T104344: Change wbsearchentities to explicitly return display terms and matched term. - https://phabricator.wikimedia.org/T104344 [16:44:27] taavi: I’m done [16:44:30] ok [16:45:17] (I assume wmf.2 will get the backport automatically when it starts rolling out in an hour) [16:45:35] Lucas_WMDE: correct! [16:45:36] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:45:44] yay [16:45:52] and thank you :) [16:46:11] SRE, ops-eqiad: db1175 not booting up - https://phabricator.wikimedia.org/T304280 (wiki_willy) a:Cmjohnson [16:46:48] releng: ok for me to backport that fix? [16:46:54] !log razzi@cumin1001 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) [16:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:39] dduvall: dancy: brennen: ^ [16:47:40] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:47:50] taavi: please do [16:47:57] +1 thank you [16:48:03] (CR) Majavah: [C: +2] PageSplitter: check for OutputPage::getTitle() returning null [extensions/WikimediaEvents] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/772363 (https://phabricator.wikimedia.org/T304331) (owner: Majavah) [16:48:05] (CR) Majavah: [C: +2] PageSplitter: check for OutputPage::getTitle() returning null [extensions/WikimediaEvents] (wmf/1.39.0-wmf.1) - https://gerrit.wikimedia.org/r/772364 (https://phabricator.wikimedia.org/T304331) (owner: Majavah) [16:48:07] (CR) Majavah: [C: +2] PageSplitter: check for OutputPage::getTitle() returning null [extensions/WikimediaEvents] (wmf/1.39.0-wmf.2) - https://gerrit.wikimedia.org/r/772365 (https://phabricator.wikimedia.org/T304331) (owner: Majavah) [16:48:24] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:49:37] SRE, SRE Observability, observability, Sustainability (Incident Followup): Unquoted URL parameter - https://phabricator.wikimedia.org/T304323 (fgiunchedi) I'll take a stab at fixing this! 
[16:49:58] (Merged) jenkins-bot: PageSplitter: check for OutputPage::getTitle() returning null [extensions/WikimediaEvents] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/772363 (https://phabricator.wikimedia.org/T304331) (owner: Majavah) [16:50:01] (Merged) jenkins-bot: PageSplitter: check for OutputPage::getTitle() returning null [extensions/WikimediaEvents] (wmf/1.39.0-wmf.1) - https://gerrit.wikimedia.org/r/772364 (https://phabricator.wikimedia.org/T304331) (owner: Majavah) [16:50:04] that was quick [16:50:13] (Merged) jenkins-bot: PageSplitter: check for OutputPage::getTitle() returning null [extensions/WikimediaEvents] (wmf/1.39.0-wmf.2) - https://gerrit.wikimedia.org/r/772365 (https://phabricator.wikimedia.org/T304331) (owner: Majavah) [16:50:22] nice. [16:51:32] (CR) Arturo Borrero Gonzalez: [C: +1] "By mater of a coincidence, I was reviewing gitlab docs the other day, and this indeed seems to be all what's necessary to enable the agent" [puppet] - https://gerrit.wikimedia.org/r/767249 (https://phabricator.wikimedia.org/T283894) (owner: Brennen Bearnes) [16:51:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [16:51:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [16:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [16:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [16:52:00] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 
Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:40] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:52:52] RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:53:22] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:53:44] !log taavi@deploy1002 Synchronized php-1.38.0-wmf.26/extensions/WikimediaEvents/includes/PageSplitter/PageSplitterHooks.php: Backport: [[gerrit:772363|PageSplitter: check for OutputPage::getTitle() returning null (T304331)]] (duration: 00m 51s) [16:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:48] T304331: PageSplitterHooks: Error: Call to a member function exists() on null - https://phabricator.wikimedia.org/T304331 [16:54:10] PROBLEM - puppet last run on cp6009 is CRITICAL: CRITICAL: Puppet last ran 4 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:55:43] !log taavi@deploy1002 Synchronized php-1.39.0-wmf.1/extensions/WikimediaEvents/includes/PageSplitter/PageSplitterHooks.php: Backport: [[gerrit:772364|PageSplitter: check for OutputPage::getTitle() returning null (T304331)]] (duration: 00m 50s) [16:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:01] (03PS1) 10Jgiannelos: maps: Temporarily disable OSM sync for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/772442 [16:56:06] RECOVERY - ats-tls HTTPS wikiworkshop.org ECDSA on cp6009 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 302634 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org 
(ECDSA) valid until 2022-06-03 16:37:41 +0000 (expires in 73 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:56:57] ok I think I'm done [16:57:05] (03CR) 10MSantos: [C: 03+1] maps: Temporarily disable OSM sync for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/772442 (owner: 10Jgiannelos) [16:57:11] thanks taavi! [16:57:28] RECOVERY - ats-tls HTTPS wikiworkshop.org RSA on cp6009 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 302552 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2022-06-03 16:37:41 +0000 (expires in 73 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:58:24] !log trainsperiment (T300203): blockers currently cleared, will hold wmf.1 -> group2 until 18:00 UTC, per deployment calendar [16:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:28] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [16:59:12] (03PS1) 10Razzi: dbproxy: depool clouddb1018 [puppet] - 10https://gerrit.wikimedia.org/r/772443 (https://phabricator.wikimedia.org/T302233) [17:00:05] ryankemper: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220321T1700). [17:00:30] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:00:31] Hi! We have a puppet patch for maps https://gerrit.wikimedia.org/r/c/operations/puppet/+/772442. Can somebody help us with merging? hnowlan (who usually works on puppet+maps) is out. 
mbsantos already gave +1 [17:01:02] RECOVERY - puppet last run on cp6009 is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:01:34] 10SRE, 10SRE-Access-Requests: Requesting access to releaser for MarkAHershberger - https://phabricator.wikimedia.org/T302287 (10KFrancis) @jbond It doesn't look like we have one on file. @MarkAHershberger Thanks for providing your email address. In order to process this request, I will also need your mailing... [17:02:44] PROBLEM - Confd vcl based reload on cp6009 is CRITICAL: reload-vcl failed to run since 0h, 7 minutes. https://wikitech.wikimedia.org/wiki/Varnish [17:04:18] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:05:38] RECOVERY - Confd vcl based reload on cp6009 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [17:07:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [17:07:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [17:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T298557)', diff saved to https://phabricator.wikimedia.org/P22902 and previous config saved to /var/cache/conftool/dbconfig/20220321-170731-marostegui.json [17:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:35] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [17:09:10] RECOVERY - BFD status on cr4-ulsfo is OK: OK: 
UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:09:28] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:10:00] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:12:20] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:13:03] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.107`. Pre-deploy tests passing on canary `wdqs1003` [17:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:26] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:14:23] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@2b67de7]: 0.3.107 [17:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:05] !log [WDQS Deploy] Tests passing following deploy of `0.3.107` on canary `wdqs1003`; proceeding to rest of fleet [17:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:43] nemo-yiannis: I can help you merge that after this WDQS deploy completes [17:15:54] thanks ryankemper ! 
[17:15:58] RECOVERY - SSH on thumbor2003.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:16:11] (CR) JMeybohm: [C: -1] Add helm charts and a helmfile configuration for datahub (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis) [17:16:11] SRE, SRE-Access-Requests: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (thcipriani) >>! In T302231#7730446, @Legoktm wrote: >>>! In T302231#7729265, @Urbanecm wrote: >>>>! In T302231#7729155, @Ladsgroup wrote: >>> Well, I think we should be more inclusive i... [17:16:38] (CR) Razzi: [C: +2] dbproxy: depool clouddb1018 [puppet] - https://gerrit.wikimedia.org/r/772443 (https://phabricator.wikimedia.org/T302233) (owner: Razzi) [17:20:08] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 22 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:20:22] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:22:49] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@2b67de7]: 0.3.107 (duration: 08m 26s) [17:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:36] (CR) JMeybohm: [C: -1] "I would also recommend you add some fixtures which are basically values files the chart will be rendered with in CI to get some test cover" [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis) [17:25:01] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:25:51] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [17:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:54] !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [17:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:00] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [17:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:09] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:27:10] (03CR) 10Ryan Kemper: "Deploying this per Yiannis' request in #wikimedia-operations" [puppet] - 10https://gerrit.wikimedia.org/r/772442 (owner: 10Jgiannelos) [17:27:17] (03CR) 10Ryan Kemper: [C: 03+2] maps: Temporarily disable OSM sync for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/772442 (owner: 10Jgiannelos) [17:30:29] 10SRE, 10SRE-Access-Requests, 10SRE-OnFire, 10WMF-Legal: Grant Zabe access to the T302047 gdoc incident report - https://phabricator.wikimedia.org/T302163 (10jbond) p:05Triage→03Medium [17:31:26] !log [Maps] Ran puppet agent on maps master `maps1009` to verify puppet patch works; looks like osm import was disabled as intended `Notice: /Stage[main]/Osm::Imposm3/Systemd::Service[imposm]/Service[imposm]/ensure: ensure changed 'running' to 'stopped'` [17:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:19] !log [Maps] Running puppet agent on rest of 
`maps*`: `ryankemper@cumin1001:~$ sudo -E cumin -b 4 'maps*' 'run-puppet-agent'` [17:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:44] nemo-yiannis: okay puppet agent runs will be done on maps in a minute or two [17:32:59] sounds good, thanks [17:34:15] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:35:24] !log `sudo maintain-views --all-databases --replace-all --table flaggedrevs` on clouddb1018 after same command without `--table` argument timed out waiting for `zhwiki_p.page` [17:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:42] (maps puppet agent runs are done) [17:35:49] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:36:00] 10SRE, 10SRE-Access-Requests: Requesting access to releaser for MarkAHershberger - https://phabricator.wikimedia.org/T302287 (10MarkAHershberger) Ms. Francis, I haven't moved since I was a WMF employee about 10 years ago, but, in any case: Mark A. Hershbe... 
[17:36:52] (03PS1) 10Razzi: dbproxy: repool clouddb1018 [puppet] - 10https://gerrit.wikimedia.org/r/772447 (https://phabricator.wikimedia.org/T302233) [17:37:54] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@2b67de7] (wcqs): Deploy 0.3.107 to WCQS [17:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:16] !log [WCQS Deploy] Tests look good following deploy of `0.3.107` to canary `wcqs1002.eqiad.wmnet`, proceeding to rest of fleet [17:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:08] (03CR) 10Razzi: [C: 03+2] dbproxy: repool clouddb1018 [puppet] - 10https://gerrit.wikimedia.org/r/772447 (https://phabricator.wikimedia.org/T302233) (owner: 10Razzi) [17:39:51] I think Mark Hershberger just doxed himself. [17:39:55] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:40:05] 10SRE, 10SRE-Access-Requests: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10jbond) [17:40:06] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@2b67de7] (wcqs): Deploy 0.3.107 to WCQS (duration: 02m 12s) [17:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:23] !log [WCQS Deploy] Test query passed on commons-query.wikimedia.org; WCQS deploy complete [17:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:05] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:43:55] 10SRE, 10SRE-Access-Requests: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10jbond) [17:44:01] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:44:08] 10SRE, 10SRE-Access-Requests: [WIP] 
Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10jbond) @TThoabala can you explain what access you require. i.e. which commands do yu expect to run and from where @KFrancis could you please help confirm or arrange an NDA fo... [17:46:40] !log `sudo maintain-views --all-databases --replace-all --table flaggedrevs` on clouddb1014 [17:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:31] !log `sudo maintain-views --all-databases --replace-all --table flaggedrevs` on clouddb1018 for T302233 [17:49:33] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:36] T302233: Toolforge db: View 'fiwiki_p.flaggedrevs' references invalid table/column/rights to use them - https://phabricator.wikimedia.org/T302233 [17:49:44] (03PS1) 10Filippo Giunchedi: nagios: quote check_http url/string parameters [puppet] - 10https://gerrit.wikimedia.org/r/772448 (https://phabricator.wikimedia.org/T304323) [17:49:56] !log `sudo maintain-views --all-databases --replace-all --table flaggedrevs` on clouddb1013 for T302233 [17:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:39] 10SRE, 10SRE-Access-Requests: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10jbond) p:05Triage→03Medium [17:51:03] !log `sudo maintain-views --all-databases --replace-all --table flaggedrevs` on clouddb1017 for T302233 [17:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:08] (03CR) 10Filippo Giunchedi: [C: 03+1] "Per comment thread" [puppet] - 10https://gerrit.wikimedia.org/r/771610 (owner: 10Ssingh) [17:53:36] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1025.eqiad.wmnet 
with OS bullseye [17:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:49] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bu... [17:55:56] !log otto@deploy1002 Started deploy [analytics/refinery@2175d63] (hadoop-test): gobblin prometheus metrics for all jobs - T294420 [17:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:00] T294420: Send some existing Gobblin metrics to prometheus - https://phabricator.wikimedia.org/T294420 [17:57:07] !log `sudo maintain-views --all-databases --replace-all --table flaggedrevs` on clouddb1016 for T302233 [17:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:11] T302233: Toolforge db: View 'fiwiki_p.flaggedrevs' references invalid table/column/rights to use them - https://phabricator.wikimedia.org/T302233 [17:59:25] !log `sudo maintain-views --all-databases --replace-all --table flaggedrevs` on clouddb1020 for T302233 [17:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:56] !log `sudo maintain-views --all-databases --replace-all --table flaggedrevs` on clouddb1021 for T302233 [17:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] dancy, hashar, brennen, dduvall, jeena, and jnuche: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) 🚂🧪Trainsperiment Week Deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220321T1800). [18:00:04] dancy, hashar, brennen, dduvall, jeena, and jnuche: #bothumor I � Unicode. All rise for 🚂🧪Trainsperiment Week Deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220321T1800). 
[18:00:04] dancy, hashar, brennen, dduvall, jeena, and jnuche: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for 🚂🧪Trainsperiment Week Deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220321T1800). [18:00:27] o/ [18:00:37] logs seem clean, will be rolling forward shortly. [18:02:39] (03PS1) 10Brennen Bearnes: all wikis to 1.39.0-wmf.1 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772450 [18:02:41] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.39.0-wmf.1 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772450 (owner: 10Brennen Bearnes) [18:03:15] !log otto@deploy1002 Finished deploy [analytics/refinery@2175d63] (hadoop-test): gobblin prometheus metrics for all jobs - T294420 (duration: 07m 19s) [18:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:19] T294420: Send some existing Gobblin metrics to prometheus - https://phabricator.wikimedia.org/T294420 [18:03:26] (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.1 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772450 (owner: 10Brennen Bearnes) [18:04:37] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.1 refs T300203 [18:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:41] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [18:05:25] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:05:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298557)', diff saved to https://phabricator.wikimedia.org/P22903 and previous config saved to /var/cache/conftool/dbconfig/20220321-180526-marostegui.json [18:05:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:31] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [18:05:49] (03PS1) 10RLazarus: miscweb: Restore envoy image_version to the inherited default [deployment-charts] - 10https://gerrit.wikimedia.org/r/772451 (https://phabricator.wikimedia.org/T300324) [18:06:27] 10SRE, 10ops-esams, 10DC-Ops: ripe-atlas-esams down - https://phabricator.wikimedia.org/T303242 (10wiki_willy) a:03RobH [18:06:33] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:12:05] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:13:47] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:14:59] (03Abandoned) 10Razzi: karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/770605 (https://phabricator.wikimedia.org/T301565) (owner: 10Razzi) [18:15:08] !log otto@deploy1002 Started deploy [analytics/refinery@2175d63]: gobblin prometheus metrics for all jobs - T294420 [18:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:14] T294420: Send some existing Gobblin metrics to prometheus - https://phabricator.wikimedia.org/T294420 [18:15:39] hi, I would like to revert a previous revert, do I need to submit a new patch? 
[18:16:03] (in the backport window two hours later [18:16:39] koi: Every deployment needs a patch to deploy :) [18:16:39] koi: you can click the revert button on the revert [18:16:41] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:16:42] But yes [18:17:46] (03PS1) 10Stang: Revert "Revert "wgCrossSiteAJAXdomains: Add foundationwiki and {ee,ge,punjabi}wikimedia"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772466 [18:17:59] got it, thanks [18:19:36] !log trainsperiment (T300203): 1.39.0-wmf.1 on all wikis; starting prep of wmf.2, will abort if needed [18:19:37] (03PS1) 10Jgiannelos: maps: Re-enable OSM sync for on eqiad master [puppet] - 10https://gerrit.wikimedia.org/r/772453 [18:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:40] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [18:19:49] !log otto@deploy1002 Finished deploy [analytics/refinery@2175d63]: gobblin prometheus metrics for all jobs - T294420 (duration: 04m 41s) [18:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:57] (03CR) 10jerkins-bot: [V: 04-1] maps: Re-enable OSM sync for on eqiad master [puppet] - 10https://gerrit.wikimedia.org/r/772453 (owner: 10Jgiannelos) [18:20:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P22904 and previous config saved to /var/cache/conftool/dbconfig/20220321-182032-marostegui.json [18:20:32] (03PS2) 10Jgiannelos: maps: Re-enable OSM sync for on eqiad master [puppet] - 10https://gerrit.wikimedia.org/r/772453 [18:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:25] (03CR) 10Jgiannelos: "Re-enabling OSM replication on codfw after re-indexing" [puppet] 
- 10https://gerrit.wikimedia.org/r/772453 (owner: 10Jgiannelos) [18:21:40] (03PS1) 10Brennen Bearnes: testwikis wikis to 1.39.0-wmf.2 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772454 [18:21:42] (03CR) 10Brennen Bearnes: [C: 03+2] testwikis wikis to 1.39.0-wmf.2 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772454 (owner: 10Brennen Bearnes) [18:22:22] (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.2 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772454 (owner: 10Brennen Bearnes) [18:22:23] !log brennen@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.2 refs T300203 [18:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:53] 10SRE, 10SRE-Access-Requests: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10KFrancis) @jbond As TsepoThoabala is an employee of the WMF (tthoabala@wikimedia.org), no separate NDA is needed. [18:24:10] (03PS4) 10Eigyan: [wmf-config]: Deploy Safety Survey to EN, ES wikis on BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) [18:24:26] 10SRE, 10SRE-Access-Requests: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10jbond) >>! In T303398#7793987, @KFrancis wrote: > @jbond As TsepoThoabala is an employee of the WMF (tthoabala@wikimedia.org), no separate NDA is needed. Doh! sorry i missed... [18:25:23] 10SRE, 10SRE-Access-Requests: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10jbond) [18:25:51] 10SRE, 10SRE-Access-Requests: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10KFrancis) @jbond No worries! 
:-) [18:26:00] (PS1) Ottomata: gobblin - use gobblin-wmf-core-1.0.1 in analytics-hadoop [puppet] - https://gerrit.wikimedia.org/r/772456 (https://phabricator.wikimedia.org/T297939) [18:26:38] (CR) Ottomata: [V: +2 C: +2] gobblin - use gobblin-wmf-core-1.0.1 in analytics-hadoop [puppet] - https://gerrit.wikimedia.org/r/772456 (https://phabricator.wikimedia.org/T297939) (owner: Ottomata) [18:28:13] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:31:03] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:32:31] SRE, SRE-Access-Requests: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (jbond) @Tchanders i have checked and you are in the `deployment` im guessing i should configure @TThoabala with the same access? [18:34:03] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:34:29] (03CR) 10Ebernhardson: [C: 03+1] [wdqs] test jvmquake options on the public cluster [puppet] - 10https://gerrit.wikimedia.org/r/770978 (https://phabricator.wikimedia.org/T293862) (owner: 10DCausse) [18:35:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P22905 and previous config saved to /var/cache/conftool/dbconfig/20220321-183537-marostegui.json [18:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:42] 10SRE, 10Performance-Team, 10Traffic, 10serviceops: Potential navtiming_responseStart regression as of 13 Mar 2022 - https://phabricator.wikimedia.org/T303782 (10Krinkle) a:03Krinkle [18:36:28] 10SRE, 10SRE-Access-Requests, 10Security: Requesting access to releaser for MarkAHershberger - https://phabricator.wikimedia.org/T302287 (10sbassett) [18:37:57] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:40:13] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:40:49] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:45:33] (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:45:46] 10SRE, 10SRE-Access-Requests, 10Security: Requesting access to releaser for MarkAHershberger - https://phabricator.wikimedia.org/T302287 (10MarkAHershberger) Thanks for handling this so quickly. I thought I had opened an email to Katie and was replying to it. 
[18:47:12] SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (Papaul) Base on the list @Andrew provided me on IRC (1016, 1017, 1019, 1022, 1023) he was able to re-image those hosts and base o... [18:49:07] SRE, Maps: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (MSantos) a:JMinor→None [18:49:15] SRE, Maps: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (MSantos) a:MSantos [18:50:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298557)', diff saved to https://phabricator.wikimedia.org/P22906 and previous config saved to /var/cache/conftool/dbconfig/20220321-185042-marostegui.json [18:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:47] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [18:51:07] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:54:14] !log T303548 start commonswiki reindexing on eqiad codfw and cloudelastic cirrus clusters [18:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:19] T303548: CirrusSearchIndexTooOld - https://phabricator.wikimedia.org/T303548 [18:54:33] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:57:03] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:57:32] (03PS1) 10MSantos: maps: allow bbcrewind to access maps public urls [puppet] - 10https://gerrit.wikimedia.org/r/772462 (https://phabricator.wikimedia.org/T297968) [19:05:12] (03PS1) 10SBassett: Revert "Set StopForumSpam to enforce on the beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772468 [19:05:39] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:06:58] (03PS2) 10SBassett: Revert "Set StopForumSpam to enforce on the beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772468 (https://phabricator.wikimedia.org/T304111) [19:07:32] (03CR) 10SBassett: [C: 03+2] Revert "Set StopForumSpam to enforce on the beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772468 (https://phabricator.wikimedia.org/T304111) (owner: 10SBassett) [19:08:13] (03Merged) 10jenkins-bot: Revert "Set StopForumSpam to enforce on the beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772468 (https://phabricator.wikimedia.org/T304111) (owner: 10SBassett) [19:11:13] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:11:23] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:16:16] (03CR) 10Eigyan: "Awight and Jdlrobson can you share your thoughts on this patch and whether it is ok for the eswiki percieved-performance-survey to be over" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) (owner: 10Eigyan) [19:19:17] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: session-259166.scope,session-259172.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:19:34] (03CR) 10RLazarus: [C: 03+2] miscweb: Restore envoy image_version to the inherited default [deployment-charts] - 10https://gerrit.wikimedia.org/r/772451 (https://phabricator.wikimedia.org/T300324) (owner: 10RLazarus) [19:22:39] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:23:12] (03Merged) 10jenkins-bot: miscweb: Restore envoy image_version to the inherited default [deployment-charts] - 10https://gerrit.wikimedia.org/r/772451 (https://phabricator.wikimedia.org/T300324) (owner: 10RLazarus) [19:26:07] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:26:57] !log brennen@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.2 refs T300203 (duration: 64m 33s) [19:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:01] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [19:28:39] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:31:53] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:32:05] 10SRE, 10SRE-Access-Requests: Requesting access to releaser for MarkAHershberger - https://phabricator.wikimedia.org/T302287 (10Zabe) [19:32:27] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:34:23] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:36:27] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: session-259166.scope,session-259172.scope,session-259184.scope,user@38373.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:36:37] PROBLEM - PHP opcache health on mw1417 is CRITICAL: CRITICAL: opcache full on php 7.2. 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:37:32] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting shell access for Brian King - https://phabricator.wikimedia.org/T297910 (10bking) [19:38:05] PROBLEM - PHP opcache health on mw1448 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:38:09] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:38:13] PROBLEM - PHP opcache health on mw1449 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:40:08] (03PS1) 10Brennen Bearnes: group0 wikis to 1.39.0-wmf.2 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772494 [19:40:09] (03CR) 10Brennen Bearnes: [C: 03+2] group0 wikis to 1.39.0-wmf.2 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772494 (owner: 10Brennen Bearnes) [19:40:27] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:41:00] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.2 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772494 (owner: 10Brennen Bearnes) [19:42:11] PROBLEM - PHP opcache health on mw1415 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:42:15] PROBLEM - PHP opcache health on mw1414 is CRITICAL: CRITICAL: opcache full on php 7.2. 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:42:29] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.2 refs T300203 [19:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:33] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [19:42:57] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:43:18] hopefully the php-fpm restarts at the end of the scap sync-wikiversions took care of opcache fullness... [19:45:03] PROBLEM - PHP opcache health on mw1416 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:45:39] (03CR) 10Awight: [C: 03+1] "The messages are already available, which is great. Survey configurations look good to me." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) (owner: 10Eigyan) [19:46:13] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:48:24] !log mw1416: sudo -i /usr/local/sbin/restart-php7.2-fpm [19:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:38] thcipriani, brennen: around in case you need anything :) your show though [19:50:41] 10SRE-Access-Requests: Requesting access to stat1007 for sgimeno - https://phabricator.wikimedia.org/T304361 (10Sgs) [19:50:45] RECOVERY - PHP opcache health on mw1416 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:51:43] (03CR) 10Jdlrobson: [wmf-config]: Deploy Safety Survey to EN, ES wikis on BETA (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) (owner: 10Eigyan) [19:54:47] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:56:31] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: session-259166.scope,session-259172.scope,session-259184.scope,session-259210.scope,user@38373.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:58:33] PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:00:04] RoanKattouw and Urbanecm: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220321T2000). [20:00:04] zabe and koi: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:17] o/ [20:00:19] o/ [20:00:57] I can deploy today [20:01:19] urbanecm: go ahead; we'll hold next train rollout 'til after window [20:01:26] thank you brennen [20:01:32] and fingers crossed on the experiment [20:01:39] (trainsperiment? or whatever it's called officially) [20:01:50] :) [20:03:27] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:04:37] koi: hello, I'm sorry, but i will not deploy your patch today, as a) changes to AJAx allowlist are generally considered sensitive (and aren't deployed w/o a +1; a +1 is also needed from someone on the secteam) b) previous version of the patch was reverted because it "didn't work" (quoting commit message) [20:05:38] (03CR) 10Urbanecm: [C: 04-1] "current code sets a variable twice" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771963 (https://phabricator.wikimedia.org/T304185) (owner: 10Zabe) [20:06:17] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:06:20] (03CR) 10Urbanecm: Migrate reads from wmfDbconfigFromEtcd to wmgDbconfigFromEtcd (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768256 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:06:28] zabe: left a comment on both your patches -- can you have a look? 
[20:07:36] urbanecm: I actually means "not work for all scripts" at that time, but yes, I will ask someone from security team to have a review again [20:08:05] koi: yeah, at this stage (patch, revert, re-revert), I'd really prefer to have a recorded discussion at the ticket with "let's redeploy" as a consensus :) [20:08:20] (03CR) 10Zabe: ExtensionDistributor: Add REL1_38 (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771963 (https://phabricator.wikimedia.org/T304185) (owner: 10Zabe) [20:08:49] got it, anyway, thanks urbanecm [20:08:49] oh [20:09:12] zabe: my bad reading. wgExtDistDefaultSnapshot and wgExtDistCandidateSnapshot is so similar in my eyes [20:10:06] (03CR) 10Urbanecm: ExtensionDistributor: Add REL1_38 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771963 (https://phabricator.wikimedia.org/T304185) (owner: 10Zabe) [20:10:43] (03CR) 10Stang: [C: 04-1] "on hold" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772466 (owner: 10Stang) [20:10:59] (03PS2) 10Urbanecm: ExtensionDistributor: Add REL1_38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771963 (https://phabricator.wikimedia.org/T304185) (owner: 10Zabe) [20:11:30] (03PS4) 10Zabe: Migrate away from $wmfDbconfigFromEtcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768256 (https://phabricator.wikimedia.org/T45956) [20:11:43] (03CR) 10Urbanecm: [C: 03+2] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771963 (https://phabricator.wikimedia.org/T304185) (owner: 10Zabe) [20:11:51] (03CR) 10Zabe: Migrate away from $wmfDbconfigFromEtcd (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768256 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:12:06] urbanecm, happens :) [20:12:30] (03Merged) 10jenkins-bot: ExtensionDistributor: Add REL1_38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771963 (https://phabricator.wikimedia.org/T304185) (owner: 10Zabe) [20:12:49] better to mention a wrong issue 
than to not mention a good issue :) [20:14:23] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:15:15] noting that we're holding the train on T304353 [20:15:15] T304353: PHP Warning: preg_match() expects parameter 2 to be string, array given - https://phabricator.wikimedia.org/T304353 [20:15:40] * urbanecm is having some internet connection issues :( [20:16:18] should be resolved now [20:16:21] zabe: your patch is at mwdebug1001 [20:16:23] please test [20:17:57] urbanecm, lgtm, I can select 1.38 at Special:ExtensionDistributor as 'next stable candidate', but 1.37 is still the default selection [20:18:06] so, let's sync i guess? [20:18:11] yes [20:18:13] doing [20:19:17] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 8347de5: ExtensionDistributor: Add REL1_38 (T304185) (duration: 00m 51s) [20:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:22] T304185: Add REL1_38 to ExtensionDistributor as development snapshot - https://phabricator.wikimedia.org/T304185 [20:19:36] zabe: live [20:20:45] (03CR) 10Urbanecm: [C: 03+2] Migrate away from $wmfDbconfigFromEtcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768256 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:20:49] (03PS5) 10Urbanecm: Migrate away from $wmfDbconfigFromEtcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768256 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:20:56] (03CR) 10Urbanecm: [C: 03+2] Migrate away from $wmfDbconfigFromEtcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768256 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:21:41] PROBLEM - SSH on thumbor2003.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:21:49] (03Merged) 10jenkins-bot: Migrate away from $wmfDbconfigFromEtcd 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/768256 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:22:58] zabe: pulled to mwdebug1001. can you test? [20:23:01] yep [20:23:07] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:24:18] and also, it looks the changes can be synced in any order -- is that right? [20:25:10] urbanecm, lgtm, logstash is clear and yes they can be synced in any order [20:25:25] thanks for confirming that. in that case, syncing [20:26:23] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:28:59] !log urbanecm@deploy1002 Synchronized wmf-config/etcd.php: 3bcccdc: Migrate away from $wmfDbconfigFromEtcd (T45956; 1/2) (duration: 00m 50s) [20:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:15] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:29:18] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [20:29:50] !log urbanecm@deploy1002 Synchronized docroot/noc/db.php: 3bcccdc: Migrate away from $wmfDbconfigFromEtcd (T45956; 2/2) (duration: 00m 50s) [20:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:34] zabe: should be all done [20:30:44] and with koi's patch postponed, i think the window's done too [20:30:49] thanks [20:30:53] hth [20:30:54] thanks [20:30:59] no problem koi [20:31:06] !log UTC late backport window completed [20:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:21] brennen: in case you have anything trainy to do, floor is yours [20:34:35] urbanecm: thanks! on a different note, any thoughts on T304353? 
[20:34:35] T304353: PHP Warning: preg_match() expects parameter 2 to be string, array given - https://phabricator.wikimedia.org/T304353 [20:34:37] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:35:07] brennen: well, looks i have an UBN in my own code :) [20:35:10] I'll have a look [20:35:18] ty! [20:35:19] thanks for the ping [20:36:39] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: session-259166.scope,session-259172.scope,session-259184.scope,session-259210.scope,user@38373.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:40:15] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:40:23] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:41:29] 10SRE, 10SRE-Access-Requests: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10Tchanders) @jbond That would be great - thank you. @TThoabala is now on leave for a few months, so would it be more convenient to stall this until he returns? [20:43:09] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:43:11] urbanecm: i think we're probably going to remove that as a blocker; we let it slip through in wmf.1 and rolling wmf.2 forward won't change the error rate any. [20:43:27] brennen: ack ack. thanks. 
[20:43:37] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:43:41] urbanecm: if there are patches I and jnuche are here tomorrow morning (european tz relative) [20:44:49] hashar: thanks for letting me know. Does that mean I should _not_ be self-backporting the fix (once it exists) for some reason? [20:45:33] urbanecm: oh you can. But it must be late for you isn't it ? [20:46:01] (03PS1) 10Dduvall: group1 wikis to 1.39.0-wmf.2 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772501 [20:46:03] (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.39.0-wmf.2 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772501 (owner: 10Dduvall) [20:47:06] hashar: tomorrow European morning? I'm in UTC+1, so it's the morning windows are actual mornings for me :) [20:47:34] same here :] [20:47:37] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.2 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772501 (owner: 10Dduvall) [20:49:01] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:49:04] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.2 refs T300203 [20:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:09] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [20:49:21] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:49:55] !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.2 refs T300203 (duration: 00m 51s) [20:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:22] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye [20:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:42] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with O... 
[20:53:13] (03PS1) 10Dduvall: all wikis to 1.39.0-wmf.2 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772503 [20:53:15] (03CR) 10Dduvall: [C: 03+2] all wikis to 1.39.0-wmf.2 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772503 (owner: 10Dduvall) [20:54:54] (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.2 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772503 (owner: 10Dduvall) [20:55:39] I'd like to do one more noisy envoy upgrade, whenever deployments are in a state where nobody will mind me spamming in here for a little bit :) [20:56:18] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.2 refs T300203 [20:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:22] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [20:56:26] rzl: give us ~5m? [20:56:30] sure thing [21:00:04] Reedy and sbassett: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220321T2100). [21:03:52] !log pt1979@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1024.eqiad.wmnet with OS bullseye [21:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:02] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bu... 
[21:10:00] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye [21:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:11] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with O... [21:11:14] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: session-259166.scope,session-259172.scope,session-259184.scope,session-259210.scope,user@38373.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:11:34] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:12:08] 10SRE, 10ops-eqiad: db1175 not booting up - https://phabricator.wikimedia.org/T304280 (10Cmjohnson) 05Open→03Resolved db1175 had a correctable DIMM error, the issue during the reboot is that the DIMM was A1 and that caused it to fail during post. I had to take the server down to 2 DIMM to figure that out,... [21:15:06] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:16:44] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:18:48] (03PS5) 10Eigyan: [wmf-config]: Deploy Safety Survey to EN, ES wikis on BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) [21:19:16] brennen: am I okay to go ahead? 
[21:21:14] (03CR) 10jerkins-bot: [V: 04-1] [wmf-config]: Deploy Safety Survey to EN, ES wikis on BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) (owner: 10Eigyan) [21:23:05] (03PS6) 10Eigyan: [wmf-config]: Deploy Safety Survey to EN, ES wikis on BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) [21:24:33] (03CR) 10Eigyan: [wmf-config]: Deploy Safety Survey to EN, ES wikis on BETA (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) (owner: 10Eigyan) [21:24:36] 10SRE, 10Analytics, 10Data-Engineering: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (10CDanis) [21:24:51] going ahead :) [21:24:57] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/apertium: apply [21:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:22] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/apertium: apply [21:25:23] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/blubberoid: apply [21:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:55] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: apply [21:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:39] (03PS1) 10Jdrewniak: Enable EventGate logging for WikipediaPortal schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772507 (https://phabricator.wikimedia.org/T271163) [21:26:48] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [21:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:02] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 
UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:27:08] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: session-259166.scope,session-259172.scope,session-259184.scope,session-259210.scope,user@38373.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:27:19] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [21:27:20] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [21:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:12] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [21:28:13] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [21:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:34] (03CR) 10CDanis: maps: allow bbcrewind to access maps public urls (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/772462 (https://phabricator.wikimedia.org/T297968) (owner: 10MSantos) [21:29:54] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:30:38] (03PS11) 10Bking: elasticsearch: remove custom restart handling [cookbooks] - 10https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) [21:30:38] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [21:30:40] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [21:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:43] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [21:31:44] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [21:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:48] (03PS2) 10Ryan Kemper: Add Cumin alias for search-loader [puppet] - 10https://gerrit.wikimedia.org/r/772326 (https://phabricator.wikimedia.org/T258189) (owner: 10Muehlenhoff) [21:33:15] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [21:33:16] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [21:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:51] (03CR) 10Jdlrobson: [wmf-config]: Deploy Safety Survey to EN, ES wikis on BETA (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) (owner: 10Eigyan) [21:34:08] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:34:44] (03PS12) 10Bking: elasticsearch: remove custom restart handling [cookbooks] - 10https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) [21:34:53] (03CR) 10Bking: [C: 03+2] elasticsearch: remove custom restart handling [cookbooks] - 10https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: 10Bking) [21:35:08] (03CR) 10Bking: [V: 03+2 C: 03+2] elasticsearch: remove custom restart handling [cookbooks] - 10https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: 10Bking) [21:36:43] (03CR) 10Ryan Kemper: [C: 03+2] Add Cumin alias for search-loader [puppet] - 10https://gerrit.wikimedia.org/r/772326 (https://phabricator.wikimedia.org/T258189) (owner: 10Muehlenhoff) [21:36:51] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [21:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:29] (03CR) 10Bking: elasticsearch: remove custom restart handling (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: 10Bking) [21:39:32] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: session-259166.scope,session-259172.scope,session-259184.scope,session-259210.scope,user@38373.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:40:29] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mathoid: apply [21:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:11] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [21:41:12] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [21:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:16] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:32] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:41:46] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:41:53] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [21:41:54] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply [21:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:12] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:43:22] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [21:43:23] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [21:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:22] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [21:44:23] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [21:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:01] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [21:45:02] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [21:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:44] !log 
bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955 [21:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:48] T301955: Upgrade relforge to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301955 [21:46:08] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [21:46:09] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply [21:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:38] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [21:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:03] rzl: sorry i missed yr ping earlier. too much windows & tabs & buffers open. :) [21:47:13] haha no worries [21:48:36] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:52:56] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 49 threshold =0.15 breach: number_of_data_nodes: 2, task_max_waiting_in_queue_millis: 0, timed_out: False, number_of_pending_tasks: 0, active_primary_shards: 163, relocating_shards: 0, cluster_name: relforge-eqiad, active_shards: 244, number_of_in_flight_fetch: 0, status: yellow, initializing_shards: 0, active_shar [21:52:56] nt_as_number: 83.27645051194538, delayed_unassigned_shards: 0, number_of_nodes: 2, unassigned_shards: 49 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:53:12] (PS1) Reedy: Disable user only after it has been removed from the db [extensions/OATHAuth] (wmf/1.39.0-wmf.2) - https://gerrit.wikimedia.org/r/772473 (https://phabricator.wikimedia.org/T304350) [21:53:38] (PS1) Reedy: Disable user only after it has been removed from the db [extensions/OATHAuth] (wmf/1.39.0-wmf.1) - https://gerrit.wikimedia.org/r/772474 (https://phabricator.wikimedia.org/T304350) [21:53:56] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 49 threshold =0.15 breach: timed_out: False, number_of_pending_tasks: 0, initializing_shards: 0, number_of_data_nodes: 2, relocating_shards: 0, active_primary_shards: 163, active_shards_percent_as_number: 83.27645051194538, status: yellow, number_of_in_flight_fetch: 0, active_shards: 244, cluster_name: relforge-eqi [21:53:56] _max_waiting_in_queue_millis: 0, delayed_unassigned_shards: 0, unassigned_shards: 49, number_of_nodes: 2 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:54:01] (CR) Reedy: [C: +2] Disable user only after it has been removed from the db [extensions/OATHAuth] (wmf/1.39.0-wmf.1) - https://gerrit.wikimedia.org/r/772474 (https://phabricator.wikimedia.org/T304350) (owner: 
Reedy) [21:54:09] (CR) Reedy: [C: +2] Disable user only after it has been removed from the db [extensions/OATHAuth] (wmf/1.39.0-wmf.2) - https://gerrit.wikimedia.org/r/772473 (https://phabricator.wikimedia.org/T304350) (owner: Reedy) [21:55:20] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:56:50] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:57:06] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: session-259166.scope,session-259172.scope,session-259184.scope,session-259210.scope,user@38373.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:58:28] Relforge alert above can be ignored, forgot to downtime [21:59:15] !log T301955 Downtimed relforge for 2 days; stuck in yellow status during upgrade b/c replica shards cannot be scheduled to a host of lower elasticsearch version than primary shards. 
Working on patch for our `rolling-operation` cookbook to disable replication during operation [21:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:21] T301955: Upgrade relforge to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301955 [22:00:00] (Merged) jenkins-bot: Disable user only after it has been removed from the db [extensions/OATHAuth] (wmf/1.39.0-wmf.1) - https://gerrit.wikimedia.org/r/772474 (https://phabricator.wikimedia.org/T304350) (owner: Reedy) [22:00:03] (Merged) jenkins-bot: Disable user only after it has been removed from the db [extensions/OATHAuth] (wmf/1.39.0-wmf.2) - https://gerrit.wikimedia.org/r/772473 (https://phabricator.wikimedia.org/T304350) (owner: Reedy) [22:00:20] jouncebot: nowandnext [22:00:20] For the next 0 hour(s) and 59 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220321T2100) [22:00:20] In 3 hour(s) and 59 minute(s): Automatic 🚂🧪Trainsperiment Week branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220322T0200) [22:03:14] !log reedy@deploy1002 Synchronized php-1.39.0-wmf.1/extensions/OATHAuth/: T304350 (duration: 00m 49s) [22:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:18] T304350: DisableOATHForUser: Error: Call to a member function getName() on null - https://phabricator.wikimedia.org/T304350 [22:04:00] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:04:10] !log reedy@deploy1002 Synchronized php-1.39.0-wmf.2/extensions/OATHAuth/: T304350 (duration: 00m 49s) [22:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:48] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:09:37] (PS3) Aaron Schulz: Simplify comments and stubs for etcd-defined DB config [mediawiki-config] - https://gerrit.wikimedia.org/r/752212 [22:11:14] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:12:36] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:16:58] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:17:16] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:20:04] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:20:36] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:23:30] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:25:30] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:26:39] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955 [22:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:43] T301955: Upgrade relforge to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301955 [22:28:22] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: number_of_in_flight_fetch: 0, number_of_data_nodes: 2, active_primary_shards: 163, unassigned_shards: 0, status: green, timed_out: False, number_of_pending_tasks: 0, active_shards: 293, number_of_nodes: 2, initializing_shards: 0, relocating_shards: 0, delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, c [22:28:22] ame: relforge-eqiad, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:28:26] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:28:44] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:29:12] !log T301955 Lifted downtime on relforge now that cluster upgrade is complete and cluster is back to green status [22:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:18] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: status: green, cluster_name: relforge-eqiad, relocating_shards: 0, initializing_shards: 0, active_primary_shards: 163, number_of_in_flight_fetch: 0, number_of_data_nodes: 2, active_shards: 293, number_of_pending_tasks: 0, active_shards_percent_as_number: 100.0, delayed_unassigned_shards: 0, task_max_waiting_in_queue_mi [22:30:18] unassigned_shards: 0, timed_out: False, number_of_nodes: 2 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:32:40] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:34:58] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:35:52] RECOVERY - SSH on thumbor2003.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:37:32] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:39:52] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:40:14] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:40:46] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:41:20] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:45:18] (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:48:28] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:48:52] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:49:24] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:51:38] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:52:54] (CR) Ottomata: [C: +1] Enable EventGate logging for WikipediaPortal schema [mediawiki-config] - https://gerrit.wikimedia.org/r/772507 (https://phabricator.wikimedia.org/T271163) (owner: Jdrewniak) [22:55:18] (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:55:36] RECOVERY - PHP opcache health on mw1417 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:57:28] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:58:00] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:01:46] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [23:03:23] SRE, SRE-Access-Requests: Requesting access to releaser for MarkAHershberger - https://phabricator.wikimedia.org/T302287 (KFrancis) Thanks @jbond and @MarkAHershberger. The NDA agreement has been sent to you for signature via DocuSign. 
[23:04:30] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [23:04:46] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:06:10] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:08:26] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:10:34] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:11:12] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:15:18] (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:19:24] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:19:54] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:20:18] (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:21:14] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:24:22] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:24:42] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:28:12] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:33:42] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:38:54] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:39:04] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:39:28] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:41:34] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:46:44] SRE, LDAP-Access-Requests: Grant Access to nda/logstash for User:TheDJ - https://phabricator.wikimedia.org/T304120 (KFrancis) @jbond Sorry if I missed this in the thread, but please confirm if an NDA is still needed here. Thanks!!! 
[23:47:20] RECOVERY - MD RAID on ganeti2013 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [23:52:58] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:55:06] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:55:48] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:56:40] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:58:18] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:58:48] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status