[00:11:39] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:51:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:53:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:24:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:01:35] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:03:59] PROBLEM - SSH on mw2252.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:10:43] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: package_builder_Clean_up_build_directory.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:49:48] (03PS1) 10Andrew Bogott: wmcs/nfs/add_server: specify mount options tuned for NFS [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/752367 [03:58:45] 10SRE, 10ops-codfw, 10Discovery-Search: Degraded RAID on elastic2035 - https://phabricator.wikimedia.org/T298853 (10Rockmikee) You can't change yesterday, but if you are too worry tomorrow, will ruin today. 
[04:06:23] RECOVERY - SSH on mw2252.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:16:37] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:20:02] PROBLEM - Host msw1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [04:23:40] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:23:56] PROBLEM - Juniper alarms on cr2-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 2 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [04:34:02] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 47.79 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [04:34:44] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 36.01 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [04:36:52] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [04:37:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org [04:38:20] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 103 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [05:33:22] PROBLEM - SSH on restbase2011.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:40:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [05:40:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [05:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T297191)', diff saved to https://phabricator.wikimedia.org/P18423 and previous config saved to /var/cache/conftool/dbconfig/20220110-054100-marostegui.json [05:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:04] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [05:41:39] (03PS8) 10Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for I9b40319d374143668a2666b42f59a3799d041afc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T298308) [05:44:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1113:3316 (T297191)', diff saved to https://phabricator.wikimedia.org/P18424 and previous config saved to /var/cache/conftool/dbconfig/20220110-054410-marostegui.json [05:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:22] PROBLEM - SSH on restbase2010.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:47:48] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_updatequerypages_mostlinked_s3@13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:48:14] (03PS1) 10Ladsgroup: Use PreparedUpdate to avoid double parse [extensions/SpamBlacklist] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/752270 (https://phabricator.wikimedia.org/T288639) [05:48:20] (03CR) 10Ladsgroup: [C: 03+2] Use PreparedUpdate to avoid double parse [extensions/SpamBlacklist] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/752270 (https://phabricator.wikimedia.org/T288639) (owner: 10Ladsgroup) [05:50:46] (03PS1) 10Marostegui: dbproxy1013: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/752375 (https://phabricator.wikimedia.org/T298586) [05:51:22] (03PS9) 10Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for I9b40319d374143668a2666b42f59a3799d041afc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T298308) [05:51:59] (03CR) 10Marostegui: [C: 03+2] dbproxy1013: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/752375 (https://phabricator.wikimedia.org/T298586) (owner: 10Marostegui) [05:52:21] (03CR) 10Winston Sung: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T298308) (owner: 10Winston Sung) [05:52:44] (03CR) 10Winston Sung: [C: 03+1] Revert "Add zh-hans and 
zh-hant translation of Module and Module_talk aliases" for I9b40319d374143668a2666b42f59a3799d041afc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T298308) (owner: 10Winston Sung) [05:53:26] (03PS1) 10Marostegui: install_server: Allow reimage for dbproxy1* hosts [puppet] - 10https://gerrit.wikimedia.org/r/752403 (https://phabricator.wikimedia.org/T298586) [05:54:16] (03CR) 10Marostegui: [C: 03+2] install_server: Allow reimage for dbproxy1* hosts [puppet] - 10https://gerrit.wikimedia.org/r/752403 (https://phabricator.wikimedia.org/T298586) (owner: 10Marostegui) [05:58:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1013.eqiad.wmnet with OS bullseye [05:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P18425 and previous config saved to /var/cache/conftool/dbconfig/20220110-055915-marostegui.json [05:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:13] (03Merged) 10jenkins-bot: Use PreparedUpdate to avoid double parse [extensions/SpamBlacklist] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/752270 (https://phabricator.wikimedia.org/T288639) (owner: 10Ladsgroup) [06:14:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P18426 and previous config saved to /var/cache/conftool/dbconfig/20220110-061420-marostegui.json [06:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:00] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.16/extensions/SpamBlacklist/includes/SpamBlacklistHooks.php: Backport: [[gerrit:752270|Use PreparedUpdate to avoid double parse (T288639)]] (duration: 01m 00s) [06:16:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:02] T288639: SpamBlacklistHooks::onEditFilterMergedContent causes every edit to be rendered twice - https://phabricator.wikimedia.org/T288639 [06:19:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [06:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:32] (03CR) 10Legoktm: "LGTM, one nit inline. Please bump this once it's safe to merge/deploy (so we don't have a failing systemd unit showing up in alerts)" [puppet] - 10https://gerrit.wikimedia.org/r/752341 (https://phabricator.wikimedia.org/T153815) (owner: 10Majavah) [06:23:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [06:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [06:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [06:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1013.eqiad.wmnet with OS bullseye [06:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:34] (03PS1) 10Marostegui: dbproxy1014: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/752528 (https://phabricator.wikimedia.org/T298586) [06:29:13] (03CR) 10Marostegui: [C: 03+2] dbproxy1014: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/752528 (https://phabricator.wikimedia.org/T298586) (owner: 10Marostegui) [06:29:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T297191)', diff saved to 
https://phabricator.wikimedia.org/P18427 and previous config saved to /var/cache/conftool/dbconfig/20220110-062925-marostegui.json [06:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [06:29:28] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [06:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [06:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T297191)', diff saved to https://phabricator.wikimedia.org/P18428 and previous config saved to /var/cache/conftool/dbconfig/20220110-062934-marostegui.json [06:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1014.eqiad.wmnet with OS bullseye [06:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T297191)', diff saved to https://phabricator.wikimedia.org/P18429 and previous config saved to /var/cache/conftool/dbconfig/20220110-063042-marostegui.json [06:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P18430 and previous config saved to /var/cache/conftool/dbconfig/20220110-064546-marostegui.json [06:45:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:16] RECOVERY - SSH on restbase2010.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:55:40] PROBLEM - DNS on ganeti1023.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.207 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:58:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1014.eqiad.wmnet with OS bullseye [06:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:33] (03PS1) 10Marostegui: Revert "dbproxy1014: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/752271 [07:00:17] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1014: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/752271 (owner: 10Marostegui) [07:00:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P18431 and previous config saved to /var/cache/conftool/dbconfig/20220110-070051-marostegui.json [07:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:42] (03PS1) 10Marostegui: wmnet: Failover m1 proxy from dbproxy1012 to dbproxy104 [dns] - 10https://gerrit.wikimedia.org/r/752532 (https://phabricator.wikimedia.org/T298586) [07:08:08] PROBLEM - DNS on mw1455.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.195 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:09:00] PROBLEM - DNS on mw1454.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.194 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:12:00] PROBLEM - DNS on mw1456.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.196 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:15:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T297191)', diff saved to https://phabricator.wikimedia.org/P18432 and previous config saved to /var/cache/conftool/dbconfig/20220110-071556-marostegui.json [07:15:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [07:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [07:16:00] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [07:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T297191)', diff saved to https://phabricator.wikimedia.org/P18433 and previous config saved to /var/cache/conftool/dbconfig/20220110-071603-marostegui.json [07:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:20] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m1 proxy from dbproxy1012 to dbproxy104 [dns] - 10https://gerrit.wikimedia.org/r/752532 (https://phabricator.wikimedia.org/T298586) (owner: 10Marostegui) [07:16:37] !log Failover m1 proxy from dbproxy1012 to dbproxy1014 T298586 [07:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:39] T298586: Upgrade all dbproxy hosts to Bullseye - https://phabricator.wikimedia.org/T298586 [07:17:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T297191)', diff saved to https://phabricator.wikimedia.org/P18434 and previous config saved to 
/var/cache/conftool/dbconfig/20220110-071711-marostegui.json [07:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P18435 and previous config saved to /var/cache/conftool/dbconfig/20220110-073216-marostegui.json [07:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:28] RECOVERY - SSH on restbase2011.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:39:32] PROBLEM - DNS on mw1453.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.193 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:47:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P18436 and previous config saved to /var/cache/conftool/dbconfig/20220110-074720-marostegui.json [07:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:30] (03PS2) 10Muehlenhoff: Switch Kunal to volunteer NDA status [puppet] - 10https://gerrit.wikimedia.org/r/751210 [07:55:04] PROBLEM - DNS on db1161.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.0.174 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:02:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T297191)', diff saved to https://phabricator.wikimedia.org/P18437 and previous config saved to /var/cache/conftool/dbconfig/20220110-080225-marostegui.json [08:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db[1155,1165].eqiad.wmnet with reason: Maintenance 
[08:02:30] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [08:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db[1155,1165].eqiad.wmnet with reason: Maintenance [08:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T297191)', diff saved to https://phabricator.wikimedia.org/P18438 and previous config saved to /var/cache/conftool/dbconfig/20220110-080236-marostegui.json [08:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T297191)', diff saved to https://phabricator.wikimedia.org/P18439 and previous config saved to /var/cache/conftool/dbconfig/20220110-080344-marostegui.json [08:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:41] !log Drop table wikishared.wikimedia_editor_tasks_targets_passed T264225 [08:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:44] T264225: Drop table wikimedia_editor_tasks_targets_passed on wmf wikis - https://phabricator.wikimedia.org/T264225 [08:18:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P18440 and previous config saved to /var/cache/conftool/dbconfig/20220110-081849-marostegui.json [08:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:04] (03CR) 10ArielGlenn: "I fixed a typo and tweaked the wording a bit. I would be fine to merge and deploy this if you agree with the changes." 
[puppet] - 10https://gerrit.wikimedia.org/r/749875 (https://phabricator.wikimedia.org/T273585) (owner: 10RhinosF1) [08:23:54] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [08:25:20] !log migrating primary/secondary instances off ganeti2023 [08:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:10] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [08:30:18] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10hashar) [08:31:39] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10hashar) >>! In T283582#7606038, @Dzahn wrote: > for the record: I have absolutel... 
[08:33:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P18441 and previous config saved to /var/cache/conftool/dbconfig/20220110-083354-marostegui.json [08:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:20] (03PS1) 10Giuseppe Lavagetto: safe-service-restart: make the default grace period 3 seconds [puppet] - 10https://gerrit.wikimedia.org/r/752600 (https://phabricator.wikimedia.org/T266055) [08:37:09] (03CR) 10jerkins-bot: [V: 04-1] safe-service-restart: make the default grace period 3 seconds [puppet] - 10https://gerrit.wikimedia.org/r/752600 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto) [08:37:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org [08:39:21] 10SRE, 10Cloud-Services, 10Datasets-General-or-Unknown, 10affects-Kiwix-and-openZIM: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (10ArielGlenn) >>! In T57503#7587906, @Kelson wrote: > @ArielGlenn Any chance this ticket could be implemented some time? It seems as well... 
[08:48:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T297191)', diff saved to https://phabricator.wikimedia.org/P18442 and previous config saved to /var/cache/conftool/dbconfig/20220110-084858-marostegui.json [08:49:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [08:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [08:49:02] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [08:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [08:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [08:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T297191)', diff saved to https://phabricator.wikimedia.org/P18443 and previous config saved to /var/cache/conftool/dbconfig/20220110-084912-marostegui.json [08:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:21] (03CR) 10RhinosF1: [C: 03+1] Update static html dump index.html to mention Wikimedia Enterprise HTML dumps [puppet] - 10https://gerrit.wikimedia.org/r/749875 (https://phabricator.wikimedia.org/T273585) (owner: 10RhinosF1) [08:50:21] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1131 (T297191)', diff saved to https://phabricator.wikimedia.org/P18444 and previous config saved to /var/cache/conftool/dbconfig/20220110-085020-marostegui.json [08:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:44] apergos: feel free to merge! [08:52:31] (03CR) 10Muehlenhoff: [C: 03+1] "Thanks for the patch! Merging" [puppet] - 10https://gerrit.wikimedia.org/r/752022 (https://phabricator.wikimedia.org/T298815) (owner: 10RhinosF1) [08:52:35] (03CR) 10Muehlenhoff: [C: 03+2] cross-validate-accounts: add deployment-ci-admins to ops expected list [puppet] - 10https://gerrit.wikimedia.org/r/752022 (https://phabricator.wikimedia.org/T298815) (owner: 10RhinosF1) [08:53:02] (03PS2) 10Giuseppe Lavagetto: safe-service-restart: make the default grace period 3 seconds [puppet] - 10https://gerrit.wikimedia.org/r/752600 (https://phabricator.wikimedia.org/T266055) [08:53:44] (03CR) 10Jelto: [C: 04-1] "PCC fails currently, see inline comment" [puppet] - 10https://gerrit.wikimedia.org/r/751510 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [08:54:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove all groups from s7 codfw T263127', diff saved to https://phabricator.wikimedia.org/P18445 and previous config saved to /var/cache/conftool/dbconfig/20220110-085402-marostegui.json [08:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:06] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [08:54:11] No problem moritzm [08:54:38] (03CR) 10ArielGlenn: [C: 03+2] Update static html dump index.html to mention Wikimedia Enterprise HTML dumps [puppet] - 10https://gerrit.wikimedia.org/r/749875 (https://phabricator.wikimedia.org/T273585) (owner: 10RhinosF1) [08:54:48] 10SRE, 10Discovery-Search (Current work): Consider filesystem/disk based improvements on WQDS servers - https://phabricator.wikimedia.org/T298570 (10Joe) XFS has both 
advantages and disadvantages, including in terms of data safety. In fact, I think the data persistence team has used xfs in the past but switche... [08:54:59] I also got https://gerrit.wikimedia.org/r/c/operations/puppet/+/752018/3/modules/profile/files/sre/check_user.py left [08:55:03] Also thanks apergos [08:56:16] moritzm: may I merge your puppet changes through? [08:56:36] cross-validate-accounts.py specifically [08:57:00] (03CR) 10Gehel: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/745629 (https://phabricator.wikimedia.org/T293638) (owner: 10Ebernhardson) [08:57:27] (03CR) 10Giuseppe Lavagetto: [V: 03+1] envoy: make the choice of api version explicit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751717 (owner: 10Giuseppe Lavagetto) [08:57:53] apergos: please go ahead, I was about to [08:58:24] done [09:01:55] thx [09:05:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P18446 and previous config saved to /var/cache/conftool/dbconfig/20220110-090525-marostegui.json [09:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:31] (03PS1) 10Muehlenhoff: Remove LDAP access for marcella [puppet] - 10https://gerrit.wikimedia.org/r/752604 [09:10:18] (03CR) 10ArielGlenn: "I don't want to remove this right now; we're restructuring the whole dumpsdata server layout, at the end of which we can toss everything t" [puppet] - 10https://gerrit.wikimedia.org/r/751693 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [09:18:09] (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for marcella [puppet] - 10https://gerrit.wikimedia.org/r/752604 (owner: 10Muehlenhoff) [09:20:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P18447 and previous config saved to /var/cache/conftool/dbconfig/20220110-092029-marostegui.json [09:20:31] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:04] (03PS1) 10Gergő Tisza: linkrecommendation: Add MEDIAWIKI_PROXY_API_BASE_URL [deployment-charts] - 10https://gerrit.wikimedia.org/r/752608 (https://phabricator.wikimedia.org/T298857) [09:26:01] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/752600 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto) [09:26:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove contributions group from s7 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P18448 and previous config saved to /var/cache/conftool/dbconfig/20220110-092605-marostegui.json [09:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:09] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [09:27:55] 10ops-eqiad: Installation issues on PowerEdge R440 Kafka main eqiad servers with buster / firmware update needed - https://phabricator.wikimedia.org/T298867 (10elukey) [09:28:10] 10ops-eqiad: Installation issues on PowerEdge R440 Kafka main eqiad servers with buster / firmware update needed - https://phabricator.wikimedia.org/T298867 (10elukey) [09:32:03] (03PS1) 10Elukey: role::kafka::main: apply kafka fixed uid/gid to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/752613 (https://phabricator.wikimedia.org/T296641) [09:33:00] (03PS2) 10Elukey: role::kafka::main: apply kafka fixed uid/gid to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/752613 (https://phabricator.wikimedia.org/T296641) [09:35:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T297191)', diff saved to https://phabricator.wikimedia.org/P18449 and previous config saved to /var/cache/conftool/dbconfig/20220110-093534-marostegui.json [09:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:38] T297191: Schema change for dropping 
page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [09:42:34] 10ops-eqiad: msw-a8-eqiad potentially down - https://phabricator.wikimedia.org/T298869 (10ayounsi) p:05Triage→03High [09:43:53] ACKNOWLEDGEMENT - ps1-a8-eqiad-infeed-load-tower-B-phase-Z on ps1-a8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi T298869 - The acknowledgement expires at: 2022-01-11 09:43:13. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:43:53] ACKNOWLEDGEMENT - ps1-a8-eqiad-infeed-load-tower-B-phase-Y on ps1-a8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi T298869 - The acknowledgement expires at: 2022-01-11 09:43:13. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:43:53] ACKNOWLEDGEMENT - ps1-a8-eqiad-infeed-load-tower-B-phase-X on ps1-a8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi T298869 - The acknowledgement expires at: 2022-01-11 09:43:13. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:43:53] ACKNOWLEDGEMENT - ps1-a8-eqiad-infeed-load-tower-A-phase-Z on ps1-a8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi T298869 - The acknowledgement expires at: 2022-01-11 09:43:13. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:43:53] ACKNOWLEDGEMENT - ps1-a8-eqiad-infeed-load-tower-A-phase-Y on ps1-a8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi T298869 - The acknowledgement expires at: 2022-01-11 09:43:13. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:43:53] ACKNOWLEDGEMENT - ps1-a8-eqiad-infeed-load-tower-A-phase-X on ps1-a8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi T298869 - The acknowledgement expires at: 2022-01-11 09:43:13. 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:43:53] ACKNOWLEDGEMENT - SSH on pki1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi T298869 - The acknowledgement expires at: 2022-01-11 09:43:13. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:43:54] ACKNOWLEDGEMENT - Juniper alarms on msw1-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.10 ayounsi T298869 - The acknowledgement expires at: 2022-01-11 09:43:13. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [09:43:54] ACKNOWLEDGEMENT - Host msw1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% ayounsi T298869 - The acknowledgement expires at: 2022-01-11 09:43:13. [09:43:55] ACKNOWLEDGEMENT - SSH on db1129.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi T298869 - The acknowledgement expires at: 2022-01-11 09:43:13. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:43:55] ACKNOWLEDGEMENT - SSH on db1117.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi T298869 - The acknowledgement expires at: 2022-01-11 09:43:13. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:43:56] ACKNOWLEDGEMENT - SSH on db1111.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi T298869 - The acknowledgement expires at: 2022-01-11 09:43:13. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:44:58] ACKNOWLEDGEMENT - Juniper alarms on cr2-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 2 red alarms, 0 yellow alarms ayounsi T298869 - The acknowledgement expires at: 2022-01-11 09:44:46. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [09:44:58] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi T298869 - The acknowledgement expires at: 2022-01-11 09:44:46. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:45:24] (03PS2) 10Giuseppe Lavagetto: envoy: make the choice of api version explicit [puppet] - 10https://gerrit.wikimedia.org/r/751717 [09:45:26] (03PS2) 10Giuseppe Lavagetto: services_proxy::envoy: add support for v3 configuration [puppet] - 10https://gerrit.wikimedia.org/r/751718 [09:46:33] ACKNOWLEDGEMENT - DNS on ganeti1023.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.207 ayounsi T298869 - The acknowledgement expires at: 2022-01-11 09:46:24. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:46:33] ACKNOWLEDGEMENT - DNS on mw1453.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.193 ayounsi T298869 - The acknowledgement expires at: 2022-01-11 09:46:24. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:46:33] ACKNOWLEDGEMENT - DNS on mw1454.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.194 ayounsi T298869 - The acknowledgement expires at: 2022-01-11 09:46:24. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:46:33] ACKNOWLEDGEMENT - DNS on mw1455.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.195 ayounsi T298869 - The acknowledgement expires at: 2022-01-11 09:46:24. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:46:33] ACKNOWLEDGEMENT - DNS on mw1456.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.196 ayounsi T298869 - The acknowledgement expires at: 2022-01-11 09:46:24. 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:47:37] (03CR) 10jerkins-bot: [V: 04-1] services_proxy::envoy: add support for v3 configuration [puppet] - 10https://gerrit.wikimedia.org/r/751718 (owner: 10Giuseppe Lavagetto) [09:55:00] (03Abandoned) 10David Caro: {p,r}:dumps:generation:sever:alldumps: remove usused role/profile [puppet] - 10https://gerrit.wikimedia.org/r/751693 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [09:56:02] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33162/console" [puppet] - 10https://gerrit.wikimedia.org/r/752613 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [09:56:22] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [09:56:50] !log migrating primary/secondary instances off ganeti2019 [09:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:18] (03PS1) 10ArielGlenn: remove snapshot02 from deployment prep scap and update domian name there too [dumps/scap] - 10https://gerrit.wikimedia.org/r/752622 [10:14:38] 10SRE, 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Allow idrac tftp fetching of firmware updates (either to existing tftp or new solution) - https://phabricator.wikimedia.org/T283771 (10ayounsi) a quick look on GitHub shows 2 approaches: * This one parses the firmware page: https://g... [10:16:22] !log removing echo objectcache entries on all wikis (T272512) [10:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:25] T272512: Apply outstanding schema changes for "objectcache" tables in production (exptime, flags, modtoken) - https://phabricator.wikimedia.org/T272512 [10:16:46] (03CR) 10Ayounsi: [C: 03+1] "Thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/751952 (https://phabricator.wikimedia.org/T282787) (owner: 10Majavah) [10:22:14] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:23:05] (03CR) 10Giuseppe Lavagetto: [C: 03+1] role::kafka::main: apply kafka fixed uid/gid to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/752613 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [10:34:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [10:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [10:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:51] !log stop/start kafka daemons on kafka-main1* nodes to move the kafka user to fixed uid/gid - T296641 [10:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:54] T296641: Upgrade kafka-main nodes to buster - https://phabricator.wikimedia.org/T296641 [10:39:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [10:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [10:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:13] (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] remove snapshot02 from deployment prep scap and update domian name there too [dumps/scap] - 10https://gerrit.wikimedia.org/r/752622 (owner: 10ArielGlenn) [10:39:21] !log marostegui@cumin1001 START - 
Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [10:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [10:39:23] 10SRE-swift-storage: Swift-recon -d overstates disk capacity and usage - https://phabricator.wikimedia.org/T294016 (10MatthewVernon) A revised version is now merged upstream. Probably best to just wait until this gets into Debian, but it is a client-side patch, so we could deploy a patched client at reasonably l... [10:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [10:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [10:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [10:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [10:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T297191)', diff saved to https://phabricator.wikimedia.org/P18451 and previous config saved to /var/cache/conftool/dbconfig/20220110-104004-marostegui.json [10:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:07] 
T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [10:40:21] (03CR) 10Jbond: [C: 03+1] "lgtm minor nit" [puppet] - 10https://gerrit.wikimedia.org/r/752170 (owner: 10David Caro) [10:40:41] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::kafka::main: apply kafka fixed uid/gid to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/752613 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [10:44:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove logpager group from s7 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P18452 and previous config saved to /var/cache/conftool/dbconfig/20220110-104445-marostegui.json [10:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:49] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [10:45:19] (03CR) 10David Caro: check_haproxy: improve failover output (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/752170 (owner: 10David Caro) [10:46:28] (03CR) 10Jbond: [C: 04-1] "thanks for the fix but see inline" [puppet] - 10https://gerrit.wikimedia.org/r/752018 (https://phabricator.wikimedia.org/T298808) (owner: 10RhinosF1) [10:49:00] PROBLEM - SSH on mw2254.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:53:14] !log installing openjdk-11 security updates [10:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T297191)', diff saved to https://phabricator.wikimedia.org/P18453 and previous config saved to /var/cache/conftool/dbconfig/20220110-105529-marostegui.json [10:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:33] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites 
- https://phabricator.wikimedia.org/T297191 [10:58:53] 10SRE, 10ops-ulsfo: Update PDUs name-server config - https://phabricator.wikimedia.org/T295668 (10ayounsi) a:05ayounsi→03RobH We usually use the FQDN for logging and NTP endpoints, see https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/ServerTech#Setting_up_the_Configurat... [11:07:45] (03PS1) 10Muehlenhoff: Extend logstash Cumin alias with new Opensearch roles [puppet] - 10https://gerrit.wikimedia.org/r/752631 [11:07:47] (Juniper alarm active) resolved: Juniper alarm active - https://alerts.wikimedia.org [11:10:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P18454 and previous config saved to /var/cache/conftool/dbconfig/20220110-111034-marostegui.json [11:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:50] (03PS1) 104nn1l2: hewikisource: remove "קטע" namespace and its talk page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752634 (https://phabricator.wikimedia.org/T298430) [11:20:43] (03PS4) 10RhinosF1: check_user: catch manager being None [puppet] - 10https://gerrit.wikimedia.org/r/752018 (https://phabricator.wikimedia.org/T298808) [11:21:04] (03CR) 10RhinosF1: check_user: catch manager being None (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/752018 (https://phabricator.wikimedia.org/T298808) (owner: 10RhinosF1) [11:21:44] jbond: ^ [11:22:39] (03CR) 10Jbond: [C: 03+2] "LGTM will merge, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/752018 (https://phabricator.wikimedia.org/T298808) (owner: 10RhinosF1) [11:22:44] RhinosF1: thanks merging [11:23:13] No problem, happy to help [11:23:18] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:25:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P18455 and previous config saved to /var/cache/conftool/dbconfig/20220110-112538-marostegui.json [11:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:03] apergos: I notice https://dumps.wikimedia.org/ doesn't show my change. Does a cache need purging? [11:27:04] jouncebot: next [11:27:04] In 0 hour(s) and 32 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220110T1200) [11:27:25] (03CR) 10Jbond: [C: 03+1] check_haproxy: improve failover output (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/752170 (owner: 10David Caro) [11:27:56] (03PS5) 10Hnowlan: api-gateway: allow discovery services to set custom rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) [11:28:41] (03CR) 10Hnowlan: api-gateway: allow discovery services to set custom rate limits (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan) [11:32:00] I have a question: compare these two commit messages: [11:32:01] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/745220/7//COMMIT_MSG [11:32:28] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/752036/3//COMMIT_MSG [11:34:25] In the first one, why is the commit attached to the deployer, but in the second one to my own name? [11:35:52] nn1l2: in the first one, the patch set had to be rebased when deploying and the second one didn't need a rebase [11:37:20] I am trying to hide my timezone using the instructions at https://saebamini.com/Git-commit-with-UTC-timestamp-ignore-local-timezone/ [11:37:56] how can I also change the commit date (as opposed to the author date) to UTC? 
[11:39:17] See https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/752634/1//COMMIT_MSG [11:39:46] I have been able to change my timezone for author date but not the commit date. [11:40:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T297191)', diff saved to https://phabricator.wikimedia.org/P18456 and previous config saved to /var/cache/conftool/dbconfig/20220110-114043-marostegui.json [11:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:48] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [11:40:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 9 hosts with reason: Maintenance [11:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 9 hosts with reason: Maintenance [11:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [11:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [11:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:30] (03CR) 10Jelto: [C: 03+2] ssh-config: add config for gitlab.wikimedia.org [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/749509 (owner: 10Jelto) [11:41:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [11:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:39] !log marostegui@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [11:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [11:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [11:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db[1155-1156].eqiad.wmnet with reason: Maintenance [11:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db[1155-1156].eqiad.wmnet with reason: Maintenance [11:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [11:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [11:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T297191)', diff saved to https://phabricator.wikimedia.org/P18457 and previous config saved to /var/cache/conftool/dbconfig/20220110-114305-marostegui.json [11:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:09] 10SRE-swift-storage, 
10Beta-Cluster-Infrastructure: Upgrade deployment-prep Swift cluster to Debian Buster or newer - https://phabricator.wikimedia.org/T298253 (10MatthewVernon) [11:43:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T297191)', diff saved to https://phabricator.wikimedia.org/P18458 and previous config saved to /var/cache/conftool/dbconfig/20220110-114326-marostegui.json [11:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:45] (03CR) 10Jelto: [V: 03+2 C: 03+2] ssh-config: add config for gitlab.wikimedia.org [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/749509 (owner: 10Jelto) [11:44:12] 10SRE-swift-storage, 10Beta-Cluster-Infrastructure: Upgrade deployment-prep Swift cluster to Debian Buster or newer - https://phabricator.wikimedia.org/T298253 (10MatthewVernon) [much of the prod swift infrastructure is still running on Stretch, FWIW; Thanos frontends are now Bullseye] [11:46:22] (03PS3) 10ArielGlenn: Add siteinfo data in formatversion=2 too [dumps] - 10https://gerrit.wikimedia.org/r/747987 (owner: 10Legoktm) [11:46:45] taavi: sorry, for the newbie question. I'm writing a guide page for deployment. What is the name of you folks? Deployers? [11:46:59] yeah, deployers works fine [11:49:57] 10SRE, 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Allow idrac tftp fetching of firmware updates (either to existing tftp or new solution) - https://phabricator.wikimedia.org/T283771 (10jbond) > This one parses the firmware page: https://github.com/lateralblast/druid This uses seleni... [11:51:07] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-RhinosF1: check_user - KeyError: 'relations' - https://phabricator.wikimedia.org/T298808 (10RhinosF1) 05Open→03Resolved a:03RhinosF1 Thanks to @jbond for merge. [11:51:23] (03CR) 10ArielGlenn: "I've tested this in deployment-prep. 
Legoktm, I've made a small change just to make the format version an attribute and simplify the subcl" [dumps] - 10https://gerrit.wikimedia.org/r/747987 (owner: 10Legoktm) [11:51:56] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33163/console" [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: 10Hnowlan) [11:53:05] RhinosF1: it worked for me so [11:53:11] maybe your browser cache, dunno [11:53:46] https://usercontent.irccloud-cdn.com/file/Avljjyxr/1641815620.JPG [11:54:10] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33164/console" [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: 10Hnowlan) [11:58:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P18459 and previous config saved to /var/cache/conftool/dbconfig/20220110-115830-marostegui.json [11:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:44] 10SRE, 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Allow idrac tftp fetching of firmware updates (either to existing tftp or new solution) - https://phabricator.wikimedia.org/T283771 (10MoritzMuehlenhoff) >>! In T283771#7608623, @jbond wrote: >> This one parses the firmware page: htt... [12:01:03] jouncebot: hi? [12:01:05] jouncebot: now [12:01:06] For the next 0 hour(s) and 58 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220110T1200) [12:01:12] * cormacparle waves [12:01:24] hey, I can deploy today [12:01:29] I wonder what happened to jouncebot [12:01:42] ok great [12:01:43] cormacparle: does your config change depend on the backport? 
[12:01:49] no [12:02:13] taavi: here [12:02:13] hi [12:02:15] 10SRE, 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Allow idrac tftp fetching of firmware updates (either to existing tftp or new solution) - https://phabricator.wikimedia.org/T283771 (10jbond) > It'll also be rather brittle since every year Firefox ESR bumps to a new version (e.g. fr... [12:02:18] jouncebot: refresh [12:02:19] I refreshed my knowledge about deployments. [12:02:24] I had been disconnected [12:03:32] @taavi if you're doing my deployments can you let me know when the backport is done, because I need to run a maint script when it is [12:03:39] cormacparle: sorry, I'm not comfortable with that backport, the maintenance script looks unsafe on production since it doesn't do any batching or sleeping between primary db calls [12:03:58] there are only 800 records to be changed [12:04:16] 2400 writes in a short period is a lot [12:04:39] ok if you're uncomfortable I can rewrite it and we can do it some other day [12:04:50] yeah I'd prefer that, thanks [12:05:13] but if the config patch is unrelated, we can do that [12:05:16] cool [12:05:22] (03PS4) 10David Caro: check_haproxy: improve failover output [puppet] - 10https://gerrit.wikimedia.org/r/752170 [12:05:32] (03CR) 10Majavah: [C: 03+2] Add MediaSearch profiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747868 (https://phabricator.wikimedia.org/T297863) (owner: 10Matthias Mullie) [12:05:43] (03PS3) 10Giuseppe Lavagetto: envoy: make the choice of api version explicit [puppet] - 10https://gerrit.wikimedia.org/r/751717 [12:05:45] (03PS3) 10Giuseppe Lavagetto: services_proxy::envoy: add support for v3 configuration [puppet] - 10https://gerrit.wikimedia.org/r/751718 [12:06:25] (03Merged) 10jenkins-bot: Add MediaSearch profiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747868 (https://phabricator.wikimedia.org/T297863) (owner: 10Matthias Mullie) [12:06:48] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS 
(DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33165/console" [puppet] - 10https://gerrit.wikimedia.org/r/751718 (owner: 10Giuseppe Lavagetto) [12:06:55] cormacparle: the config change is on mwdebug1001, please test [12:07:17] kk [12:08:06] (03PS4) 10Giuseppe Lavagetto: services_proxy::envoy: add support for v3 configuration [puppet] - 10https://gerrit.wikimedia.org/r/751718 [12:08:23] (03CR) 10Jbond: [C: 03+1] check_haproxy: improve failover output [puppet] - 10https://gerrit.wikimedia.org/r/752170 (owner: 10David Caro) [12:09:18] @taavi all good, thank you [12:09:22] thanks, syncing [12:10:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:10:26] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:747868|Add MediaSearch profiles (T297863)]] (duration: 00m 59s) [12:10:26] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33166/console" [puppet] - 10https://gerrit.wikimedia.org/r/751718 (owner: 10Giuseppe Lavagetto) [12:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:30] T297863: Move MediaSearch boost params from MediaSearchProfiles.php to config - https://phabricator.wikimedia.org/T297863 [12:11:15] alright, let's do hauskatze's patch next since I think the namespace removal one needs a maintenance script run and I want to leave that last [12:11:26] (03PS4) 10Majavah: uzwiki: Amend Babel configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751545 (https://phabricator.wikimedia.org/T131924) (owner: 10MarcoAurelio) [12:11:33] alright, with you in a second [12:11:36] (03CR) 10Majavah: [C: 03+2] uzwiki: Amend Babel configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751545 
(https://phabricator.wikimedia.org/T131924) (owner: 10MarcoAurelio) [12:11:37] updating bot policy [12:12:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10elukey) [12:12:25] (03Merged) 10jenkins-bot: uzwiki: Amend Babel configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751545 (https://phabricator.wikimedia.org/T131924) (owner: 10MarcoAurelio) [12:12:50] hauskatze: pulled to mwdebug1001, please test [12:12:58] checking [12:13:10] 10SRE, 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Allow idrac tftp fetching of firmware updates (either to existing tftp or new solution) - https://phabricator.wikimedia.org/T283771 (10ayounsi) Relying on parsing a website is often asking for troubles. Maybe we can also ask our acco... [12:13:28] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:13:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P18460 and previous config saved to /var/cache/conftool/dbconfig/20220110-121335-marostegui.json [12:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:12] urbanecm: Amir1: Lucas_WMDE: hey, any of you around? 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/752634 removes namespaces, and I suspect it needs a maintenance script to fix the redirect pages to NS0, but not sure which script that would be and if we even have one [12:14:19] kind of a duplicate of namespaceDupes.php [12:14:26] hey taavi [12:14:32] I’m around but not sure what maintenance script that would be either [12:14:43] i always warn people it permanently deletes whatever was there [12:14:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:52] (03PS1) 10JMeybohm: Use promtool in PATH rather than /usr/bin/promtool [alerts] - 10https://gerrit.wikimedia.org/r/752651 [12:14:52] I'm about to go to a meeting [12:14:54] (03PS1) 10JMeybohm: Add rule to alert on terminated/failing rsyslog mmkubernetes [alerts] - 10https://gerrit.wikimedia.org/r/752652 (https://phabricator.wikimedia.org/T289766) [12:15:03] i don't think we have a script that moves the pages to ns0 [12:15:06] (03CR) 10David Caro: [C: 03+2] check_haproxy: improve failover output [puppet] - 10https://gerrit.wikimedia.org/r/752170 (owner: 10David Caro) [12:15:20] finding #babel pages is hard today, let me create a test page taavi for checking [12:15:29] sure, no hurry [12:15:45] urbanecm: ah, just realized the task asks to delete everything there [12:16:05] taavi: then just ensure _permanent_ deletion is fine [12:16:24] I see no change at https://uz.wikipedia.org/wiki/Foydalanuvchi:MarcoAurelio/Sandbox on mwdebug1001 [12:16:35] I'll purge [12:16:51] yep, that fixed it [12:17:05] now they have both categories as requested [12:17:08] cool, syncing [12:17:13] User-xx and User-xx-level [12:17:33] 
(personally I find it weird to have both but... not my call!) [12:17:39] (03PS5) 10Giuseppe Lavagetto: services_proxy::envoy: add support for v3 configuration [puppet] - 10https://gerrit.wikimedia.org/r/751718 [12:17:51] hauskatze: it's doing its job :)) [12:18:04] (03CR) 10David Caro: {p,r}:dumps:generation:sever:alldumps: remove usused role/profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751693 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [12:18:27] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:751545|uzwiki: Amend Babel configuration (T131924)]] (duration: 00m 59s) [12:18:29] urbanecm: yeah, all the pages are redirects so I guess it's fine, but we can always double check [12:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:30] T131924: Babel configuration for uz.wikipedia - https://phabricator.wikimedia.org/T131924 [12:18:35] also, taavi, welcome to the all-powerful on-wiki interface :)) [12:18:38] 10SRE, 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Allow idrac tftp fetching of firmware updates (either to existing tftp or new solution) - https://phabricator.wikimedia.org/T283771 (10MoritzMuehlenhoff) >>! In T283771#7608655, @jbond wrote: >> It'll also be rather brittle since eve... 
[12:18:38] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33167/console" [puppet] - 10https://gerrit.wikimedia.org/r/751718 (owner: 10Giuseppe Lavagetto) [12:18:50] (03CR) 10David Caro: [C: 03+2] elasticsearch:decommission: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751088 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [12:18:59] thanks :D [12:19:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:19:05] and yeah, an explicit acknowledgement is always welcomed, to avoid you having to revert the change [12:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:08] (or worse, me :D) [12:19:30] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [12:19:38] pages seem to be already deleted https://he.wikisource.org/wiki/%D7%9E%D7%99%D7%95%D7%97%D7%93:%D7%A9%D7%99%D7%A0%D7%95%D7%99%D7%99%D7%9D_%D7%90%D7%97%D7%A8%D7%95%D7%A0%D7%99%D7%9D?hidebots=1&hidecategorization=1&hideWikibase=1&namespace=100&limit=500&days=30&urlversion=2 [12:20:03] I don't speak Hebrew, of course, and I only guess [12:20:07] it still has some pages, https://he.wikisource.org/wiki/%D7%9E%D7%99%D7%95%D7%97%D7%93:%D7%93%D7%A4%D7%99%D7%9D_%D7%94%D7%9E%D7%AA%D7%97%D7%99%D7%9C%D7%99%D7%9D_%D7%91?prefix=&namespace=100 and [12:20:07] https://he.wikisource.org/wiki/%D7%9E%D7%99%D7%95%D7%97%D7%93:%D7%93%D7%A4%D7%99%D7%9D_%D7%94%D7%9E%D7%AA%D7%97%D7%99%D7%9C%D7%99%D7%9D_%D7%91?prefix=&namespace=101 [12:20:16] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33168/console" [puppet] - 10https://gerrit.wikimedia.org/r/751717 (owner: 10Giuseppe Lavagetto) [12:20:35] nn1l2: and even if they were deleted, 
removing the NS deletes them permanently (so even admins can't see/view/restore them) [12:20:43] I'll drop a note on the task [12:20:47] sounds good [12:21:12] Just closed a 6 y/o task \o/ [12:21:24] 10SRE, 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Allow idrac tftp fetching of firmware updates (either to existing tftp or new solution) - https://phabricator.wikimedia.org/T283771 (10jbond) > Maybe we can also ask our account rep. for their recommendation (different API, etc). wil... [12:21:43] taavi: if you have time, would you mind doing https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/752187 too? [12:21:57] sure [12:21:58] (new variable will be needed with wmf.17, so no testing can be done) [12:22:06] (03PS3) 10Majavah: Growth: Add GEMentorDashboardDeploymentMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752187 (https://phabricator.wikimedia.org/T298792) (owner: 10Urbanecm) [12:22:11] (03CR) 10Majavah: [C: 03+2] Growth: Add GEMentorDashboardDeploymentMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752187 (https://phabricator.wikimedia.org/T298792) (owner: 10Urbanecm) [12:23:00] (03Merged) 10jenkins-bot: Growth: Add GEMentorDashboardDeploymentMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752187 (https://phabricator.wikimedia.org/T298792) (owner: 10Urbanecm) [12:23:50] (03PS2) 10Majavah: beta: Enable temporary global user groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752343 (https://phabricator.wikimedia.org/T153815) [12:24:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:22] temporary global groups, awesome taavi [12:24:24] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:752187|Growth: Add GEMentorDashboardDeploymentMode (T298792)]] (duration: 00m 59s) [12:24:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:27] T298792: Mentor dashboard: Make it possible to deploy a module only on the pilot wikis - https://phabricator.wikimedia.org/T298792 [12:24:31] urbanecm: that one is now live [12:24:32] hauskatze: where can i sign that? [12:24:44] (03CR) 10Majavah: [C: 03+2] beta: Enable temporary global user groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752343 (https://phabricator.wikimedia.org/T153815) (owner: 10Majavah) [12:24:51] urbanecm: ^ [12:25:14] fancy stuff [12:25:14] hauskatze: hopefully coming to production next week :-) [12:25:25] (03Merged) 10jenkins-bot: beta: Enable temporary global user groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752343 (https://phabricator.wikimedia.org/T153815) (owner: 10Majavah) [12:25:38] taavi: I'll nominate you for a t-shirt if they work :) [12:26:37] I don't think I'm eligble, I've already received free merch (a hoodie) from the WMF [12:27:01] nn1l2: still around? [12:27:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:09] looks like we can proceed with the namespace deleting one [12:27:12] yes taavi [12:27:24] (03PS2) 10Majavah: hewikisource: remove "קטע" namespace and its talk page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752634 (https://phabricator.wikimedia.org/T298430) (owner: 104nn1l2) [12:28:21] (03CR) 10Majavah: [C: 03+2] hewikisource: remove "קטע" namespace and its talk page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752634 (https://phabricator.wikimedia.org/T298430) (owner: 104nn1l2) [12:28:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1170:3312 (T297191)', diff saved to https://phabricator.wikimedia.org/P18462 and previous config saved to /var/cache/conftool/dbconfig/20220110-122840-marostegui.json [12:28:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [12:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [12:28:44] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [12:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T297191)', diff saved to https://phabricator.wikimedia.org/P18463 and previous config saved to /var/cache/conftool/dbconfig/20220110-122847-marostegui.json [12:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:11] hauskatze: https://meta.wikimedia.beta.wmflabs.org/wiki/Special:GlobalUserRights/Majavah :P [12:29:16] (03Merged) 10jenkins-bot: hewikisource: remove "קטע" namespace and its talk page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752634 (https://phabricator.wikimedia.org/T298430) (owner: 104nn1l2) [12:29:36] nn1l2: your patch is on mwdebug1001, can you test please? 
[12:29:47] give me a minute [12:29:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T297191)', diff saved to https://phabricator.wikimedia.org/P18464 and previous config saved to /var/cache/conftool/dbconfig/20220110-123009-marostegui.json [12:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:10] it's still there [12:32:49] now seems gone [12:32:58] maybe some caching? [12:32:59] maybe a cache issue [12:33:32] special:allpages doesn't show it for me, and the pages are not visible [12:33:39] syncing them [12:33:54] Good to go [12:34:39] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:752634|hewikisource: remove "קטע" namespace and its talk page (T298430)]] (duration: 00m 58s) [12:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:42] taavi: I'll wait until they hit production :) [12:34:42] T298430: hewikisource - remove "קטע" namespace - https://phabricator.wikimedia.org/T298430 [12:34:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:59] maybe I'll test them later today [12:35:00] at beta [12:35:38] anyone have anything else to deploy? 
[12:36:03] !log UTC morning deploys done [12:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:52] taavi: again sorry for the rookie question: calling you folks developers is silly? because I, as someone who upload patches, is considered developer too? Is this correct? [12:42:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2032 for Bullseye reimage T295965', diff saved to https://phabricator.wikimedia.org/P18465 and previous config saved to /var/cache/conftool/dbconfig/20220110-124222-marostegui.json [12:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:25] T295965: Test MariaDB 10.4 with Bullseye - https://phabricator.wikimedia.org/T295965 [12:43:12] (03PS1) 10Marostegui: es2032: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/752655 (https://phabricator.wikimedia.org/T295965) [12:43:50] (03CR) 10Marostegui: [C: 03+2] es2032: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/752655 (https://phabricator.wikimedia.org/T295965) (owner: 10Marostegui) [12:44:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es2032.codfw.wmnet with OS bullseye [12:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff 
saved to https://phabricator.wikimedia.org/P18466 and previous config saved to /var/cache/conftool/dbconfig/20220110-124513-marostegui.json [12:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:28] nn1l2: depends on who you ask, but I'd consider config patches be more system administration work than development [12:47:54] so using the word developer is false either for you and me [12:51:13] nn1l2: well, taavi also does a lot of development :)) [12:52:12] Yeah, but we are talking only about the patches I upload. You are deployers and I am ...? [12:53:50] a person that uploads config patches is the most precise description [12:54:21] thanks! that's very fancy :) [12:54:43] I have another *real* question/problem [12:54:43] we can't call you a system administrator (because you don't _actually_ have sysadmin-level of access) [12:54:59] nor a developer because, as taa.vi said, it's not actually development [12:55:00] Recently, Wikimedia Debug add-on is playing with me [12:55:04] yeah? [12:56:00] It does not appear, I should reload the page multiple times so that it appears [12:56:04] I have a screenshot [12:56:12] paste it here :) [12:56:27] https://pasteboard.co/i00SL89BYLyB.png [12:56:57] can you verify the plugin is up to date? [12:57:17] I installed it maybe a year ago [12:57:29] so no, I can't confirm [12:58:01] let's phrase it differently [12:58:05] what's the version you have installed? 
[12:58:21] (right-click on the icon and press "Manage extension") [12:58:45] 2.4.5 [12:59:08] that's the newest one [12:59:18] I'll ask you to make sure chrome is updated now [12:59:30] and if it is, to fill a task in phabricator :)) [12:59:41] Version 97.0.4692.71 (Official Build) (64-bit) [13:00:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P18467 and previous config saved to /var/cache/conftool/dbconfig/20220110-130018-marostegui.json [13:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:36] yes, my chrome is up-to-date [13:00:42] okay [13:01:03] then fill a task in https://phabricator.wikimedia.org/project/view/4924/ please [13:02:24] Thanks, I will. [13:02:29] !log installing ghostscript security updates [13:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:35] (03PS12) 10D3r1ck01: Define a contact form for Chapter/Thorg application status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748120 (https://phabricator.wikimedia.org/T298024) [13:11:39] (03PS1) 10Ladsgroup: auto_schema: Make sure the port :3306 is removed during check for depool [software] - 10https://gerrit.wikimedia.org/r/752656 (https://phabricator.wikimedia.org/T288235) [13:13:52] (03CR) 10Marostegui: [C: 03+1] auto_schema: Make sure the port :3306 is removed during check for depool [software] - 10https://gerrit.wikimedia.org/r/752656 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [13:14:23] (03CR) 10Ladsgroup: [C: 03+2] auto_schema: Make sure the port :3306 is removed during check for depool [software] - 10https://gerrit.wikimedia.org/r/752656 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [13:15:01] (03Merged) 10jenkins-bot: auto_schema: Make sure the port :3306 is removed during check for depool [software] - 10https://gerrit.wikimedia.org/r/752656 
(https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [13:15:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T297191)', diff saved to https://phabricator.wikimedia.org/P18468 and previous config saved to /var/cache/conftool/dbconfig/20220110-131523-marostegui.json [13:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:27] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [13:16:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2032.codfw.wmnet with OS bullseye [13:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:44] (03PS1) 10Kosta Harlan: GrowthExperiments: Start add image experiment for desktop users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752657 (https://phabricator.wikimedia.org/T298122) [13:18:18] (03CR) 10Kosta Harlan: [C: 04-2] "Earliest this would be scheduled for is Thursday January 13, when wmf.17 is in group2." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752657 (https://phabricator.wikimedia.org/T298122) (owner: 10Kosta Harlan) [13:36:08] Aren't you going to update https://wikitech.wikimedia.org/wiki/Deployments ? [13:36:37] For example, I don't see the name of taavi there currently. It should be added. 
[13:49:50] (03PS1) 10Ladsgroup: auto_schema: Remove the probelmatic override in depool logic [software] - 10https://gerrit.wikimedia.org/r/752662 (https://phabricator.wikimedia.org/T288235) [13:51:54] (03CR) 10Marostegui: [C: 03+1] auto_schema: Remove the probelmatic override in depool logic [software] - 10https://gerrit.wikimedia.org/r/752662 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [13:52:24] (03CR) 10Ladsgroup: [C: 03+2] auto_schema: Remove the probelmatic override in depool logic [software] - 10https://gerrit.wikimedia.org/r/752662 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [13:52:58] (03Merged) 10jenkins-bot: auto_schema: Remove the probelmatic override in depool logic [software] - 10https://gerrit.wikimedia.org/r/752662 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [13:54:40] !log upgrading oozie packages in reprepro in order to pick up new log4j version [13:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [13:56:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [13:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [13:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [13:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:24] !log marostegui@cumin1001 START - Cookbook 
sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [13:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [13:57:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [13:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [13:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [13:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [13:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:21] 10SRE, 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Allow idrac tftp fetching of firmware updates (either to existing tftp or new solution) - 
https://phabricator.wikimedia.org/T283771 (10jbond) I had another look at the dell pages and i have worked out a way to pull via an undocumente... [14:05:07] (03PS1) 10Muehlenhoff: Failover idp-test [dns] - 10https://gerrit.wikimedia.org/r/752663 [14:05:37] RECOVERY - SSH on mw2254.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:07:55] !log disable puppet fleet wide for puppetdb restart [14:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:10] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T298719 (10JVargas) Hi @Dzahn - Just saw the error on the request, and in case it's helpful, my manager is Lisa Gruwell. Let me know if there's anything else I can provide. Thanks! [14:13:46] (03CR) 10Muehlenhoff: [C: 03+2] Failover idp-test [dns] - 10https://gerrit.wikimedia.org/r/752663 (owner: 10Muehlenhoff) [14:16:45] PROBLEM - SSH on mw2252.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:19:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM moscovium.eqiad.wmnet [14:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:40] !upload wmf-sre-laptop 0.5.3 deb package [14:19:49] !log upload wmf-sre-laptop 0.5.3 deb package [14:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM moscovium.eqiad.wmnet [14:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:57] (03PS1) 10Ladsgroup: Add License and README [software/schema-changes] - 10https://gerrit.wikimedia.org/r/752667 (https://phabricator.wikimedia.org/T288235) [14:22:52] (03PS1) 10Ladsgroup: Give priority to PreparedUpdate [extensions/SpamBlacklist] 
(wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/752277 (https://phabricator.wikimedia.org/T288639) [14:23:02] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Add License and README [software/schema-changes] - 10https://gerrit.wikimedia.org/r/752667 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [14:24:15] jouncebot: nowandnext [14:24:15] No deployments scheduled for the next 2 hour(s) and 5 minute(s) [14:24:15] In 2 hour(s) and 5 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220110T1630) [14:24:19] oof [14:24:23] (03PS1) 10Marostegui: drop_pr_user_T297191.py: Schema change to drop pr_user [software/schema-changes] - 10https://gerrit.wikimedia.org/r/752669 (https://phabricator.wikimedia.org/T297191) [14:24:25] (03CR) 10Ladsgroup: [C: 03+2] Give priority to PreparedUpdate [extensions/SpamBlacklist] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/752277 (https://phabricator.wikimedia.org/T288639) (owner: 10Ladsgroup) [14:25:26] (03CR) 10Ladsgroup: [C: 03+1] drop_pr_user_T297191.py: Schema change to drop pr_user [software/schema-changes] - 10https://gerrit.wikimedia.org/r/752669 (https://phabricator.wikimedia.org/T297191) (owner: 10Marostegui) [14:26:00] (03CR) 10Marostegui: [V: 03+2 C: 03+2] drop_pr_user_T297191.py: Schema change to drop pr_user [software/schema-changes] - 10https://gerrit.wikimedia.org/r/752669 (https://phabricator.wikimedia.org/T297191) (owner: 10Marostegui) [14:27:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM idp-test1001.wikimedia.org [14:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idp-test1001.wikimedia.org [14:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:58] !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka 
A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons. [14:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:10] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [14:36:51] !log jbond@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM puppetdb1002.eqiad.wmnet [14:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:58] (03PS1) 10Ladsgroup: auto_schema: Remove default port in the Host [software] - 10https://gerrit.wikimedia.org/r/752671 (https://phabricator.wikimedia.org/T288235) [14:40:15] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10bking) The server reimage to bullseye is incomplete due to missing packages (among other things). I found [[ https://phabricator.wikimedia.org/T289135 | an epic with more... 
[14:40:38] (03CR) 10Marostegui: [C: 03+1] auto_schema: Remove default port in the Host [software] - 10https://gerrit.wikimedia.org/r/752671 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [14:41:15] (03Merged) 10jenkins-bot: Give priority to PreparedUpdate [extensions/SpamBlacklist] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/752277 (https://phabricator.wikimedia.org/T288639) (owner: 10Ladsgroup) [14:41:52] (03CR) 10Ladsgroup: [C: 03+2] auto_schema: Remove default port in the Host [software] - 10https://gerrit.wikimedia.org/r/752671 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [14:42:26] (03Merged) 10jenkins-bot: auto_schema: Remove default port in the Host [software] - 10https://gerrit.wikimedia.org/r/752671 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [14:42:47] 10SRE, 10Fundraising-Backlog, 10observability, 10serviceops-radar: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (10jgleeson) Thanks @Dzahn. Let's stick with this ticket if it's easiest? I'll get the wikitech names of the other fr-... [14:45:17] (03PS1) 10Btullis: Exclude log4j_extras from the classpath for coordinators [puppet] - 10https://gerrit.wikimedia.org/r/752673 (https://phabricator.wikimedia.org/T297468) [14:45:53] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10MoritzMuehlenhoff) Was this intentionally reimaged with Bullseye? I wouldn't entangle this with a hardware maintenance and simply reimage with stretch and then start the B... 
[14:46:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [14:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [14:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [14:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [14:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [14:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [14:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [14:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [14:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [14:47:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [14:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T297191)', diff saved to https://phabricator.wikimedia.org/P18469 and previous config saved to /var/cache/conftool/dbconfig/20220110-144737-marostegui.json [14:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:41] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [14:48:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:03] (03CR) 10JMeybohm: "Adding you as well for review because Filippo is out and I would like to have this before the equad ganeti reboots for https://phabricator" [alerts] - 10https://gerrit.wikimedia.org/r/752652 (https://phabricator.wikimedia.org/T289766) (owner: 10JMeybohm) [14:49:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T297191)', diff saved to https://phabricator.wikimedia.org/P18470 and previous config saved to 
/var/cache/conftool/dbconfig/20220110-144907-marostegui.json [14:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:46] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.16/extensions/SpamBlacklist/includes/SpamBlacklistHooks.php: Backport: [[gerrit:752277|Give priority to PreparedUpdate (T288639)]] (duration: 01m 00s) [14:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:49] T288639: SpamBlacklistHooks::onEditFilterMergedContent causes every edit to be rendered twice - https://phabricator.wikimedia.org/T288639 [14:49:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:13] (03PS1) 10Vgutierrez: cache::envoy: Bump request timeout to 300s [puppet] - 10https://gerrit.wikimedia.org/r/752674 (https://phabricator.wikimedia.org/T271421) [14:50:46] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33169/console" [puppet] - 10https://gerrit.wikimedia.org/r/752673 (https://phabricator.wikimedia.org/T297468) (owner: 10Btullis) [14:51:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:16] !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons. 
[14:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:04] (03CR) 10Ema: [C: 03+1] cache::envoy: Bump request timeout to 300s [puppet] - 10https://gerrit.wikimedia.org/r/752674 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [14:52:18] (03CR) 10Vgutierrez: [C: 03+2] cache::envoy: Bump request timeout to 300s [puppet] - 10https://gerrit.wikimedia.org/r/752674 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [14:55:48] !log jbond@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM puppetdb1002.eqiad.wmnet [14:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:00] (03PS2) 10Btullis: Exclude log4j_extras from the classpath for coordinators [puppet] - 10https://gerrit.wikimedia.org/r/752673 (https://phabricator.wikimedia.org/T297468) [14:59:42] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33170/console" [puppet] - 10https://gerrit.wikimedia.org/r/752673 (https://phabricator.wikimedia.org/T297468) (owner: 10Btullis) [15:04:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P18471 and previous config saved to /var/cache/conftool/dbconfig/20220110-150412-marostegui.json [15:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:27] So I don't have the op flag in here [15:05:41] can someone who does change it from me to cmooney for clinic duty in topic? 
[15:08:07] (03CR) 10Herron: [C: 03+1] Add rule to alert on terminated/failing rsyslog mmkubernetes [alerts] - 10https://gerrit.wikimedia.org/r/752652 (https://phabricator.wikimedia.org/T289766) (owner: 10JMeybohm) [15:13:36] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Clean up nova-network remains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751949 (owner: 10Majavah) [15:17:49] RECOVERY - SSH on mw2252.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:19:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P18472 and previous config saved to /var/cache/conftool/dbconfig/20220110-151917-marostegui.json [15:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:36] (03CR) 10Elukey: "One of the two hiera changes is wrong, it is applied to the hadoop master role rather than the coordinator node." 
[puppet] - 10https://gerrit.wikimedia.org/r/752673 (https://phabricator.wikimedia.org/T297468) (owner: 10Btullis) [15:21:38] (03PS1) 10Herron: kafka-logging: move to fixed UID/GID for kafka user [puppet] - 10https://gerrit.wikimedia.org/r/752677 (https://phabricator.wikimedia.org/T298883) [15:22:47] herron: <3 [15:26:18] (03PS2) 10JMeybohm: Add rule to alert on terminated/failing rsyslog mmkubernetes [alerts] - 10https://gerrit.wikimedia.org/r/752652 (https://phabricator.wikimedia.org/T289766) [15:30:23] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33171/console" [puppet] - 10https://gerrit.wikimedia.org/r/752677 (https://phabricator.wikimedia.org/T298883) (owner: 10Herron) [15:30:28] (03CR) 10Ema: [V: 03+2 C: 03+2] Release 6.0.9-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/752153 (https://phabricator.wikimedia.org/T298758) (owner: 10Ema) [15:31:09] (03CR) 10Elukey: [C: 03+1] "LGTM, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/752677 (https://phabricator.wikimedia.org/T298883) (owner: 10Herron) [15:32:15] (03CR) 10Herron: kafka-logging: move to fixed UID/GID for kafka user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/752677 (https://phabricator.wikimedia.org/T298883) (owner: 10Herron) [15:34:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T297191)', diff saved to https://phabricator.wikimedia.org/P18474 and previous config saved to /var/cache/conftool/dbconfig/20220110-153421-marostegui.json [15:34:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [15:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [15:34:25] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [15:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T297191)', diff saved to https://phabricator.wikimedia.org/P18475 and previous config saved to /var/cache/conftool/dbconfig/20220110-153429-marostegui.json [15:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T297191)', diff saved to https://phabricator.wikimedia.org/P18476 and previous config saved to /var/cache/conftool/dbconfig/20220110-153559-marostegui.json [15:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:19] (03CR) 10JMeybohm: [C: 03+2] Add rule to alert on terminated/failing 
rsyslog mmkubernetes [alerts] - 10https://gerrit.wikimedia.org/r/752652 (https://phabricator.wikimedia.org/T289766) (owner: 10JMeybohm) [15:45:23] (03Merged) 10jenkins-bot: Add rule to alert on terminated/failing rsyslog mmkubernetes [alerts] - 10https://gerrit.wikimedia.org/r/752652 (https://phabricator.wikimedia.org/T289766) (owner: 10JMeybohm) [15:45:44] (03CR) 10LMata: "We have reviewed this internally in IF and is approved." [puppet] - 10https://gerrit.wikimedia.org/r/751104 (owner: 10Elukey) [15:45:46] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM dragonfly-supernode1001.eqiad.wmnet [15:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:10] (03PS3) 10Elukey: admin: allow all Analytics/DE members to manage cassandra on AQS [puppet] - 10https://gerrit.wikimedia.org/r/751104 [15:49:26] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM dragonfly-supernode1001.eqiad.wmnet [15:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:51] !log jayme@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=helm-charts,name=eqiad [15:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:05] (03CR) 10Elukey: [C: 03+2] admin: allow all Analytics/DE members to manage cassandra on AQS [puppet] - 10https://gerrit.wikimedia.org/r/751104 (owner: 10Elukey) [15:50:23] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [15:51:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P18478 and previous config saved to /var/cache/conftool/dbconfig/20220110-155103-marostegui.json [15:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:12] (03PS2) 10Andrew Bogott: wmcs/nfs/add_server: 
specify mount options tuned for NFS [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/752367 [15:56:33] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM chartmuseum1001.eqiad.wmnet [15:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:21] (03CR) 10Gehel: [C: 03+2] sre.wdqs: Integrate wcqs with wdqs cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/745629 (https://phabricator.wikimedia.org/T293638) (owner: 10Ebernhardson) [15:57:41] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM registry1003.eqiad.wmnet [15:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:29] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM registry1003.eqiad.wmnet [16:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:47] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM chartmuseum1001.eqiad.wmnet [16:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:44] (03PS1) 10Cparle: Update the way the search interface is set [extensions/MediaSearch] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/752694 (https://phabricator.wikimedia.org/T297484) [16:05:46] (03PS1) 10Cparle: Updated maint script to use fewer queries [extensions/MediaSearch] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/752695 (https://phabricator.wikimedia.org/T297484) [16:06:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P18479 and previous config saved to /var/cache/conftool/dbconfig/20220110-160608-marostegui.json [16:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:38] !log root@cumin1001 START - Cookbook sre.dns.netbox [16:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:01] 
PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:18:07] !log root@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:05] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM registry1004.eqiad.wmnet [16:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:27] (03PS1) 10Muehlenhoff: Add thirdparty/elastic65 for bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/752697 (https://phabricator.wikimedia.org/T289135) [16:20:26] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti2019.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [16:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti2019.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [16:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T297191)', diff saved to https://phabricator.wikimedia.org/P18480 and previous config saved to /var/cache/conftool/dbconfig/20220110-162114-marostegui.json [16:21:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [16:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [16:21:18] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - 
https://phabricator.wikimedia.org/T297191 [16:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T297191)', diff saved to https://phabricator.wikimedia.org/P18481 and previous config saved to /var/cache/conftool/dbconfig/20220110-162122-marostegui.json [16:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:50] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM registry1004.eqiad.wmnet [16:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:53] (03CR) 10jerkins-bot: [V: 04-1] Updated maint script to use fewer queries [extensions/MediaSearch] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/752695 (https://phabricator.wikimedia.org/T297484) (owner: 10Cparle) [16:22:04] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti2023.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [16:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti2023.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [16:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T297191)', diff saved to https://phabricator.wikimedia.org/P18482 and previous config saved to /var/cache/conftool/dbconfig/20220110-162249-marostegui.json [16:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:02] (03CR) 10Muehlenhoff: [C: 03+2] Add thirdparty/elastic65 for bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/752697 
(https://phabricator.wikimedia.org/T289135) (owner: 10Muehlenhoff) [16:25:48] (03PS1) 10Muehlenhoff: Also enable updates for elastic65 repo to thirdparty/elastic65 for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/752699 (https://phabricator.wikimedia.org/T289135) [16:26:11] (03Abandoned) 10Cparle: Updated maint script to use fewer queries [extensions/MediaSearch] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/752695 (https://phabricator.wikimedia.org/T297484) (owner: 10Cparle) [16:26:19] (03Abandoned) 10Cparle: Update the way the search interface is set [extensions/MediaSearch] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/752694 (https://phabricator.wikimedia.org/T297484) (owner: 10Cparle) [16:29:04] 10SRE-tools, 10Discovery, 10Infrastructure-Foundations, 10Discovery-Search (Current work), 10IPv6: Some Search Platform / Discovery clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271143 (10MPhamWMF) [16:29:44] (03CR) 10Muehlenhoff: [C: 03+2] Also enable updates for elastic65 repo to thirdparty/elastic65 for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/752699 (https://phabricator.wikimedia.org/T289135) (owner: 10Muehlenhoff) [16:30:05] jan_drewniak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220110T1630). 
[16:30:42] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10Gehel) [16:32:02] (03PS1) 10Ladsgroup: auto_schema: Force depool in codfw for mysql upgrades [software] - 10https://gerrit.wikimedia.org/r/752700 (https://phabricator.wikimedia.org/T239814) [16:33:16] (03PS2) 10Cparle: Update the way the search interface is set [extensions/MediaSearch] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/751836 (https://phabricator.wikimedia.org/T297484) [16:33:18] (03PS1) 10Cparle: Updated maint script to use fewer queries [extensions/MediaSearch] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/752701 (https://phabricator.wikimedia.org/T297484) [16:33:56] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Degraded RAID on elastic2035 - https://phabricator.wikimedia.org/T298853 (10MPhamWMF) [16:37:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P18483 and previous config saved to /var/cache/conftool/dbconfig/20220110-163754-marostegui.json [16:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:29] !log installing 5.10.84 kernels on bullseye hosts (no reboots involved, just installing the new kernels in parallel) [16:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:19] 10SRE, 10Discovery-Search (Current work): Consider filesystem/disk based improvements on WQDS servers - https://phabricator.wikimedia.org/T298570 (10Gehel) 05Open→03Declined [16:50:11] 10SRE, 10Discovery-Search (Current work): Get familiar with ES non-prod environments - https://phabricator.wikimedia.org/T298817 (10Gehel) 05Open→03Resolved a:03Gehel [16:52:12] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.2 point update - https://phabricator.wikimedia.org/T298021 (10MoritzMuehlenhoff) [16:52:14] !log 
varnish 6.0.9-1wm1 uploaded to buster-wikimedia - component/varnish6 T298758 [16:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:17] T298758: Package and deploy Varnish 6.0.9 - https://phabricator.wikimedia.org/T298758 [16:52:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P18484 and previous config saved to /var/cache/conftool/dbconfig/20220110-165259-marostegui.json [16:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:01] (03CR) 10Ahmon Dancy: [C: 03+1] "I'm looking forward to testing this!" [puppet] - 10https://gerrit.wikimedia.org/r/752600 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto) [16:54:37] 10SRE-tools, 10Discovery, 10Infrastructure-Foundations, 10Discovery-Search (Current work), 10IPv6: Some Search Platform / Discovery clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271143 (10MPhamWMF) [16:55:07] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup100[12] - https://phabricator.wikimedia.org/T274206 (10jcrespo) 05Open→03Resolved Deployed the change as this: `lines=10,lang=diff commit baeb288d4d9713814ac88e9537bbcf0ece5bb9e4... 
[16:56:41] (03Abandoned) 10Jcrespo: mediabackups: Add minio port to ipv6 connections [puppet] - 10https://gerrit.wikimedia.org/r/749561 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [16:58:22] (03PS1) 10Ssingh: dnsrecursor: add support for DoT to auth servers [puppet] - 10https://gerrit.wikimedia.org/r/752706 [16:59:28] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33172/console" [puppet] - 10https://gerrit.wikimedia.org/r/752706 (owner: 10Ssingh) [17:00:01] 10SRE-tools, 10Discovery, 10Infrastructure-Foundations, 10Discovery-Search (Current work), 10IPv6: Some Search Platform / Discovery clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271143 (10MPhamWMF) [17:02:29] (03PS2) 10Ssingh: dnsrecursor: add support for DoT to auth servers [puppet] - 10https://gerrit.wikimedia.org/r/752706 [17:03:39] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33173/console" [puppet] - 10https://gerrit.wikimedia.org/r/752706 (owner: 10Ssingh) [17:05:12] (03PS3) 10Ssingh: dnsrecursor: add support for DoT to auth servers [puppet] - 10https://gerrit.wikimedia.org/r/752706 [17:06:05] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33174/console" [puppet] - 10https://gerrit.wikimedia.org/r/752706 (owner: 10Ssingh) [17:07:27] (03CR) 10Ssingh: [V: 03+1] "PCC confirms no change to existing hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/752706 (owner: 10Ssingh) [17:08:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T297191)', diff saved to https://phabricator.wikimedia.org/P18485 and previous config saved to /var/cache/conftool/dbconfig/20220110-170804-marostegui.json [17:08:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [17:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [17:08:07] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [17:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T297191)', diff saved to https://phabricator.wikimedia.org/P18486 and previous config saved to /var/cache/conftool/dbconfig/20220110-170811-marostegui.json [17:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T297191)', diff saved to https://phabricator.wikimedia.org/P18487 and previous config saved to /var/cache/conftool/dbconfig/20220110-170941-marostegui.json [17:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:03] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubernetes1005.eqiad.wmnet [17:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:14] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubernetes1005.eqiad.wmnet [17:16:15] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:28] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubernetes1006.eqiad.wmnet [17:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:31] (03PS1) 10Eigyan: wmf-config: Add audience to gdi-survey on cawiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752708 (https://phabricator.wikimedia.org/T297623) [17:22:45] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:23:02] that's me [17:23:07] (03PS2) 10Eigyan: wmf-config: Update coverage to 0.5 in gdi-survey on cawiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752708 (https://phabricator.wikimedia.org/T297623) [17:23:32] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubernetes1006.eqiad.wmnet [17:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P18488 and previous config saved to /var/cache/conftool/dbconfig/20220110-172446-marostegui.json [17:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:00] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubernetes1015.eqiad.wmnet [17:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:08] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubernetes1015.eqiad.wmnet [17:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:15] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubernetes1016.eqiad.wmnet [17:32:16] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:23] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubernetes1016.eqiad.wmnet [17:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:41] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10JMeybohm) [17:39:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P18489 and previous config saved to /var/cache/conftool/dbconfig/20220110-173950-marostegui.json [17:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:40] 10SRE, 10SRE-OnFire, 10Sustainability (Incident Followup): Incident: 2021-12-03 mx2001->Gmail delivery issues - https://phabricator.wikimedia.org/T297127 (10Krinkle) [17:54:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T297191)', diff saved to https://phabricator.wikimedia.org/P18491 and previous config saved to /var/cache/conftool/dbconfig/20220110-175455-marostegui.json [17:54:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [17:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [17:54:59] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [17:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T297191)', diff saved to 
https://phabricator.wikimedia.org/P18492 and previous config saved to /var/cache/conftool/dbconfig/20220110-175503-marostegui.json [17:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T297191)', diff saved to https://phabricator.wikimedia.org/P18493 and previous config saved to /var/cache/conftool/dbconfig/20220110-175633-marostegui.json [17:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] ryankemper: #bothumor I � Unicode. All rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220110T1800). [18:06:11] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops, and 2 others: cr-codfw: set up static route for 185.15.57.8/30 - https://phabricator.wikimedia.org/T295288 (10Krinkle) [18:09:23] 10SRE, 10SRE-OnFire, 10Sustainability (Incident Followup): 2021-10-07 network provider issues causing all Wikimedia sites to be unreachable for many users - https://phabricator.wikimedia.org/T292792 (10Krinkle) [18:09:28] (03CR) 10Bking: [C: 03+1] query_service: Provide return-to url with auth checks [puppet] - 10https://gerrit.wikimedia.org/r/739942 (https://phabricator.wikimedia.org/T295676) (owner: 10Ebernhardson) [18:10:41] (03CR) 10Bking: [C: 03+2] rdf query service: limit namespace aliasing to /bigdata/namespace [puppet] - 10https://gerrit.wikimedia.org/r/744892 (owner: 10Ebernhardson) [18:10:43] (03CR) 10Scardenasmolinar: [C: 03+1] "LGTM!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/752708 (https://phabricator.wikimedia.org/T297623) (owner: 10Eigyan) [18:11:16] 10SRE, 10Sustainability (Incident Followup): 14 March 2021 Wikimedia API Outage - https://phabricator.wikimedia.org/T277417 (10Krinkle) [18:11:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P18494 and previous config saved to /var/cache/conftool/dbconfig/20220110-181137-marostegui.json [18:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:10] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 Kafka main eqiad servers with buster / firmware update needed - https://phabricator.wikimedia.org/T298867 (10wiki_willy) a:03Cmjohnson [18:26:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P18495 and previous config saved to /var/cache/conftool/dbconfig/20220110-182642-marostegui.json [18:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:24] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Degraded RAID on elastic2035 - https://phabricator.wikimedia.org/T298853 (10wiki_willy) a:03Papaul Hi @RKemper - since the refresh for this host was installed via T294154, are you ok if we ignore/resolve this degraded raid alert? 
Thanks, Willy [18:28:57] 10SRE, 10ops-codfw: host ps1-d1-codfw down since a long time but still monitored - https://phabricator.wikimedia.org/T298800 (10wiki_willy) a:03Papaul [18:31:31] (03CR) 10Dzahn: [C: 03+2] delete uncompressed HTML files, only keep compressed HTML [container/miscweb] - 10https://gerrit.wikimedia.org/r/752232 (owner: 10Dzahn) [18:31:49] (03CR) 10MSantos: [C: 03+1] Disable tilerator in all envs maps are deployed [puppet] - 10https://gerrit.wikimedia.org/r/752145 (https://phabricator.wikimedia.org/T298246) (owner: 10Jgiannelos) [18:32:08] deletes 10000 files and hopes CI is all cool with that [18:32:31] but reduces image size [18:35:58] (03Merged) 10jenkins-bot: delete uncompressed HTML files, only keep compressed HTML [container/miscweb] - 10https://gerrit.wikimedia.org/r/752232 (owner: 10Dzahn) [18:36:17] :) merged by CI. good [18:41:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T297191)', diff saved to https://phabricator.wikimedia.org/P18496 and previous config saved to /var/cache/conftool/dbconfig/20220110-184147-marostegui.json [18:41:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [18:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [18:41:51] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [18:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T297191)', diff saved to https://phabricator.wikimedia.org/P18497 and previous config saved to 
/var/cache/conftool/dbconfig/20220110-184154-marostegui.json [18:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:48] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1112 - DIMM replacement - https://phabricator.wikimedia.org/T294345 (10Cmjohnson) [18:50:22] (03PS2) 10Dzahn: fix content type for HTML, it's not CSS [container/miscweb] - 10https://gerrit.wikimedia.org/r/752235 (https://phabricator.wikimedia.org/T281538) [18:50:45] (03CR) 10Dzahn: "too much time spent wondering why this kind of works but is not rendered as HTML" [container/miscweb] - 10https://gerrit.wikimedia.org/r/752235 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [18:54:49] (03CR) 10Dzahn: "meant more security vs infra-security" [puppet] - 10https://gerrit.wikimedia.org/r/751165 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [18:58:17] (03CR) 10Dzahn: "@Reedy any thoughts on usage of peek in the future from your team?" [puppet] - 10https://gerrit.wikimedia.org/r/751165 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [19:00:03] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 233, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:00:04] RoanKattouw and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220110T1900). [19:00:04] SCardenasM : A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:24] i can deploy today [19:00:29] SCardenasM: hi, are you around? [19:00:37] Hi! I am here! 
[19:01:07] I don't recall your name, so...welcome to B&C if you weren't here before [19:01:25] RECOVERY - Host msw1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [19:01:28] This is my first backport [19:01:52] Congrats! [19:02:08] Thanks :) [19:02:25] okay, noted. If anything is unclear during the process, please do speak up -- no stupid questions exists, and I'm here to guide you through it :) [19:02:31] (03PS6) 10Urbanecm: Enable TheWikipediaLibrary on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742996 (https://phabricator.wikimedia.org/T288070) (owner: 10Jsn.sherman) [19:03:05] (03CR) 10Urbanecm: [C: 03+2] Enable TheWikipediaLibrary on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742996 (https://phabricator.wikimedia.org/T288070) (owner: 10Jsn.sherman) [19:03:22] (03PS7) 10Dzahn: apache: Replace zero.wikipedia.org vhost alias with redirect [puppet] - 10https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) (owner: 10Jforrester) [19:03:40] SCardenasM: if you didn't do that already, can you please install https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_usage in your browser? It will help you to test the change before it goes to the users. [19:03:40] (03CR) 10Dzahn: "PS7: I added the test joe asked about previously." [puppet] - 10https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) (owner: 10Jforrester) [19:03:41] RECOVERY - Juniper alarms on cr2-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [19:04:06] urbanecm got it. Installing now [19:04:10] thank you [19:04:14] (03Merged) 10jenkins-bot: Enable TheWikipediaLibrary on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742996 (https://phabricator.wikimedia.org/T288070) (owner: 10Jsn.sherman) [19:05:31] (03CR) 10Dzahn: "joe, let's merge this one before the other rewrite? 
it's about time for this one" [puppet] - 10https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) (owner: 10Jforrester) [19:06:17] 10SRE, 10ops-eqiad: msw-a8-eqiad potentially down - https://phabricator.wikimedia.org/T298869 (10Cmjohnson) The mgmt switch power led was amber, tried pulling the power and plugging back in but no change. We had a spare wmf4921, racked it, and moved all the mgmt cables. I need to update netbox and check with... [19:06:22] SCardenasM: your patch is now at mwdebug1001. When you're ready, please enable that extension, pick "mwdebug1001.eqiad.wmnet" in there, and try to make sure the change does what it is supposed to (a minimum test would consist of "ensure wiki doesn't break and the newly-enabled extension appears at Special:Version in the list of installed extension"). [19:07:11] since your change affects all the wikis, you can pick whichever wiki you want as your testing ground [19:07:47] (03CR) 10Dzahn: "it's just that now I expect this also needs to be deployed in k8s meanwhile" [puppet] - 10https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) (owner: 10Jforrester) [19:08:18] SCardenasM: feel free to take your time during the process and let me know how that goes. You're the only B&C customer today, so you've a plenty of time. [19:08:38] urbanecm: ok. I'll test right now, although I might need a user with 50,000+ edits to test that the notification is shown. It should trigger once a user makes 2 edits [19:08:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:40] If it shows for users who already saw it at meta, I can take on that part. If not, I don't think we need to test that part (it went through testing both at meta and beta AFAIK). 
[19:10:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:30] urbanecm: yup, we got it tested in Meta and test [19:10:41] 10SRE, 10Fundraising-Backlog, 10observability, 10serviceops-radar: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (10Dzahn) sure, sounds all good to me :) [19:11:17] SCardenasM: great. Then just the minimum test (nothing breaks + listed in special:Version) would be enough from my side (unless it's possible to easily test more, of course). [19:14:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:58] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T298719 (10Dzahn) @JVargas Hello, don't worry about that part. it was a little bug in software but it has since been fixed. Thanks for adding that Lisa is a your manager. I'll get this going! [19:23:07] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T298719 (10Dzahn) p:05Triage→03High a:03Dzahn [19:23:27] SCardenasM: how is it going? 🙂 Anything i can help with? [19:24:46] urbanecm: I have tested on en wiki (logged in to my personal account, browsed some articles, and made an edit and it looks like nothing broke). I also checked that the version in Special:Version is the latest version of the extension [19:25:00] okay, excellent [19:25:38] no errors in the logs [19:25:39] deploying [19:26:05] RECOVERY - DNS on mw1454.mgmt is OK: DNS OK: 0.012 seconds response time. 
mw1454.mgmt.eqiad.wmnet returns 10.65.1.194 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:27:05] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 8f5ca9af5ef04d1d19759cdf201fc0c7e4ee6fbc: Enable TheWikipediaLibrary on most wikis (T288070) (duration: 01m 00s) [19:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:08] T288070: Deploy The Wikipedia Library Echo notification with 50,000 edit count threshold - https://phabricator.wikimedia.org/T288070 [19:27:17] SCardenasM: it's live [19:27:21] anything else i can do for you today? [19:28:30] Nothing else! Let me know if there are any problems [19:28:59] will do SCardenasM [19:29:14] !log UTC evening B&C finished [19:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:13] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T298719 (10Dzahn) confirmed all this is in Namely. adding to wmf group [19:33:32] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T298719 (10Dzahn) @JVargas You have been added to the requested group. This gives you a bunch of new access. You can see a list here: https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups#wmf_group There... 
[19:34:45] (03PS1) 10Ebernhardson: cirrussearch: Reenable saneitizer [puppet] - 10https://gerrit.wikimedia.org/r/752724 (https://phabricator.wikimedia.org/T295705) [19:34:59] bruh: T298784 [19:35:00] T298784: Security Issue Access Request for Zabe - https://phabricator.wikimedia.org/T298784 [19:37:03] 10SRE, 10ops-codfw: host ps1-d1-codfw down since a long time but still monitored - https://phabricator.wikimedia.org/T298800 (10Dzahn) [19:38:06] 10SRE, 10ops-codfw: host ps1-d1-codfw down since a long time but still monitored - https://phabricator.wikimedia.org/T298800 (10Dzahn) extended downtime for "this host and all services on it" by 2 months [19:42:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T297191)', diff saved to https://phabricator.wikimedia.org/P18499 and previous config saved to /var/cache/conftool/dbconfig/20220110-194214-marostegui.json [19:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:18] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [19:45:53] (03PS1) 10Dzahn: admins: add jvargas to ldap_only_admins, added to wmf group [puppet] - 10https://gerrit.wikimedia.org/r/752725 (https://phabricator.wikimedia.org/T298719) [19:48:07] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar), 10User-ema: Package and deploy Varnish 6.0.9 - https://phabricator.wikimedia.org/T298758 (10kchapman) [19:49:06] (03CR) 10Dzahn: "Cathal, if you wanna cross check me and close out https://phabricator.wikimedia.org/T298719 .. all it should need is merging this." 
[puppet] - 10https://gerrit.wikimedia.org/r/752725 (https://phabricator.wikimedia.org/T298719) (owner: 10Dzahn) [19:51:42] 10SRE, 10ops-codfw: host ps1-d1-codfw down since a long time but still monitored - https://phabricator.wikimedia.org/T298800 (10Papaul) 05Open→03Resolved thanks @Dzahn [19:52:15] RECOVERY - DNS on mw1455.mgmt is OK: DNS OK: 0.014 seconds response time. mw1455.mgmt.eqiad.wmnet returns 10.65.1.195 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:52:29] RECOVERY - DNS on mw1453.mgmt is OK: DNS OK: 0.012 seconds response time. mw1453.mgmt.eqiad.wmnet returns 10.65.1.193 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:52:33] (03CR) 10Dzahn: "already added to the actual group on mwmaint1002, so for user it already works. this is to keep reality in sync with code. if we don't che" [puppet] - 10https://gerrit.wikimedia.org/r/752725 (https://phabricator.wikimedia.org/T298719) (owner: 10Dzahn) [19:53:26] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to for - https://phabricator.wikimedia.org/T298719 (10Dzahn) a:05Dzahn→03None [19:53:34] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to for - https://phabricator.wikimedia.org/T298719 (10Dzahn) p:05High→03Medium [19:57:01] 10SRE, 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10Dzahn) ACK, understood @Hashar, then this goes back to #ops-codfw [19:57:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P18500 and previous config saved to /var/cache/conftool/dbconfig/20220110-195719-marostegui.json [19:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:00] 10SRE, 10ops-codfw, 
10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10Dzahn) @Papaul it seems that "broken DRAC" is actually the reason for b... [19:59:20] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10observability: contint2001.mgmt disappeared from Icinga - https://phabricator.wikimedia.org/T298861 (10Dzahn) [20:00:10] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10observability: contint2001.mgmt disappeared from Icinga - https://phabricator.wikimedia.org/T298861 (10Dzahn) [20:00:30] 10SRE, 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10Dzahn) [20:01:24] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10observability: contint2001.mgmt disappeared from Icinga - https://phabricator.wikimedia.org/T298861 (10Dzahn) added both T283582 and T294276 as parent tasks cc: #releng-radar [20:03:26] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10observability, 10serviceops-radar, 10Release-Engineering-Team (Radar): contint2001.mgmt disappeared from Icinga - https://phabricator.wikimedia.org/T298861 (10Dzahn) [20:05:40] (03CR) 10SBassett: [C: 03+1] "I can't imagine this will ever be used again." [puppet] - 10https://gerrit.wikimedia.org/r/751165 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [20:06:49] RECOVERY - DNS on db1161.mgmt is OK: DNS OK: 0.013 seconds response time. 
db1161.mgmt.eqiad.wmnet returns 10.65.0.174 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:08:25] (03CR) 10Dzahn: "recheck" [container/miscweb] - 10https://gerrit.wikimedia.org/r/752235 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [20:08:43] RECOVERY - DNS on ganeti1023.mgmt is OK: DNS OK: 0.009 seconds response time. ganeti1023.mgmt.eqiad.wmnet returns 10.65.1.207 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:09:29] (03PS1) 10Ssingh: O:wikidough: enable DoT to auth servers [puppet] - 10https://gerrit.wikimedia.org/r/752726 [20:09:38] (03CR) 10Dzahn: [C: 03+2] "ok, thanks. I'll merge it." [puppet] - 10https://gerrit.wikimedia.org/r/751165 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [20:12:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P18501 and previous config saved to /var/cache/conftool/dbconfig/20220110-201224-marostegui.json [20:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:32] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) [20:20:52] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) [20:21:09] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) [20:23:55] (03PS1) 10Andrew Bogott: nfs-exportd: Handle an edge case for the one-volume-per-server future [puppet] - 10https://gerrit.wikimedia.org/r/752730 [20:24:51] RECOVERY - DNS on mw1456.mgmt is OK: DNS OK: 0.013 seconds response time. 
mw1456.mgmt.eqiad.wmnet returns 10.65.1.196 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:26:02] (03CR) 10Andrew Bogott: [C: 03+2] nfs-exportd: Handle an edge case for the one-volume-per-server future [puppet] - 10https://gerrit.wikimedia.org/r/752730 (owner: 10Andrew Bogott) [20:26:37] (03CR) 10Thcipriani: [C: 03+1] wikitech: Remove password clear on block [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752185 (owner: 10BryanDavis) [20:27:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T297191)', diff saved to https://phabricator.wikimedia.org/P18502 and previous config saved to /var/cache/conftool/dbconfig/20220110-202728-marostegui.json [20:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:33] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [20:29:43] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:32:40] (03CR) 10Thcipriani: [C: 03+1] "neat!" [puppet] - 10https://gerrit.wikimedia.org/r/752600 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto) [20:33:19] (03CR) 10Andrew Bogott: [C: 03+2] wmcs/nfs/add_server: specify mount options tuned for NFS [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/752367 (owner: 10Andrew Bogott) [20:51:37] 10SRE, 10Wikimedia-Mailing-lists: Wikipedia-l list needs owners - https://phabricator.wikimedia.org/T295244 (10Quiddity) After some recent spam over the weekend (:-/) I've asked on-list. 
[20:53:03] 10SRE, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10Ahecht) Stretch was supposed to be phased out by June 2021 per https://wikitech.wikimedia.org/wiki/Operating_system_upgrade_policy, and will be EOL in less than 6 months (June 30, 2022) p... [20:57:49] (03PS1) 10Ebernhardson: wcqs: Deploy streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/752737 [21:00:05] chrisalbon and accraze: Dear deployers, time to do the Services – Graphoid / ORES deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220110T2100). [21:00:21] PROBLEM - SSH on restbase2010.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:18:06] 10SRE, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10AntiCompositeNumber) >>! In T216815#7610660, @Ahecht wrote: > Is any work being done on this? At the moment, no. Thumbor currently has no maintainer, see T294484. [21:31:05] PROBLEM - SSH on mw2252.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:49:03] (03PS2) 10MSantos: maps: script to send zoom level expiration events [puppet] - 10https://gerrit.wikimedia.org/r/740236 [21:53:53] (03PS3) 10MSantos: maps: script to send zoom level expiration events [puppet] - 10https://gerrit.wikimedia.org/r/740236 [22:00:05] Reedy and sbassett: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220110T2200). 
[22:01:55] (03CR) 10Dzahn: [C: 03+2] fix content type for HTML, it's not CSS [container/miscweb] - 10https://gerrit.wikimedia.org/r/752235 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [22:02:51] 10SRE, 10Fundraising-Backlog, 10observability, 10serviceops-radar: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (10jgleeson) Great news @Dzahn. I just got around to testing out the new permissions and it worked! I was able to ACK... [22:05:42] (03Merged) 10jenkins-bot: fix content type for HTML, it's not CSS [container/miscweb] - 10https://gerrit.wikimedia.org/r/752235 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [22:10:16] (03PS1) 10Nray: Fix TypeError: document.querySelectorAll(...).forEach is not a function [skins/Vector] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/752766 (https://phabricator.wikimedia.org/T298910) [22:12:12] 10SRE, 10SRE-Access-Requests, 10Fundraising-Backlog, 10observability, 10serviceops-radar: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (10Dzahn) a:05jgleeson→03None [22:12:33] (03PS1) 10MSantos: maps: introduce imposm-geometry-import [puppet] - 10https://gerrit.wikimedia.org/r/752748 (https://phabricator.wikimedia.org/T218097) [22:18:21] 10SRE, 10SRE-Access-Requests, 10Fundraising-Backlog, 10observability, 10serviceops-radar: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (10Dzahn) @jgleeson Thanks for confirming, great! We have a rotating clinic duty each week ha... 
[22:19:36] (03CR) 10Jdlrobson: [C: 03+1] Fix TypeError: document.querySelectorAll(...).forEach is not a function [skins/Vector] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/752766 (https://phabricator.wikimedia.org/T298910) (owner: 10Nray) [22:22:47] (03PS1) 10Dzahn: miscweb: bump staging and prod version to 2022-01-10-220730-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/752750 (https://phabricator.wikimedia.org/T281538) [22:23:25] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:24:11] (03CR) 10Dzahn: [C: 03+2] miscweb: bump staging and prod version to 2022-01-10-220730-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/752750 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [22:28:53] (03Merged) 10jenkins-bot: miscweb: bump staging and prod version to 2022-01-10-220730-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/752750 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [22:32:52] (03PS1) 10Jdlrobson: Enable CirrusSearch on it/en Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752751 [22:33:25] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:34:12] !log dzahn@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply on main [22:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:02] !log dzahn@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: sync on main [22:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:27] (03PS1) 10Cwhite: hiera: add opensearch production configuration (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/752755 (https://phabricator.wikimedia.org/T288621) 
[22:43:29] (03PS1) 10Cwhite: site: reprovision eqiad logging cluster to opensearch [puppet] - 10https://gerrit.wikimedia.org/r/752756 (https://phabricator.wikimedia.org/T288621) [22:45:24] (03PS9) 10Cwhite: role: add apifeatureusage role [puppet] - 10https://gerrit.wikimedia.org/r/747635 (https://phabricator.wikimedia.org/T297239) [22:46:47] (03PS10) 10Cwhite: role: add apifeatureusage role [puppet] - 10https://gerrit.wikimedia.org/r/747635 (https://phabricator.wikimedia.org/T297239) [22:48:13] (03CR) 10Bking: [C: 03+2] query_service: Provide return-to url with auth checks [puppet] - 10https://gerrit.wikimedia.org/r/739942 (https://phabricator.wikimedia.org/T295676) (owner: 10Ebernhardson) [22:48:34] (03PS3) 10Bking: query_service: Provide return-to url with auth checks [puppet] - 10https://gerrit.wikimedia.org/r/739942 (https://phabricator.wikimedia.org/T295676) (owner: 10Ebernhardson) [23:02:45] RECOVERY - SSH on restbase2010.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:07:34] (03PS1) 10Jdlrobson: Skip vector-2022 skin in config, not Vector skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752760 (https://phabricator.wikimedia.org/T298923) [23:20:33] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:34:25] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:47:24] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10elappen-WMF) Commenting on behalf of the staff working on Wikimania to note that we'd like to move wi...