[00:38:25] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/995364 [00:38:31] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/995364 (owner: TrainBranchBot) [00:42:35] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:59:37] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [00:59:41] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [01:01:10] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/995364 (owner: TrainBranchBot) [01:29:42] (PS5) RLazarus: Helm chart for k8s-controller-sidecars [deployment-charts] - https://gerrit.wikimedia.org/r/988847 (https://phabricator.wikimedia.org/T348284) [01:29:44] (PS5) RLazarus: admin_ng: Install k8s-controller-sidecars [deployment-charts] - https://gerrit.wikimedia.org/r/988848 (https://phabricator.wikimedia.org/T348284) [01:37:45] PROBLEM - Check systemd state on ms-be1063 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service,systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:39:09] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:45:32] (CR) RLazarus: Helm chart for k8s-controller-sidecars (8 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/988847 (https://phabricator.wikimedia.org/T348284) (owner: RLazarus) [01:46:03] (CR) RLazarus: admin_ng: Install k8s-controller-sidecars (1 comment) [deployment-charts] - 
https://gerrit.wikimedia.org/r/988848 (https://phabricator.wikimedia.org/T348284) (owner: RLazarus) [01:59:09] RECOVERY - Check systemd state on ms-be1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:23:41] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [02:39:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:15] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:01:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:09:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:18:41] (KubernetesAPINotScrapable) firing: (4) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [03:32:47] (PS11) Andrew Bogott: OpenStack Designate: move from cloudservices to cloudcontrols in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/995369 (https://phabricator.wikimedia.org/T350995) [03:33:50] (CR) Andrew Bogott: "check experimental" [puppet] - 
https://gerrit.wikimedia.org/r/995369 (https://phabricator.wikimedia.org/T350995) (owner: Andrew Bogott) [03:34:03] PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-volatile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:41:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:46:17] RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:46:35] PROBLEM - Disk space on build2001 is CRITICAL: DISK CRITICAL - free space: / 12547 MB (5% inode=67%): /tmp 12547 MB (5% inode=67%): /var/tmp 12547 MB (5% inode=67%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=build2001&var-datasource=codfw+prometheus/ops [03:47:43] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-k8s-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:18:47] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:44:51] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:09:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2101.codfw.wmnet with reason: Maintenance [06:10:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2101.codfw.wmnet with reason: 
Maintenance [06:11:25] !log Drop mathoid, mathlatexml tables T355050 [06:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:29] T355050: SQL: DROP TABLE IF EXISTS mathoid, mathlatexml; - https://phabricator.wikimedia.org/T355050 [06:14:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2111.codfw.wmnet with reason: Maintenance [06:15:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2111.codfw.wmnet with reason: Maintenance [06:15:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2111 (T355609)', diff saved to https://phabricator.wikimedia.org/P56170 and previous config saved to /var/cache/conftool/dbconfig/20240205-061511-marostegui.json [06:15:15] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [06:15:53] (PS1) Marostegui: db1173: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/996983 [06:17:25] (CR) Marostegui: [C: +2] db1173: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/996983 (owner: Marostegui) [06:21:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:22:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T355609)', diff saved to https://phabricator.wikimedia.org/P56171 and previous config saved to /var/cache/conftool/dbconfig/20240205-062203-marostegui.json [06:22:07] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [06:37:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P56172 and previous config saved to /var/cache/conftool/dbconfig/20240205-063709-marostegui.json [06:52:16] !log marostegui@cumin1002 
dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P56173 and previous config saved to /var/cache/conftool/dbconfig/20240205-065216-marostegui.json [06:54:28] !log Drop indexes on site table on s8 T356417 [06:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:33] T356417: Drop random indexes of sites table in production - https://phabricator.wikimedia.org/T356417 [06:56:13] !log dbamaint Drop mathoid, mathlatexml tables T355050 [06:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:16] T355050: SQL: DROP TABLE IF EXISTS mathoid, mathlatexml; - https://phabricator.wikimedia.org/T355050 [06:56:19] !log dbmaint Drop indexes on site table on s8 T356417 [06:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T355609)', diff saved to https://phabricator.wikimedia.org/P56174 and previous config saved to /var/cache/conftool/dbconfig/20240205-070723-marostegui.json [07:07:25] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [07:07:36] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [07:07:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [07:07:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2123 (T355609)', diff saved to https://phabricator.wikimedia.org/P56175 and previous config saved to /var/cache/conftool/dbconfig/20240205-070745-marostegui.json [07:12:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T355609)', diff saved to https://phabricator.wikimedia.org/P56176 and previous config saved to /var/cache/conftool/dbconfig/20240205-071259-marostegui.json [07:13:14] 
T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [07:18:57] (KubernetesAPINotScrapable) firing: (4) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [07:28:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P56177 and previous config saved to /var/cache/conftool/dbconfig/20240205-072805-marostegui.json [07:31:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [07:37:21] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 112 probes of 729 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:43:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P56178 and previous config saved to /var/cache/conftool/dbconfig/20240205-074312-marostegui.json [07:46:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb2002.codfw.wmnet [07:48:25] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 42 probes of 729 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:50:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb2002.codfw.wmnet [07:50:49] (PS1) Zabe: namespaceDupes: Reduce batchsize to 100 for link update [core] (wmf/1.42.0-wmf.16) - 
https://gerrit.wikimedia.org/r/995242 [07:51:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb1002.eqiad.wmnet [07:55:22] !log zabe@mwmaint2002:/tmp/uploads$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user="Illegitimate Barrister" . # T356607 [07:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:26] T356607: Server side upload for Illegitimate Barrister - https://phabricator.wikimedia.org/T356607 [07:56:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb1002.eqiad.wmnet [07:58:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T355609)', diff saved to https://phabricator.wikimedia.org/P56179 and previous config saved to /var/cache/conftool/dbconfig/20240205-075818-marostegui.json [07:58:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2128.codfw.wmnet with reason: Maintenance [07:58:23] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [07:58:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2128.codfw.wmnet with reason: Maintenance [07:58:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2186.codfw.wmnet with reason: Maintenance [07:58:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2186.codfw.wmnet with reason: Maintenance [07:58:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2128 (T355609)', diff saved to https://phabricator.wikimedia.org/P56180 and previous config saved to /var/cache/conftool/dbconfig/20240205-075856-marostegui.json [07:59:49] SRE, Infrastructure-Foundations, Patch-For-Review: Connection errors to some hosts from cumin1002 - https://phabricator.wikimedia.org/T356174 (MoritzMuehlenhoff) [08:00:05] Amir1 and Urbanecm: It is that lovely time of 
the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240205T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:00:49] * zabe is going to backport 2 patches [08:01:09] (PS2) Zabe: namespaceDupes: Reduce batchsize to 100 for link update [core] (wmf/1.42.0-wmf.16) - https://gerrit.wikimedia.org/r/995242 [08:01:15] (CR) Zabe: [C: +2] specials: Remove null comments from formatter on Special:ProtectedPages [core] (wmf/1.42.0-wmf.16) - https://gerrit.wikimedia.org/r/995232 (https://phabricator.wikimedia.org/T356337) (owner: Jforrester) [08:01:21] (CR) Zabe: [C: +2] namespaceDupes: Reduce batchsize to 100 for link update [core] (wmf/1.42.0-wmf.16) - https://gerrit.wikimedia.org/r/995242 (owner: Zabe) [08:01:45] (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [08:04:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T355609)', diff saved to https://phabricator.wikimedia.org/P56181 and previous config saved to /var/cache/conftool/dbconfig/20240205-080407-marostegui.json [08:04:19] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [08:04:31] (PS1) Muehlenhoff: Remove access for nskaggs [puppet] - https://gerrit.wikimedia.org/r/997251 [08:04:49] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 45 probes of 793 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:06:26] (CR) Muehlenhoff: [C: +2] Remove access for nskaggs 
[puppet] - https://gerrit.wikimedia.org/r/997251 (owner: Muehlenhoff) [08:08:34] (PS1) Slyngshede: SystemdUnitFailed: Increase the severity of a failed unit to critical. [alerts] - https://gerrit.wikimedia.org/r/997253 [08:09:51] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Nskaggs out of all services on: 2205 hosts [08:10:11] (CR) CI reject: [V: -1] SystemdUnitFailed: Increase the severity of a failed unit to critical. [alerts] - https://gerrit.wikimedia.org/r/997253 (owner: Slyngshede) [08:10:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Nskaggs out of all services on: 2205 hosts [08:11:56] (PS1) Muehlenhoff: Replace Nicholas as group approver with Joanna [puppet] - https://gerrit.wikimedia.org/r/997254 [08:14:21] (CR) Muehlenhoff: [C: +2] Replace Nicholas as group approver with Joanna [puppet] - https://gerrit.wikimedia.org/r/997254 (owner: Muehlenhoff) [08:19:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P56182 and previous config saved to /var/cache/conftool/dbconfig/20240205-081914-marostegui.json [08:20:33] (Merged) jenkins-bot: specials: Remove null comments from formatter on Special:ProtectedPages [core] (wmf/1.42.0-wmf.16) - https://gerrit.wikimedia.org/r/995232 (https://phabricator.wikimedia.org/T356337) (owner: Jforrester) [08:22:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:22:33] (Merged) jenkins-bot: namespaceDupes: Reduce batchsize to 100 for link update [core] (wmf/1.42.0-wmf.16) - 
https://gerrit.wikimedia.org/r/995242 (owner: Zabe) [08:24:16] !log zabe@deploy2002 Started scap: Backport for [[gerrit:995232|specials: Remove null comments from formatter on Special:ProtectedPages (T356337)]], [[gerrit:995242|namespaceDupes: Reduce batchsize to 100 for link update]] [08:24:20] T356337: PHP Warning: Comment text should not be null! via Special:ProtectedPages - https://phabricator.wikimedia.org/T356337 [08:26:41] SRE, ops-eqiad: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T354499 (MoritzMuehlenhoff) @Jclark-ctr Since the server is OOW; do you have a spare disk from a decommed server? [08:31:45] (SwiftTooManyMediaUploads) resolved: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [08:32:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:32:25] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 33 probes of 793 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:34:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P56183 and previous config saved to /var/cache/conftool/dbconfig/20240205-083420-marostegui.json [08:35:07] !log zabe@deploy2002 zabe and jforrester: Backport for [[gerrit:995232|specials: 
Remove null comments from formatter on Special:ProtectedPages (T356337)]], [[gerrit:995242|namespaceDupes: Reduce batchsize to 100 for link update]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:35:11] T356337: PHP Warning: Comment text should not be null! via Special:ProtectedPages - https://phabricator.wikimedia.org/T356337 [08:35:24] !log uploaded openjdk-8 8u402-ga-2~deb11u1 (latest Java 8 security fixes for Bullseye) [08:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:29] !log zabe@deploy2002 zabe and jforrester: Continuing with sync [08:36:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:40:59] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 48 probes of 793 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:42:35] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:995232|specials: Remove null comments from formatter on Special:ProtectedPages (T356337)]], [[gerrit:995242|namespaceDupes: Reduce batchsize to 100 for link update]] (duration: 18m 18s) [08:42:38] T356337: PHP Warning: Comment text should not be null! 
via Special:ProtectedPages - https://phabricator.wikimedia.org/T356337 [08:46:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:49:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T355609)', diff saved to https://phabricator.wikimedia.org/P56184 and previous config saved to /var/cache/conftool/dbconfig/20240205-084927-marostegui.json [08:49:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2137.codfw.wmnet with reason: Maintenance [08:49:31] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [08:49:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2137.codfw.wmnet with reason: Maintenance [08:49:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2137:3315 (T355609)', diff saved to https://phabricator.wikimedia.org/P56185 and previous config saved to /var/cache/conftool/dbconfig/20240205-084949-marostegui.json [08:57:33] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 17 probes of 793 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:57:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T355609)', diff saved to https://phabricator.wikimedia.org/P56186 and previous config saved to /var/cache/conftool/dbconfig/20240205-085746-marostegui.json [08:57:51] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [09:05:15] 
(MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:09:44] !log installing perf updates on bullseye hosts [09:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:41] !log installing Java 8 security updates [09:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P56187 and previous config saved to /var/cache/conftool/dbconfig/20240205-091253-marostegui.json [09:15:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:19:33] I idly wonder if that alert needs tweaking... 
[09:23:47] !log INFO: About to transfer /srv/backups/snapshots/latest/snapshot.s3.2024-02-05--04-31-35.tar.gz from dbprov1001.eqiad.wmnet to ['db1240.eqiad.wmnet']:['/srv/sqldata.s3'] (403339334478 bytes) [09:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P56188 and previous config saved to /var/cache/conftool/dbconfig/20240205-092800-marostegui.json [09:38:56] !log restarted db2097 [09:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:57] !log installing runc security updates on gitlab runners [09:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T355609)', diff saved to https://phabricator.wikimedia.org/P56189 and previous config saved to /var/cache/conftool/dbconfig/20240205-094306-marostegui.json [09:43:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2157.codfw.wmnet with reason: Maintenance [09:43:11] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [09:43:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2157.codfw.wmnet with reason: Maintenance [09:43:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T355609)', diff saved to https://phabricator.wikimedia.org/P56190 and previous config saved to /var/cache/conftool/dbconfig/20240205-094329-marostegui.json [09:45:50] (CR) Jelto: [C: +1] "lgtm. 
However the docs for dockerfile-upstream state:" [puppet] - https://gerrit.wikimedia.org/r/995343 (https://phabricator.wikimedia.org/T356418) (owner: Ahmon Dancy) [09:50:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T355609)', diff saved to https://phabricator.wikimedia.org/P56191 and previous config saved to /var/cache/conftool/dbconfig/20240205-095005-marostegui.json [09:50:13] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [09:54:23] !log jelto@cumin1002 START - Cookbook sre.gitlab.reboot-runner rolling reboot on A:gitlab-runner [10:05:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P56192 and previous config saved to /var/cache/conftool/dbconfig/20240205-100511-marostegui.json [10:07:00] (PS2) Slyngshede: SystemdUnitFailed: Increase the severity of a failed unit to critical. [alerts] - https://gerrit.wikimedia.org/r/997253 [10:16:00] Emperor: It needs some tweaking, because it's got the same thresholds as the other appservers alert but is only called internally. It's still in the critical path for a few user-facing things though. It needs more information, because it just says high, but doesn't say how much. 
It needs a lot of things x) [10:16:31] (only called internally = the deployment is) [10:16:48] it also tends to start alerting just when the latency actually starts to go down [10:18:46] Something happened at 8 this morning though, it instantly jumped half a second on p99, and p75 is staying way higher than I'd like [10:20:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P56193 and previous config saved to /var/cache/conftool/dbconfig/20240205-102018-marostegui.json [10:26:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:28:40] I think we need to make that alert rps aware too, because p50 on 200rps is way too easily influenced by a bunch of slow-ish requests [10:31:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:32:49] !log installing runc security updates on DSE cluster [10:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T355609)', diff saved to https://phabricator.wikimedia.org/P56194 and previous config saved to /var/cache/conftool/dbconfig/20240205-103525-marostegui.json [10:35:27] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 
6:00:00 on db2171.codfw.wmnet with reason: Maintenance [10:35:29] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [10:35:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2171.codfw.wmnet with reason: Maintenance [10:35:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2171:3315 (T355609)', diff saved to https://phabricator.wikimedia.org/P56195 and previous config saved to /var/cache/conftool/dbconfig/20240205-103547-marostegui.json [10:38:54] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.reboot-runner (exit_code=0) rolling reboot on A:gitlab-runner [10:42:01] !log installing runc security updates on releases hosts [10:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T355609)', diff saved to https://phabricator.wikimedia.org/P56196 and previous config saved to /var/cache/conftool/dbconfig/20240205-104230-marostegui.json [10:42:34] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [10:45:21] (PS1) Arnaudb: mariadb: pooling db1244 db1246, prepare db1235 [puppet] - https://gerrit.wikimedia.org/r/995365 (https://phabricator.wikimedia.org/T343674) [10:47:09] (CR) Marostegui: [C: -1] "db1235 needs to be removed from insetup" [puppet] - https://gerrit.wikimedia.org/r/995365 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb) [10:47:47] (PS2) Arnaudb: mariadb: pooling db1244 db1246, prepare db1235 [puppet] - https://gerrit.wikimedia.org/r/995365 (https://phabricator.wikimedia.org/T343674) [10:48:02] (CR) Arnaudb: "good catch!" [puppet] - https://gerrit.wikimedia.org/r/995365 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb) [10:48:15] (CR) Marostegui: "The hosts that will get pooled, are they fully green on Icinga?" 
[puppet] - https://gerrit.wikimedia.org/r/995365 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb) [10:52:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:53:51] (CR) Arnaudb: "all green rn, `prometheus-mysqld-exporter.service` was showing up as failing but was not impacting service. It looks like a misconfig arte" [puppet] - https://gerrit.wikimedia.org/r/995365 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb) [10:54:12] !log update thirdparty/kubeadm-k8s-1-23 packages for T356507 in apt1001 [10:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:44] (PS1) Brouberol: Enable dse k8s workers -> an-mariadb1001:3306 traffic [puppet] - https://gerrit.wikimedia.org/r/997337 (https://phabricator.wikimedia.org/T356623) [10:54:48] (CR) Marostegui: [C: +1] mariadb: pooling db1244 db1246, prepare db1235 [puppet] - https://gerrit.wikimedia.org/r/995365 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb) [10:55:34] (PS2) Brouberol: Enable dse k8s workers -> an-mariadb1001:3306 traffic [puppet] - https://gerrit.wikimedia.org/r/997337 (https://phabricator.wikimedia.org/T356623) [10:56:08] (CR) Brouberol: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/997337 (https://phabricator.wikimedia.org/T356623) (owner: Brouberol) [10:56:29] (CR) Btullis: [C: +1] "Looks good to me. Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/997337 (https://phabricator.wikimedia.org/T356623) (owner: 10Brouberol) [10:57:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:57:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P56197 and previous config saved to /var/cache/conftool/dbconfig/20240205-105736-marostegui.json [10:57:47] (03CR) 10Brouberol: [C: 03+2] Enable dse k8s workers -> an-mariadb1001:3306 traffic [puppet] - 10https://gerrit.wikimedia.org/r/997337 (https://phabricator.wikimedia.org/T356623) (owner: 10Brouberol) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240205T1100) [11:00:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:05:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:07:15] (MediaWikiLatencyExceeded) firing: Average latency 
high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:12:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:12:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P56198 and previous config saved to /var/cache/conftool/dbconfig/20240205-111243-marostegui.json [11:18:58] (KubernetesAPINotScrapable) firing: (4) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [11:19:29] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [11:20:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:25:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:27:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T355609)', diff saved to https://phabricator.wikimedia.org/P56199 and previous config saved to /var/cache/conftool/dbconfig/20240205-112750-marostegui.json [11:27:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2178.codfw.wmnet with reason: Maintenance [11:27:55] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [11:28:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2178.codfw.wmnet with reason: Maintenance [11:28:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T355609)', diff saved to https://phabricator.wikimedia.org/P56200 and previous config saved to /var/cache/conftool/dbconfig/20240205-112812-marostegui.json [11:28:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10Volans) @wiki_willy @Jclark-ctr @RobH As I see that Rob is out this week, to unblock the rest of SREs with any DNS-related change in Netbox I'm runn... 
[11:30:58] !log volans@cumin1002 START - Cookbook sre.dns.netbox [11:31:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:33:07] !log volans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Unracked cloudcephosd10[35-39,40] - volans@cumin1002" [11:33:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T355609)', diff saved to https://phabricator.wikimedia.org/P56201 and previous config saved to /var/cache/conftool/dbconfig/20240205-113330-marostegui.json [11:33:35] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [11:33:57] !log volans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Unracked cloudcephosd10[35-39,40] - volans@cumin1002" [11:33:57] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:34:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10Volans) Done. 
FYI `cloudcephosd1040` had a wrong WMF asset tag `wmf108805`, I guess it was supposed to be `wmf10805` [11:36:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:41:08] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: thirdparty/kubeadm-k8s-1-23: limit docker version [puppet] - 10https://gerrit.wikimedia.org/r/997345 (https://phabricator.wikimedia.org/T356629) [11:41:39] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:42:22] (03PS1) 10Clément Goubert: mw-on-k8s: Tweak rps threshold for latency alert [alerts] - 10https://gerrit.wikimedia.org/r/997346 [11:43:09] (03CR) 10David Caro: [C: 03+1] aptrepo: thirdparty/kubeadm-k8s-1-23: limit docker version [puppet] - 10https://gerrit.wikimedia.org/r/997345 (https://phabricator.wikimedia.org/T356629) (owner: 10Arturo Borrero Gonzalez) [11:44:00] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: thirdparty/kubeadm-k8s-1-23: limit docker version [puppet] - 10https://gerrit.wikimedia.org/r/997345 (https://phabricator.wikimedia.org/T356629) (owner: 10Arturo Borrero Gonzalez) [11:46:04] (03CR) 10FNegri: "Maybe we could stick with v23 and prevent the upgrade to v24, just to minimize the changes?" 
[puppet] - 10https://gerrit.wikimedia.org/r/997345 (https://phabricator.wikimedia.org/T356629) (owner: 10Arturo Borrero Gonzalez) [11:48:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P56203 and previous config saved to /var/cache/conftool/dbconfig/20240205-114837-marostegui.json [11:48:51] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: thirdparty/kubeadm-k8s-1-23: limit docker version even further [puppet] - 10https://gerrit.wikimedia.org/r/997353 (https://phabricator.wikimedia.org/T356629) [11:49:38] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "yes, sent https://gerrit.wikimedia.org/r/997353" [puppet] - 10https://gerrit.wikimedia.org/r/997345 (https://phabricator.wikimedia.org/T356629) (owner: 10Arturo Borrero Gonzalez) [11:50:50] (03CR) 10David Caro: [C: 03+1] aptrepo: thirdparty/kubeadm-k8s-1-23: limit docker version even further [puppet] - 10https://gerrit.wikimedia.org/r/997353 (https://phabricator.wikimedia.org/T356629) (owner: 10Arturo Borrero Gonzalez) [11:50:54] (03CR) 10FNegri: "Hmm it looks like v23 is no longer supported and they did not release a v23 patch with the updated runc dependency. 
So we have to go with " [puppet] - 10https://gerrit.wikimedia.org/r/997345 (https://phabricator.wikimedia.org/T356629) (owner: 10Arturo Borrero Gonzalez) [11:53:59] (03CR) 10FNegri: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/997353 (https://phabricator.wikimedia.org/T356629) (owner: 10Arturo Borrero Gonzalez) [11:54:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: thirdparty/kubeadm-k8s-1-23: limit docker version even further [puppet] - 10https://gerrit.wikimedia.org/r/997353 (https://phabricator.wikimedia.org/T356629) (owner: 10Arturo Borrero Gonzalez) [11:58:56] !log downgrade docker to v23 in thirdparty/kubeadm-k8s-1-23 for T356507 and T356629 in apt1001 [11:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:00] T356629: kubelet: cannot work with docker >= v25 - https://phabricator.wikimedia.org/T356629 [12:03:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P56204 and previous config saved to /var/cache/conftool/dbconfig/20240205-120343-marostegui.json [12:15:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:18:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T355609)', diff saved to https://phabricator.wikimedia.org/P56205 and previous config saved to /var/cache/conftool/dbconfig/20240205-121850-marostegui.json [12:18:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2192.codfw.wmnet with reason: Maintenance [12:18:54] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 
[12:19:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2192.codfw.wmnet with reason: Maintenance [12:19:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T355609)', diff saved to https://phabricator.wikimedia.org/P56206 and previous config saved to /var/cache/conftool/dbconfig/20240205-121911-marostegui.json [12:20:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:20:50] !log jnuche@deploy2002 Installing scap version "4.65.3" for 505 hosts [12:21:48] !log jnuche@deploy2002 Installing scap version "4.65.3" for 505 hosts [12:22:49] !log jnuche@deploy2002 Installation of scap version "4.65.3" completed for 505 hosts [12:23:46] !log installing runc security updates on aux-k8s cluster [12:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:24:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T355609)', diff saved to https://phabricator.wikimedia.org/P56207 and previous config saved to /var/cache/conftool/dbconfig/20240205-122416-marostegui.json [12:24:21] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [12:29:15] (MediaWikiLatencyExceeded) resolved: 
Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:31:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp1002.wikimedia.org [12:36:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp1002.wikimedia.org [12:37:22] !log btullis@cumin1002 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. [12:39:18] (03CR) 10Arnaudb: [C: 03+2] mariadb: pooling db1244 db1246, prepare db1235 [puppet] - 10https://gerrit.wikimedia.org/r/995365 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [12:39:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P56208 and previous config saved to /var/cache/conftool/dbconfig/20240205-123923-marostegui.json [12:39:32] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Connection errors to some hosts from cumin1002 - https://phabricator.wikimedia.org/T356174 (10MoritzMuehlenhoff) >>! In T356174#9502325, @MoritzMuehlenhoff wrote: > To doublecheck and to be sure there are no other lingering issues, I'll: > * Check all... 
[12:40:26] !log installing bind9 security updates (client-side libs/tools only) [12:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1244:3315 (re)pooling @ 20%: 10', diff saved to https://phabricator.wikimedia.org/P56209 and previous config saved to /var/cache/conftool/dbconfig/20240205-124807-arnaudb.json [12:48:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246:3314 (re)pooling @ 20%: 10', diff saved to https://phabricator.wikimedia.org/P56210 and previous config saved to /var/cache/conftool/dbconfig/20240205-124817-arnaudb.json [12:52:47] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: provisionning db1235.eqiad.wmnet - T344036 [12:52:52] T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036 [12:53:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: provisionning db1235.eqiad.wmnet - T344036 [12:53:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1235.eqiad.wmnet with reason: provisionning db1235.eqiad.wmnet - T344036 [12:53:41] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1235.eqiad.wmnet with reason: provisionning db1235.eqiad.wmnet - T344036 [12:54:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db1135 in db1235 for T344036', diff saved to https://phabricator.wikimedia.org/P56211 and previous config saved to /var/cache/conftool/dbconfig/20240205-125444-arnaudb.json [12:54:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P56212 and previous config saved to /var/cache/conftool/dbconfig/20240205-125450-marostegui.json [13:01:24] !log installing perf updates on bookworm hosts [13:01:28] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:33] (03CR) 10Clément Goubert: P:httpbb: migrate tests from cumin1001 to cumin1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/995108 (https://phabricator.wikimedia.org/T356054) (owner: 10Scott French) [13:03:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1244:3315 (re)pooling @ 30%: 10', diff saved to https://phabricator.wikimedia.org/P56213 and previous config saved to /var/cache/conftool/dbconfig/20240205-130312-arnaudb.json [13:03:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246:3314 (re)pooling @ 30%: 10', diff saved to https://phabricator.wikimedia.org/P56214 and previous config saved to /var/cache/conftool/dbconfig/20240205-130323-arnaudb.json [13:03:43] (03PS1) 10Ladsgroup: Switch the pagelinks default to add read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/997420 (https://phabricator.wikimedia.org/T351237) [13:03:45] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "I'm not 100% sure this is the right way in absolute but let's keep tweaking the alerts if needed at a later time." 
[alerts] - 10https://gerrit.wikimedia.org/r/997346 (owner: 10Clément Goubert) [13:06:51] (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Tweak rps threshold for latency alert [alerts] - 10https://gerrit.wikimedia.org/r/997346 (owner: 10Clément Goubert) [13:07:52] (03Merged) 10jenkins-bot: mw-on-k8s: Tweak rps threshold for latency alert [alerts] - 10https://gerrit.wikimedia.org/r/997346 (owner: 10Clément Goubert) [13:08:05] (03Abandoned) 10Slyngshede: P:url_downloader decommission Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/988481 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:09:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T355609)', diff saved to https://phabricator.wikimedia.org/P56215 and previous config saved to /var/cache/conftool/dbconfig/20240205-130956-marostegui.json [13:10:01] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [13:12:26] (03CR) 10Volans: [C: 04-1] "reply inline (the vote is just to not remove the existing one)" [puppet] - 10https://gerrit.wikimedia.org/r/995108 (https://phabricator.wikimedia.org/T356054) (owner: 10Scott French) [13:16:19] (03PS1) 10Marostegui: eventlogging_*: Remove from repo [software] - 10https://gerrit.wikimedia.org/r/997422 [13:16:37] (03PS1) 10Brouberol: Set large pod CPU/memory limits for the superset namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/997423 (https://phabricator.wikimedia.org/T352166) [13:16:40] (03CR) 10Jforrester: "Our coding standards are pretty clear that vertical alignment with spaces should never be used, but it is moderately common here. Meh." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/996404 (owner: 10Zabe) [13:18:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1244:3315 (re)pooling @ 40%: 10', diff saved to https://phabricator.wikimedia.org/P56216 and previous config saved to /var/cache/conftool/dbconfig/20240205-131817-arnaudb.json [13:18:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246:3314 (re)pooling @ 40%: 10', diff saved to https://phabricator.wikimedia.org/P56217 and previous config saved to /var/cache/conftool/dbconfig/20240205-131828-arnaudb.json [13:19:25] (03CR) 10CI reject: [V: 04-1] Set large pod CPU/memory limits for the superset namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/997423 (https://phabricator.wikimedia.org/T352166) (owner: 10Brouberol) [13:21:22] (03CR) 10Marostegui: [C: 03+2] eventlogging_*: Remove from repo [software] - 10https://gerrit.wikimedia.org/r/997422 (owner: 10Marostegui) [13:21:57] (03Merged) 10jenkins-bot: eventlogging_*: Remove from repo [software] - 10https://gerrit.wikimedia.org/r/997422 (owner: 10Marostegui) [13:26:27] !log btullis@cumin1002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. [13:33:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1244:3315 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P56218 and previous config saved to /var/cache/conftool/dbconfig/20240205-133322-arnaudb.json [13:33:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246:3314 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P56219 and previous config saved to /var/cache/conftool/dbconfig/20240205-133333-arnaudb.json [13:35:06] !log btullis@cumin1002 START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid jvm daemons. 
[13:40:00] (03PS2) 10Brouberol: Set large pod CPU/memory limits for the superset namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/997423 (https://phabricator.wikimedia.org/T352166) [13:42:54] 10SRE-tools, 10Infrastructure-Foundations: Reimage debmonitor2003 - https://phabricator.wikimedia.org/T356638 (10SLyngshede-WMF) [13:45:15] !log slyngshede@cumin1002 START - Cookbook sre.hosts.reimage for host debmonitor2003.codfw.wmnet with OS bookworm [13:45:18] 10SRE-tools, 10Infrastructure-Foundations: Reimage debmonitor2003 - https://phabricator.wikimedia.org/T356638 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1002 for host debmonitor2003.codfw.wmnet with OS bookworm [13:45:28] 10SRE-tools, 10Infrastructure-Foundations: Reimage debmonitor2003 - https://phabricator.wikimedia.org/T356638 (10SLyngshede-WMF) 05Open→03In progress p:05Triage→03Low [13:46:07] (03CR) 10Btullis: [C: 03+1] "Looks good. We will be running a memcached pod with some GBs of RAM for superset-production in this manespace too." [deployment-charts] - 10https://gerrit.wikimedia.org/r/997423 (https://phabricator.wikimedia.org/T352166) (owner: 10Brouberol) [13:47:58] (03CR) 10Brouberol: [C: 03+2] Set large pod CPU/memory limits for the superset namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/997423 (https://phabricator.wikimedia.org/T352166) (owner: 10Brouberol) [13:48:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1244:3315 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P56221 and previous config saved to /var/cache/conftool/dbconfig/20240205-134827-arnaudb.json [13:48:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246:3314 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P56222 and previous config saved to /var/cache/conftool/dbconfig/20240205-134838-arnaudb.json [13:50:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[13:50:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:57:08] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice, ship it!" [alerts] - 10https://gerrit.wikimedia.org/r/997253 (owner: 10Slyngshede) [13:59:18] (03CR) 10Volans: [C: 04-1] "Message formatting issue, see inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/956082 (https://phabricator.wikimedia.org/T345778) (owner: 10Bking) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240205T1400). [14:00:05] No Gerrit patches in the queue for this window AFAICS. [14:01:48] (03Abandoned) 10Volans: varnish upload: throttle very large image [puppet] - 10https://gerrit.wikimedia.org/r/957893 (owner: 10Volans) [14:02:48] !log slyngshede@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on debmonitor2003.codfw.wmnet with reason: host reimage [14:03:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1244:3315 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P56223 and previous config saved to /var/cache/conftool/dbconfig/20240205-140332-arnaudb.json [14:03:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246:3314 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P56224 and previous config saved to /var/cache/conftool/dbconfig/20240205-140343-arnaudb.json [14:05:42] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on debmonitor2003.codfw.wmnet with reason: host reimage [14:06:11] (03Abandoned) 10Volans: sre.mysql.upgrade: various improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/754872 (https://phabricator.wikimedia.org/T239814) (owner: 10Volans) [14:07:40] (KubernetesRsyslogDown) firing: rsyslog on kubestage1003:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage1003 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:12:54] (03PS1) 10Muehlenhoff: Remove access for mhoutti [puppet] - 10https://gerrit.wikimedia.org/r/997433 [14:14:10] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Mo Houtti out of all services on: 2205 hosts [14:14:55] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Mo Houtti out of all services on: 2205 hosts [14:16:09] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db1135.eqiad.wmnet onto db1235.eqiad.wmnet [14:17:01] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for mhoutti [puppet] - 10https://gerrit.wikimedia.org/r/997433 (owner: 10Muehlenhoff) [14:18:18] !log hashar@deploy2002 Started deploy [integration/docroot@8e28943]: Update phpunit and npm dependencies (noop for prod) [14:18:24] !log hashar@deploy2002 Finished deploy [integration/docroot@8e28943]: Update phpunit and npm dependencies (noop for prod) (duration: 00m 06s) [14:20:13] 10SRE, 10ops-eqiad: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T354499 (10Eevans) >>! In T354499#9512258, @MoritzMuehlenhoff wrote: > @Jclark-ctr Since the server is OOW; do you have a spare disk from a decommed server? Actually, we've gone down that route (twice?) already. My thought was t... [14:26:29] !log btullis@cumin1002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid public cluster: Roll restart of Druid jvm daemons. 
[14:27:34] (03CR) 10Bking: "The experimental failure is expected, as we're removing the current hostname from site.pp" [puppet] - 10https://gerrit.wikimedia.org/r/995223 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [14:27:40] (KubernetesRsyslogDown) resolved: rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage1003 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:28:08] jouncebot: now and next [14:28:08] For the next 0 hour(s) and 31 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240205T1400) [14:28:33] going to bounce prometheus@k8s in eqiad real quick [14:28:57] !log bounce prometheus@k8s and @k8s-aux in eqiad - T343529 [14:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:01] T343529: Prometheus doesn't reload or alert on expired client certificates - https://phabricator.wikimedia.org/T343529 [14:30:05] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1275/co" [puppet] - 10https://gerrit.wikimedia.org/r/976735 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [14:32:43] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+1] P:etcd: generate wiki replica pool accounts [puppet] - 10https://gerrit.wikimedia.org/r/976735 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [14:33:41] (KubernetesAPINotScrapable) firing: (4) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [14:34:14] (03CR) 10Giuseppe Lavagetto: "* Why not bookworm?" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/987440 (owner: 10Clément Goubert) [14:35:47] (03CR) 10Hashar: [C: 03+2] Add rename-project plugin [software/gerrit] (wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/995035 (https://phabricator.wikimedia.org/T201953) (owner: 10Hashar) [14:36:27] (03CR) 10Clément Goubert: "No specific reason for not bookworm despite me just bumping one version and not thinking about going two up. We can definitely try a newer" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/987440 (owner: 10Clément Goubert) [14:37:52] (03PS1) 10Hashar: Add rename-project plugin [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/997437 (https://phabricator.wikimedia.org/T201953) [14:39:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:32] (03CR) 10Hashar: [C: 03+2] Add rename-project plugin [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/997437 (https://phabricator.wikimedia.org/T201953) (owner: 10Hashar) [14:42:25] (03Merged) 10jenkins-bot: Add rename-project plugin [software/gerrit] (wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/995035 (https://phabricator.wikimedia.org/T201953) (owner: 10Hashar) [14:42:31] (03Merged) 10jenkins-bot: Add rename-project plugin [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/997437 (https://phabricator.wikimedia.org/T201953) (owner: 10Hashar) [14:44:38] !log hashar@deploy2002 Started deploy [gerrit/gerrit@79dc8f5]: Add rename-project plugin - T201953 [14:44:42] T201953: Install rename-project plugin - https://phabricator.wikimedia.org/T201953 [14:44:45] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@79dc8f5]: Add rename-project plugin - T201953 
(duration: 00m 07s) [14:50:18] (03PS8) 10Bking: sre.hosts.reimage: Suggest install-console for troubleshooting [cookbooks] - 10https://gerrit.wikimedia.org/r/956082 (https://phabricator.wikimedia.org/T345778) [14:54:27] (03CR) 10CI reject: [V: 04-1] sre.hosts.reimage: Suggest install-console for troubleshooting [cookbooks] - 10https://gerrit.wikimedia.org/r/956082 (https://phabricator.wikimedia.org/T345778) (owner: 10Bking) [14:54:29] (03CR) 10Bking: sre.hosts.reimage: Suggest install-console for troubleshooting (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/956082 (https://phabricator.wikimedia.org/T345778) (owner: 10Bking) [14:56:10] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host debmonitor2003.codfw.wmnet with OS bookworm [14:56:46] 10SRE-tools, 10Infrastructure-Foundations: Reimage debmonitor2003 - https://phabricator.wikimedia.org/T356638 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1002 for host debmonitor2003.codfw.wmnet with OS bookworm completed: - debmonitor2003 (**WARN**) - Downtimed on... 
[14:59:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:32] (03PS5) 10Clément Goubert: prometheus-php-fpm-exporter [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/987440 [15:01:52] 10SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10mpopov) [15:02:08] (03PS9) 10Bking: sre.hosts.reimage: Suggest install-console for troubleshooting [cookbooks] - 10https://gerrit.wikimedia.org/r/956082 (https://phabricator.wikimedia.org/T345778) [15:03:13] (03PS6) 10Clément Goubert: prometheus-php-fpm-exporter [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/987440 [15:03:47] (03PS5) 10Clément Goubert: prometheus-apache-exporter: Update to bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/987443 (https://phabricator.wikimedia.org/T283861) [15:04:26] !log slyngshede@cumin1002 START - Cookbook sre.hosts.remove-downtime for debmonitor2003.codfw.wmnet [15:04:27] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for debmonitor2003.codfw.wmnet [15:05:47] (03CR) 10Majavah: [C: 03+2] P:etcd: generate wiki replica pool accounts [puppet] - 10https://gerrit.wikimedia.org/r/976735 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [15:14:46] 10SRE-tools, 10Infrastructure-Foundations: Reimage debmonitor2003 - https://phabricator.wikimedia.org/T356638 (10SLyngshede-WMF) envoyproxy yet again failed to build its configuration file. Manually ran generation script and removed downtime. 
[15:14:54] 10SRE-tools, 10Infrastructure-Foundations: Reimage debmonitor2003 - https://phabricator.wikimedia.org/T356638 (10SLyngshede-WMF) 05In progress→03Resolved [15:15:26] (03PS1) 10Kamila Součková: site.pp: clean up mw*, no functional change [puppet] - 10https://gerrit.wikimedia.org/r/997450 [15:18:41] (KubernetesAPINotScrapable) resolved: (2) k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [15:18:59] (03CR) 10Clément Goubert: [C: 03+1] site.pp: clean up mw*, no functional change [puppet] - 10https://gerrit.wikimedia.org/r/997450 (owner: 10Kamila Součková) [15:19:41] (03CR) 10Kamila Součková: [C: 03+2] site.pp: clean up mw*, no functional change [puppet] - 10https://gerrit.wikimedia.org/r/997450 (owner: 10Kamila Součková) [15:28:38] (03PS7) 10Ssingh: dns: Don't disable puppet on restarting wdns [cookbooks] - 10https://gerrit.wikimedia.org/r/991637 (https://phabricator.wikimedia.org/T353779) (owner: 10BCornwall) [15:29:45] (03CR) 10Alexandros Kosiaris: [C: 03+1] prometheus-apache-exporter: Update to bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/987443 (https://phabricator.wikimedia.org/T283861) (owner: 10Clément Goubert) [15:30:05] (03CR) 10Ssingh: dns: Don't disable puppet on restarting wdns (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/991637 (https://phabricator.wikimedia.org/T353779) (owner: 10BCornwall) [15:31:05] (03CR) 10Clément Goubert: [V: 03+1] prometheus-apache-exporter: Update to bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/987443 (https://phabricator.wikimedia.org/T283861) (owner: 10Clément Goubert) [15:32:07] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] prometheus-apache-exporter: Update to bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/987443 (https://phabricator.wikimedia.org/T283861) (owner: 10Clément Goubert) 
[15:32:09] (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] prometheus-apache-exporter: Update to bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/987443 (https://phabricator.wikimedia.org/T283861) (owner: 10Clément Goubert) [15:32:29] (03CR) 10CI reject: [V: 04-1] dns: Don't disable puppet on restarting wdns [cookbooks] - 10https://gerrit.wikimedia.org/r/991637 (https://phabricator.wikimedia.org/T353779) (owner: 10BCornwall) [15:35:45] !log Building production images for 987443 - T283861 [15:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:49] T283861: prometheus-apache-exporter in buster does not support -log.format json - https://phabricator.wikimedia.org/T283861 [15:38:38] (03PS1) 10Slyngshede: CI Fix broken tests. [software/bitu] - 10https://gerrit.wikimedia.org/r/997468 (https://phabricator.wikimedia.org/T355172) [15:38:58] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: elasticsearch::cloudelastic [15:41:23] (03PS1) 10Muehlenhoff: Switch cloudelastic to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/997469 (https://phabricator.wikimedia.org/T349619) [15:42:44] (03CR) 10Muehlenhoff: [C: 03+2] Switch cloudelastic to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/997469 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:42:52] (03CR) 10JMeybohm: [C: 04-1] "Needs a Chart version bump" [deployment-charts] - 10https://gerrit.wikimedia.org/r/992740 (https://phabricator.wikimedia.org/T355167) (owner: 10Alexandros Kosiaris) [15:43:08] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1238 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/997487 (https://phabricator.wikimedia.org/T356649) [15:43:14] (03PS1) 10Kamila Součková: Move mw api servers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/997470 [15:43:32] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2179 to s4 master [puppet] - 
10https://gerrit.wikimedia.org/r/997488 (https://phabricator.wikimedia.org/T356650) [15:43:36] (03PS1) 10Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/997489 (https://phabricator.wikimedia.org/T356650) [15:44:09] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.6 - https://phabricator.wikimedia.org/T316421 (10Jelto) Thanks @akosiaris for the permissions and the additional context. I'm struggling a bit re-building the new etherpad-lite Debian package with the [docs](https:... [15:45:18] (03CR) 10Kamila Součková: "sorry, this one's a bit messy '^^" [puppet] - 10https://gerrit.wikimedia.org/r/997470 (owner: 10Kamila Součková) [15:47:58] 10SRE, 10ops-eqiad: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T354499 (10Jclark-ctr) @Eevans Would you mind if swapped it again possibly 3rd times the charm. all bays in server are filled. possibly next if it fails again would be backplane swap [15:49:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: elasticsearch::cloudelastic [15:50:29] (03PS187) 10Arnaudb: mariadb: cookbook draft to clone multiinstance [cookbooks] - 10https://gerrit.wikimedia.org/r/976709 (https://phabricator.wikimedia.org/T343674) [15:51:02] 10SRE, 10ops-eqiad: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T354499 (10Eevans) >>! In T354499#9513521, @Jclark-ctr wrote: > @Eevans Would you mind if swapped it again possibly 3rd times the charm. all bays in server are filled. possibly next if it fails again would be backplane swap Ok... 
[15:52:05] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [15:52:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246:3312 (re)pooling @ 20%: 10', diff saved to https://phabricator.wikimedia.org/P56228 and previous config saved to /var/cache/conftool/dbconfig/20240205-155210-arnaudb.json [15:52:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1244:3314 (re)pooling @ 20%: 10', diff saved to https://phabricator.wikimedia.org/P56229 and previous config saved to /var/cache/conftool/dbconfig/20240205-155212-arnaudb.json [15:52:53] (03CR) 10JMeybohm: [C: 03+1] helm-state-metrics: Declare the healthcheck port [deployment-charts] - 10https://gerrit.wikimedia.org/r/992731 (https://phabricator.wikimedia.org/T355167) (owner: 10Alexandros Kosiaris) [15:53:03] 10SRE, 10ops-eqiad: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T354499 (10Jclark-ctr) performed drive swap. 
blew out slot with compressed air If fails again we would need to look at possibly backplane swap [15:55:23] (03PS1) 10Muehlenhoff: Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/997480 (https://phabricator.wikimedia.org/T354959) [15:58:29] (03PS3) 10Clare Ming: Update Android Metrics Platform stream configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992541 (https://phabricator.wikimedia.org/T355360) [16:00:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2101.codfw.wmnet with reason: Maintenance [16:00:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2101.codfw.wmnet with reason: Maintenance [16:00:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2111.codfw.wmnet with reason: Maintenance [16:00:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2111.codfw.wmnet with reason: Maintenance [16:00:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2111 (T355609)', diff saved to https://phabricator.wikimedia.org/P56230 and previous config saved to /var/cache/conftool/dbconfig/20240205-160055-marostegui.json [16:01:26] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [16:02:01] (03PS1) 10Muehlenhoff: Remove now obsolete scap config for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/997483 (https://phabricator.wikimedia.org/T241049) [16:02:18] (03PS1) 10Slyngshede: Make it clear what password is being reset [software/bitu] - 10https://gerrit.wikimedia.org/r/997484 (https://phabricator.wikimedia.org/T356409) [16:02:30] (03PS2) 10Muehlenhoff: Remove now obsolete scap config for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/997483 (https://phabricator.wikimedia.org/T241049) [16:02:44] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts debmonitor2002.codfw.wmnet [16:03:16] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T355609)', diff saved to https://phabricator.wikimedia.org/P56231 and previous config saved to /var/cache/conftool/dbconfig/20240205-160316-marostegui.json [16:04:57] (03PS1) 10Arnaudb: mariadb: will test converting instances [puppet] - 10https://gerrit.wikimedia.org/r/997490 (https://phabricator.wikimedia.org/T343674) [16:06:30] 10SRE, 10ops-eqiad: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T354499 (10Eevans) The device has been added to md2 (it is currently rebuilding). [16:07:07] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [16:07:14] (03CR) 10CI reject: [V: 04-1] mariadb: will test converting instances [puppet] - 10https://gerrit.wikimedia.org/r/997490 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [16:07:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246:3312 (re)pooling @ 30%: 10', diff saved to https://phabricator.wikimedia.org/P56232 and previous config saved to /var/cache/conftool/dbconfig/20240205-160715-arnaudb.json [16:07:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1244:3314 (re)pooling @ 30%: 10', diff saved to https://phabricator.wikimedia.org/P56233 and previous config saved to /var/cache/conftool/dbconfig/20240205-160717-arnaudb.json [16:08:41] (03PS188) 10Arnaudb: mariadb: cookbook draft to clone multiinstance [cookbooks] - 10https://gerrit.wikimedia.org/r/976709 (https://phabricator.wikimedia.org/T343674) [16:09:08] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: debmonitor2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [16:10:02] (03CR) 10Ahmon Dancy: "That's what the stock buildkit Dockerfile uses (xref ). 
We use that Dockerfil" [puppet] - 10https://gerrit.wikimedia.org/r/995343 (https://phabricator.wikimedia.org/T356418) (owner: 10Ahmon Dancy) [16:10:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: debmonitor2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [16:10:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:10:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts debmonitor2002.codfw.wmnet [16:11:37] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts debmonitor1002.eqiad.wmnet [16:12:23] (03CR) 10Marostegui: [C: 04-1] mariadb: will test converting instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/997490 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [16:16:24] (03CR) 10Muehlenhoff: "Looks good, one question inline" [puppet] - 10https://gerrit.wikimedia.org/r/995019 (https://phabricator.wikimedia.org/T355612) (owner: 10Arnaudb) [16:17:50] (03CR) 10CI reject: [V: 04-1] mariadb: cookbook draft to clone multiinstance [cookbooks] - 10https://gerrit.wikimedia.org/r/976709 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [16:18:13] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [16:18:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P56234 and previous config saved to /var/cache/conftool/dbconfig/20240205-161822-marostegui.json [16:19:48] RECOVERY - BFD status on cr2-eqiad is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:20:59] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: debmonitor1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one 
- jmm@cumin2002" [16:21:54] RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:22:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246:3312 (re)pooling @ 40%: 10', diff saved to https://phabricator.wikimedia.org/P56235 and previous config saved to /var/cache/conftool/dbconfig/20240205-162220-arnaudb.json [16:22:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1244:3314 (re)pooling @ 40%: 10', diff saved to https://phabricator.wikimedia.org/P56236 and previous config saved to /var/cache/conftool/dbconfig/20240205-162222-arnaudb.json [16:23:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: debmonitor1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [16:23:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:23:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts debmonitor1002.eqiad.wmnet [16:25:12] (03PS1) 10Slyngshede: Password reset: Allow signed in users to navigate. [software/bitu] - 10https://gerrit.wikimedia.org/r/997506 (https://phabricator.wikimedia.org/T355907) [16:25:23] (03CR) 10Ssingh: "volans: I know why this is failing but I am not sure I understand the abstractions enough to fix this quickly. Would you mind giving some " [cookbooks] - 10https://gerrit.wikimedia.org/r/991637 (https://phabricator.wikimedia.org/T353779) (owner: 10BCornwall) [16:28:49] (03CR) 10Ssingh: [C: 03+2] admin: add Chris Dobbins to data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/994791 (owner: 10Ssingh) [16:29:37] (03PS2) 10Ssingh: admin: add Chris Dobbins to data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/994791 [16:30:05] jan_drewniak: That opportune time for a Wikimedia Portals Update deploy is upon us again. Don't be afraid. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240205T1630). [16:32:52] (03CR) 10Jelto: [C: 03+2] Temporarily enable Dockerfile frontend on trusted runners (part 2) [puppet] - 10https://gerrit.wikimedia.org/r/995343 (https://phabricator.wikimedia.org/T356418) (owner: 10Ahmon Dancy) [16:32:53] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:33:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P56237 and previous config saved to /var/cache/conftool/dbconfig/20240205-163329-marostegui.json [16:37:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246:3312 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P56238 and previous config saved to /var/cache/conftool/dbconfig/20240205-163725-arnaudb.json [16:37:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1244:3314 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P56239 and previous config saved to /var/cache/conftool/dbconfig/20240205-163727-arnaudb.json [16:37:53] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:39:34] (03PS3) 10Zoranzoki21: throttle.php: Add throttle rule for editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/997274 (https://phabricator.wikimedia.org/T356654) [16:39:43] (03CR) 10Ssingh: [C: 03+2] admin: add Chris Dobbins to data.yaml [puppet] - 
10https://gerrit.wikimedia.org/r/994791 (owner: 10Ssingh) [16:39:53] (03CR) 10Ssingh: [C: 03+2] "Rebasing, no code change" [puppet] - 10https://gerrit.wikimedia.org/r/994791 (owner: 10Ssingh) [16:42:29] !log adding cdobbins to cn=wmf and cn=ops [16:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:34] !log pruning unneeded openjdk-17-jre-headless packages on sessionstore* hosts [16:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:57] (03CR) 10Arnaudb: admin: add sbailey to deployment group and add key (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/995019 (https://phabricator.wikimedia.org/T355612) (owner: 10Arnaudb) [16:48:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T355609)', diff saved to https://phabricator.wikimedia.org/P56240 and previous config saved to /var/cache/conftool/dbconfig/20240205-164836-marostegui.json [16:48:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [16:48:40] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [16:48:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [16:48:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2123 (T355609)', diff saved to https://phabricator.wikimedia.org/P56241 and previous config saved to /var/cache/conftool/dbconfig/20240205-164859-marostegui.json [16:49:28] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2003.codfw.wmnet with OS bullseye [16:50:29] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [16:51:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T355609)', diff saved to 
https://phabricator.wikimedia.org/P56242 and previous config saved to /var/cache/conftool/dbconfig/20240205-165120-marostegui.json [16:51:29] !log pruning unneeded openjdk-17-jre-headless packages on cassandra-dev* hosts [16:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246:3312 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P56243 and previous config saved to /var/cache/conftool/dbconfig/20240205-165230-arnaudb.json [16:52:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1244:3314 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P56244 and previous config saved to /var/cache/conftool/dbconfig/20240205-165232-arnaudb.json [16:53:43] (03CR) 10Volans: "No worries, see inline. Just to clarify, you want to stop bird before the reboot is issued?" [cookbooks] - 10https://gerrit.wikimedia.org/r/991637 (https://phabricator.wikimedia.org/T353779) (owner: 10BCornwall) [16:55:23] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [16:59:26] (03CR) 10Ssingh: "Yes, before the reboot. 
Before restart as well but I had a question: given we have two daemons here, I am assuming both should be restarte" [cookbooks] - 10https://gerrit.wikimedia.org/r/991637 (https://phabricator.wikimedia.org/T353779) (owner: 10BCornwall) [17:00:45] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/995019 (https://phabricator.wikimedia.org/T355612) (owner: 10Arnaudb) [17:03:29] (03CR) 10Arnaudb: [C: 03+2] admin: add sbailey to deployment group and add key [puppet] - 10https://gerrit.wikimedia.org/r/995019 (https://phabricator.wikimedia.org/T355612) (owner: 10Arnaudb) [17:04:05] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/997483 (https://phabricator.wikimedia.org/T241049) (owner: 10Muehlenhoff) [17:05:18] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment or deploy-service group for sbailey(WMF) - https://phabricator.wikimedia.org/T355612 (10ABran-WMF) 05In progress→03Resolved p:05Triage→03Medium a:05thcipriani→03ABran-WMF [17:05:35] PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [17:06:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P56245 and previous config saved to /var/cache/conftool/dbconfig/20240205-170627-marostegui.json [17:07:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246:3312 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P56246 and previous config saved to /var/cache/conftool/dbconfig/20240205-170735-arnaudb.json [17:07:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1244:3314 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P56247 and previous config saved to 
/var/cache/conftool/dbconfig/20240205-170737-arnaudb.json [17:11:38] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db1135.eqiad.wmnet onto db1235.eqiad.wmnet [17:12:15] (03CR) 10Ssingh: "Is the recommendation for when we stop a service before calling _restart_daemons_action to downtime all those daemons as well? That seems " [cookbooks] - 10https://gerrit.wikimedia.org/r/991637 (https://phabricator.wikimedia.org/T353779) (owner: 10BCornwall) [17:14:40] (03CR) 10Alexandros Kosiaris: [C: 03+2] helm-state-metrics: Declare the healthcheck port [deployment-charts] - 10https://gerrit.wikimedia.org/r/992731 (https://phabricator.wikimedia.org/T355167) (owner: 10Alexandros Kosiaris) [17:15:02] (03PS8) 10Ssingh: dns: Don't disable puppet on restarting wdns [cookbooks] - 10https://gerrit.wikimedia.org/r/991637 (https://phabricator.wikimedia.org/T353779) (owner: 10BCornwall) [17:17:03] jouncebot: now [17:17:03] No deployments scheduled for the next 0 hour(s) and 42 minute(s) [17:17:23] I’ll run a quick namespaceDupes for T355195 then, should be harmless [17:17:24] T355195: Create a Draft namespace on English Wikiquote - https://phabricator.wikimedia.org/T355195 [17:17:54] Lucas_WMDE: you cursed yourself there ;) [17:18:12] :) [17:18:14] (03CR) 10Clément Goubert: [C: 03+1] Move mw api servers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/997470 (owner: 10Kamila Součková) [17:18:27] (03PS1) 10Ahmon Dancy: Temporarily enable Dockerfile frontend on trusted runners (part 2, rev 2) [puppet] - 10https://gerrit.wikimedia.org/r/997516 (https://phabricator.wikimedia.org/T356418) [17:18:29] (03Merged) 10jenkins-bot: helm-state-metrics: Declare the healthcheck port [deployment-charts] - 10https://gerrit.wikimedia.org/r/992731 (https://phabricator.wikimedia.org/T355167) (owner: 10Alexandros Kosiaris) [17:19:16] !log lucaswerkmeister-wmde@mwmaint2002:~$ mwscript namespaceDupes enwikiquote 
--add-prefix='Wikiquote:T355195/' --fix [17:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:30] (03CR) 10Ssingh: dns: Don't disable puppet on restarting wdns (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/991637 (https://phabricator.wikimedia.org/T353779) (owner: 10BCornwall) [17:20:34] * Lucas_WMDE done [17:21:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P56248 and previous config saved to /var/cache/conftool/dbconfig/20240205-172133-marostegui.json [17:23:39] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-image-create: remove cloud-init-finished flag if present [puppet] - 10https://gerrit.wikimedia.org/r/992677 (owner: 10Majavah) [17:25:05] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2003.codfw.wmnet with OS bullseye [17:34:31] RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [17:36:03] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-eqiad: Apply updated JVM — T356648 - eevans@cumin1002 [17:36:07] T356648: Restart Cassandra on {restbase,sessionstore,aqs} to apply Java 1.8.0_402 upgrade - https://phabricator.wikimedia.org/T356648 [17:36:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T355609)', diff saved to https://phabricator.wikimedia.org/P56249 and previous config saved to /var/cache/conftool/dbconfig/20240205-173640-marostegui.json [17:36:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2128.codfw.wmnet with reason: Maintenance [17:36:44] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [17:36:57] !log 
marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2128.codfw.wmnet with reason: Maintenance [17:36:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2186.codfw.wmnet with reason: Maintenance [17:37:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2186.codfw.wmnet with reason: Maintenance [17:37:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2128 (T355609)', diff saved to https://phabricator.wikimedia.org/P56250 and previous config saved to /var/cache/conftool/dbconfig/20240205-173707-marostegui.json [17:39:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T355609)', diff saved to https://phabricator.wikimedia.org/P56251 and previous config saved to /var/cache/conftool/dbconfig/20240205-173928-marostegui.json [17:42:27] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2003.codfw.wmnet with OS bullseye [17:48:16] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/991637 (https://phabricator.wikimedia.org/T353779) (owner: 10BCornwall) [17:49:09] (03CR) 10Ssingh: "Thanks for the review and helping understand the structure." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/991637 (https://phabricator.wikimedia.org/T353779) (owner: 10BCornwall) [17:49:59] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2003.codfw.wmnet with OS bullseye [17:50:55] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2003.codfw.wmnet with OS bullseye [17:53:28] (03CR) 10Ssingh: [C: 03+2] dns: Don't disable puppet on restarting wdns [cookbooks] - 10https://gerrit.wikimedia.org/r/991637 (https://phabricator.wikimedia.org/T353779) (owner: 10BCornwall) [17:54:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P56255 and previous config saved to /var/cache/conftool/dbconfig/20240205-175435-marostegui.json [17:59:30] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2003.codfw.wmnet with OS bullseye [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240205T1800) [18:00:05] ryankemper: Your horoscope predicts another Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240205T1800). [18:02:11] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest2003.codfw.wmnet'] [18:02:23] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['sretest2003.codfw.wmnet'] [18:03:31] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Data-Persistence, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (10Jhancock.wm) rack is physically prepped for tomorrow. 
[18:04:35] !log sukhe@cumin2002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough-drmrs and A:wikidough [18:08:40] (03PS1) 10Ssingh: sre: dns: update comments for Wikimedia DNS cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/997525 [18:09:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P56256 and previous config saved to /var/cache/conftool/dbconfig/20240205-180942-marostegui.json [18:15:12] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough-drmrs and A:wikidough [18:16:42] (03CR) 10Kamila Součková: [C: 03+2] Move mw api servers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/997470 (owner: 10Kamila Součková) [18:16:51] !log sukhe@cumin2002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:wikidough-drmrs and A:wikidough [18:17:12] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/997525 (owner: 10Ssingh) [18:17:23] (03CR) 10Ssingh: [C: 03+2] sre: dns: update comments for Wikimedia DNS cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/997525 (owner: 10Ssingh) [18:22:03] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1386.eqiad.wmnet with OS bullseye [18:22:40] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1388.eqiad.wmnet with OS bullseye [18:23:07] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1390.eqiad.wmnet with OS bullseye [18:23:32] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/997493 [18:24:32] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1392.eqiad.wmnet with OS bullseye [18:24:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T355609)', diff saved to 
https://phabricator.wikimedia.org/P56257 and previous config saved to /var/cache/conftool/dbconfig/20240205-182448-marostegui.json [18:24:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2137.codfw.wmnet with reason: Maintenance [18:24:53] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [18:25:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2137.codfw.wmnet with reason: Maintenance [18:25:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2137:3315 (T355609)', diff saved to https://phabricator.wikimedia.org/P56258 and previous config saved to /var/cache/conftool/dbconfig/20240205-182511-marostegui.json [18:25:59] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [18:27:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T355609)', diff saved to https://phabricator.wikimedia.org/P56259 and previous config saved to /var/cache/conftool/dbconfig/20240205-182732-marostegui.json [18:29:04] (03PS2) 10Scott French: P:httpbb: migrate tests from cumin1001 to cumin1002 [puppet] - 10https://gerrit.wikimedia.org/r/995108 (https://phabricator.wikimedia.org/T356054) [18:29:06] (03PS2) 10Scott French: P:httpbb: clean up after move from cumin1001 [puppet] - 10https://gerrit.wikimedia.org/r/995109 (https://phabricator.wikimedia.org/T356054) [18:30:16] (AppserversUnreachable) firing: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [18:36:00] !log sukhe@cumin2002 START - Cookbook sre.hosts.remove-downtime for doh1002.wikimedia.org [18:36:01] !log sukhe@cumin2002 END (PASS) - Cookbook 
sre.hosts.remove-downtime (exit_code=0) for doh1002.wikimedia.org [18:40:28] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:40:38] ^ expected [18:40:38] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:40:41] ^ expected [18:40:47] (03CR) 10Scott French: "Thank you both again!" [puppet] - 10https://gerrit.wikimedia.org/r/995108 (https://phabricator.wikimedia.org/T356054) (owner: 10Scott French) [18:41:38] RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:41:48] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 263, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:42:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P56260 and previous config saved to /var/cache/conftool/dbconfig/20240205-184239-marostegui.json [18:42:48] (03PS1) 10Majavah: Add a python-bookworm image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/997537 [18:44:19] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1394.eqiad.wmnet with OS bullseye [18:45:15] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1396.eqiad.wmnet with OS bullseye [18:45:52] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1408.eqiad.wmnet with OS bullseye [18:46:12] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 121, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:48:48] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, 
down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:50:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [18:51:22] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:51:32] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:52:30] PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:52:33] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10bking) @ayounsi Apologies for the trouble, I didn't realize `sretest2005` was in active use. Unfortunately, I reimaged it while I was working on T3... 
[18:52:36] PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:52:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.eqiad.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:53:54] RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:54:00] RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:57:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P56261 and previous config saved to /var/cache/conftool/dbconfig/20240205-185745-marostegui.json [18:57:51] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.eqiad.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:59:03] (03PS12) 10Andrew Bogott: OpenStack Designate: move from cloudservices to cloudcontrols in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/995369 (https://phabricator.wikimedia.org/T350995) [18:59:35] PROBLEM - BFD status on asw1-bw27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:00:35] RECOVERY - BFD status on asw1-bw27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:00:48] (03PS5) 10C. 
Scott Ananian: Turn on DT visual enhancements on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991039 (https://phabricator.wikimedia.org/T355374) [19:05:27] (03PS1) 10Andrew Bogott: Removed refs to openstack version 'yoga' [puppet] - 10https://gerrit.wikimedia.org/r/997538 [19:05:37] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack Designate: move from cloudservices to cloudcontrols in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/995369 (https://phabricator.wikimedia.org/T350995) (owner: 10Andrew Bogott) [19:09:05] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-eqiad: Apply updated JVM — T356648 - eevans@cumin1002 [19:09:08] T356648: Restart Cassandra on {restbase,sessionstore,aqs} to apply Java 1.8.0_402 upgrade - https://phabricator.wikimedia.org/T356648 [19:09:43] (03CR) 10CI reject: [V: 04-1] Removed refs to openstack version 'yoga' [puppet] - 10https://gerrit.wikimedia.org/r/997538 (owner: 10Andrew Bogott) [19:10:24] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-codfw: Apply updated JVM — T356648 - eevans@cumin1002 [19:11:39] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:11:44] ^ expected [19:12:31] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:12:33] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:12:49] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 89, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:12:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 
(T355609)', diff saved to https://phabricator.wikimedia.org/P56262 and previous config saved to /var/cache/conftool/dbconfig/20240205-191252-marostegui.json [19:12:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2157.codfw.wmnet with reason: Maintenance [19:12:58] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [19:13:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2157.codfw.wmnet with reason: Maintenance [19:13:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T355609)', diff saved to https://phabricator.wikimedia.org/P56263 and previous config saved to /var/cache/conftool/dbconfig/20240205-191315-marostegui.json [19:13:45] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 137, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:13:47] RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:15:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T355609)', diff saved to https://phabricator.wikimedia.org/P56264 and previous config saved to /var/cache/conftool/dbconfig/20240205-191537-marostegui.json [19:20:21] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2005.codfw.wmnet with OS bookworm [19:25:53] (03PS1) 10Andrew Bogott: Designate: replace mcrouter as part of designate services [puppet] - 10https://gerrit.wikimedia.org/r/997539 (https://phabricator.wikimedia.org/T350995) [19:25:57] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:26:17] PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:27:25] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:27:45] RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:28:51] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:wikidough-drmrs and A:wikidough [19:28:56] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/997539 (https://phabricator.wikimedia.org/T350995) (owner: 10Andrew Bogott) [19:29:56] !log eevans@cumin1002 conftool action : set/weight=0; selector: cluster=restbase,dc=codfw,name=restbase2015.codfw.wmnet [19:30:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P56265 and previous config saved to /var/cache/conftool/dbconfig/20240205-193043-marostegui.json [19:31:11] !log eevans@cumin1002 conftool action : set/weight=0; selector: cluster=restbase,dc=codfw,name=restbase2016.codfw.wmnet [19:31:18] !log eevans@cumin1002 conftool action : set/weight=0; selector: cluster=restbase,dc=codfw,name=restbase2017.codfw.wmnet [19:31:25] !log eevans@cumin1002 conftool action : set/weight=0; selector: cluster=restbase,dc=codfw,name=restbase2018.codfw.wmnet [19:31:35] !log eevans@cumin1002 conftool action : set/weight=0; selector: cluster=restbase,dc=codfw,name=restbase2020.codfw.wmnet [19:31:55] (03PS2) 10Andrew Bogott: Designate: replace mcrouter as part of designate services [puppet] - 10https://gerrit.wikimedia.org/r/997539 (https://phabricator.wikimedia.org/T350995) [19:32:11] !log eevans@cumin1002 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching A:aqs-codfw: Apply updated JVM — T356648 - eevans@cumin1002 [19:32:15] T356648: Restart Cassandra on {restbase,sessionstore,aqs} to apply Java 1.8.0_402 upgrade - https://phabricator.wikimedia.org/T356648 [19:34:23] 
(03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/997539 (https://phabricator.wikimedia.org/T350995) (owner: 10Andrew Bogott) [19:36:21] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs20[04-12].codfw.wmnet: Apply updated JVM — T356648 - eevans@cumin1002 [19:39:48] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/997539 (https://phabricator.wikimedia.org/T350995) (owner: 10Andrew Bogott) [19:40:33] (03PS1) 10Scott French: systemd::unit: clean up ownership file [puppet] - 10https://gerrit.wikimedia.org/r/997545 (https://phabricator.wikimedia.org/T356054) [19:41:47] (03CR) 10CI reject: [V: 04-1] systemd::unit: clean up ownership file [puppet] - 10https://gerrit.wikimedia.org/r/997545 (https://phabricator.wikimedia.org/T356054) (owner: 10Scott French) [19:41:57] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw1386.eqiad.wmnet with OS bullseye [19:42:34] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw1388.eqiad.wmnet with OS bullseye [19:43:03] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw1390.eqiad.wmnet with OS bullseye [19:43:06] !log sukhe@cumin2002 START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and A:durum [19:43:21] (03PS4) 10Jdlrobson: Enable desktop diff HTML on mobile pages for all logged in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994224 (https://phabricator.wikimedia.org/T350181) [19:44:25] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw1392.eqiad.wmnet with OS bullseye [19:45:26] (03PS2) 10Scott French: systemd::unit: clean up ownership file [puppet] - 10https://gerrit.wikimedia.org/r/997545 (https://phabricator.wikimedia.org/T356054) [19:45:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', 
diff saved to https://phabricator.wikimedia.org/P56266 and previous config saved to /var/cache/conftool/dbconfig/20240205-194550-marostegui.json [19:47:20] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1386.eqiad.wmnet with OS bullseye [19:47:29] (03PS1) 10DLynch: Use decodeURI for comment ID searches as well as heading searches [extensions/DiscussionTools] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/997279 (https://phabricator.wikimedia.org/T356199) [19:47:33] (03CR) 10Andrew Bogott: [C: 03+2] Designate: replace mcrouter as part of designate services [puppet] - 10https://gerrit.wikimedia.org/r/997539 (https://phabricator.wikimedia.org/T350995) (owner: 10Andrew Bogott) [19:50:13] (03PS1) 10Eevans: Remove EOL restbase hosts [puppet] - 10https://gerrit.wikimedia.org/r/997546 (https://phabricator.wikimedia.org/T352469) [19:51:25] (03CR) 10CI reject: [V: 04-1] Remove EOL restbase hosts [puppet] - 10https://gerrit.wikimedia.org/r/997546 (https://phabricator.wikimedia.org/T352469) (owner: 10Eevans) [19:55:46] (03PS2) 10Eevans: Remove EOL restbase hosts [puppet] - 10https://gerrit.wikimedia.org/r/997546 (https://phabricator.wikimedia.org/T352469) [19:58:15] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2317.codfw.wmnet with OS bullseye [20:00:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T355609)', diff saved to https://phabricator.wikimedia.org/P56267 and previous config saved to /var/cache/conftool/dbconfig/20240205-200056-marostegui.json [20:00:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2171.codfw.wmnet with reason: Maintenance [20:01:11] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [20:01:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2171.codfw.wmnet with reason: Maintenance [20:01:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling 
db2171:3315 (T355609)', diff saved to https://phabricator.wikimedia.org/P56268 and previous config saved to /var/cache/conftool/dbconfig/20240205-200119-marostegui.json [20:03:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T355609)', diff saved to https://phabricator.wikimedia.org/P56269 and previous config saved to /var/cache/conftool/dbconfig/20240205-200340-marostegui.json [20:04:24] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw1394.eqiad.wmnet with OS bullseye [20:04:42] (03PS1) 10Majavah: libraryupgrader: migrate repo to gitlab [puppet] - 10https://gerrit.wikimedia.org/r/997547 (https://phabricator.wikimedia.org/T341417) [20:04:44] (03PS1) 10Majavah: libraryupgrader: use system docker on newer Debian versions [puppet] - 10https://gerrit.wikimedia.org/r/997548 [20:04:46] (03PS1) 10Majavah: libraryupgrader: base_dir is not optional [puppet] - 10https://gerrit.wikimedia.org/r/997549 [20:04:48] (03PS1) 10Majavah: libraryupgrader: remove libup-web config [puppet] - 10https://gerrit.wikimedia.org/r/997550 [20:04:50] (03PS1) 10Majavah: libraryupgrader: add toggle for worker services [puppet] - 10https://gerrit.wikimedia.org/r/997551 [20:05:46] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw1408.eqiad.wmnet with OS bullseye [20:05:55] (03CR) 10CI reject: [V: 04-1] libraryupgrader: migrate repo to gitlab [puppet] - 10https://gerrit.wikimedia.org/r/997547 (https://phabricator.wikimedia.org/T341417) (owner: 10Majavah) [20:06:08] (03CR) 10CI reject: [V: 04-1] libraryupgrader: use system docker on newer Debian versions [puppet] - 10https://gerrit.wikimedia.org/r/997548 (owner: 10Majavah) [20:06:21] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw1396.eqiad.wmnet with OS bullseye [20:06:28] (03CR) 10CI reject: [V: 04-1] libraryupgrader: base_dir is not optional [puppet] - 
10https://gerrit.wikimedia.org/r/997549 (owner: 10Majavah) [20:06:42] (03PS2) 10Majavah: libraryupgrader: migrate repo to gitlab [puppet] - 10https://gerrit.wikimedia.org/r/997547 (https://phabricator.wikimedia.org/T341417) [20:06:44] (03PS2) 10Majavah: libraryupgrader: use system docker on newer Debian versions [puppet] - 10https://gerrit.wikimedia.org/r/997548 [20:06:46] (03PS2) 10Majavah: libraryupgrader: base_dir is not optional [puppet] - 10https://gerrit.wikimedia.org/r/997549 [20:06:48] (03PS2) 10Majavah: libraryupgrader: remove libup-web config [puppet] - 10https://gerrit.wikimedia.org/r/997550 [20:06:50] (03PS2) 10Majavah: libraryupgrader: add toggle for worker services [puppet] - 10https://gerrit.wikimedia.org/r/997551 [20:06:52] (03CR) 10CI reject: [V: 04-1] libraryupgrader: remove libup-web config [puppet] - 10https://gerrit.wikimedia.org/r/997550 (owner: 10Majavah) [20:11:23] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/997546 (https://phabricator.wikimedia.org/T352469) (owner: 10Eevans) [20:12:48] (PuppetZeroResources) firing: Puppet has failed generate resources on mw2319:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [20:16:31] (03PS1) 10Andrew Bogott: Further attempt to fix memcached port for designate [puppet] - 10https://gerrit.wikimedia.org/r/997553 (https://phabricator.wikimedia.org/T350995) [20:16:33] (03PS1) 10Andrew Bogott: Remove memcached cruft from codfw1dev cloudservice nodes [puppet] - 10https://gerrit.wikimedia.org/r/997554 (https://phabricator.wikimedia.org/T350995) [20:17:48] (03CR) 10Andrew Bogott: [C: 03+2] Further attempt to fix memcached port for designate [puppet] - 10https://gerrit.wikimedia.org/r/997553 (https://phabricator.wikimedia.org/T350995) (owner: 10Andrew Bogott) [20:17:48] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on 
mw2318:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [20:18:36] (03PS1) 10JHathaway: rsyslog: have rsyslog create its own files [puppet] - 10https://gerrit.wikimedia.org/r/997555 [20:18:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P56270 and previous config saved to /var/cache/conftool/dbconfig/20240205-201847-marostegui.json [20:19:14] (03PS2) 10JHathaway: rsyslog: have rsyslog create its own files [puppet] - 10https://gerrit.wikimedia.org/r/997555 [20:19:29] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/997555 (owner: 10JHathaway) [20:22:25] (03CR) 10CI reject: [V: 04-1] rsyslog: have rsyslog create its own files [puppet] - 10https://gerrit.wikimedia.org/r/997555 (owner: 10JHathaway) [20:22:49] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw2318:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [20:23:37] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:24:09] (03Abandoned) 10JHathaway: Add nagios_core & mailalias_core modules [puppet] - 10https://gerrit.wikimedia.org/r/763611 (https://phabricator.wikimedia.org/T265138) (owner: 10JHathaway) [20:24:28] (03Abandoned) 10JHathaway: rspamd example hiera data, DO NOT MERGE [puppet] - 10https://gerrit.wikimedia.org/r/874945 (owner: 10JHathaway) [20:26:39] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and A:durum [20:26:45] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 89, down: 
0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:27:48] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on mw2318:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [20:31:41] (03CR) 10Eevans: [C: 03+2] sessionstore: remove EOL hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/994830 (https://phabricator.wikimedia.org/T353402) (owner: 10Eevans) [20:31:57] (03CR) 10Eevans: [V: 03+2 C: 03+2] sessionstore: remove EOL hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/994830 (https://phabricator.wikimedia.org/T353402) (owner: 10Eevans) [20:33:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P56271 and previous config saved to /var/cache/conftool/dbconfig/20240205-203353-marostegui.json [20:34:05] !log eevans@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: apply [20:34:17] (03PS2) 10Andrew Bogott: Remove memcached cruft from codfw1dev cloudservice nodes [puppet] - 10https://gerrit.wikimedia.org/r/997554 (https://phabricator.wikimedia.org/T350995) [20:34:19] (03PS1) 10Andrew Bogott: Yet further attempt to fix memcached port for designate [puppet] - 10https://gerrit.wikimedia.org/r/997561 (https://phabricator.wikimedia.org/T350995) [20:34:23] !log eevans@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [20:35:53] !log eevans@deploy2002 helmfile [codfw] START helmfile.d/services/sessionstore: apply [20:36:03] !log eevans@deploy2002 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply [20:38:33] !log eevans@deploy2002 helmfile [eqiad] START helmfile.d/services/sessionstore: apply [20:38:58] (03CR) 10Andrew Bogott: [C: 03+2] Yet further attempt to fix memcached port for designate [puppet] - 
10https://gerrit.wikimedia.org/r/997561 (https://phabricator.wikimedia.org/T350995) (owner: 10Andrew Bogott) [20:39:49] !log eevans@deploy2002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply [20:42:48] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on mw2318:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [20:44:18] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs20[04-12].codfw.wmnet: Apply updated JVM — T356648 - eevans@cumin1002 [20:44:22] T356648: Restart Cassandra on {restbase,sessionstore,aqs} to apply Java 1.8.0_402 upgrade - https://phabricator.wikimedia.org/T356648 [20:47:10] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore[2001-2003].codfw.wmnet: Apply updated JVM — T356648 - eevans@cumin1002 [20:49:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T355609)', diff saved to https://phabricator.wikimedia.org/P56272 and previous config saved to /var/cache/conftool/dbconfig/20240205-204900-marostegui.json [20:49:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2178.codfw.wmnet with reason: Maintenance [20:49:07] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [20:49:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2178.codfw.wmnet with reason: Maintenance [20:49:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T355609)', diff saved to https://phabricator.wikimedia.org/P56273 and previous config saved to /var/cache/conftool/dbconfig/20240205-204922-marostegui.json [20:51:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T355609)', diff saved to 
https://phabricator.wikimedia.org/P56274 and previous config saved to /var/cache/conftool/dbconfig/20240205-205144-marostegui.json [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240205T2100). [21:00:05] cjming, cscott, Jdlrobson, and kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:08] o/ i can deploy [21:00:36] 👋🏻 [21:00:44] i'll start with mine and move down the queue [21:01:10] I’m semi-distracted, but also don’t have much to do to test mine. [21:01:10] thanks Kemayo -- i'll merge yours now so we don't have to wait forever [21:01:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992541 (https://phabricator.wikimedia.org/T355360) (owner: 10Clare Ming) [21:02:18] (03CR) 10Clare Ming: [C: 03+2] Use decodeURI for comment ID searches as well as heading searches [extensions/DiscussionTools] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/997279 (https://phabricator.wikimedia.org/T356199) (owner: 10DLynch) [21:02:34] (03Merged) 10jenkins-bot: Update Android Metrics Platform stream configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992541 (https://phabricator.wikimedia.org/T355360) (owner: 10Clare Ming) [21:02:47] !log cjming@deploy2002 Started scap: Backport for [[gerrit:992541|Update Android Metrics Platform stream configs (T355360)]] [21:02:58] T355360: [Java] Ensure that missing client data from Android article events is populated - https://phabricator.wikimedia.org/T355360 [21:03:39] cjming: also here [21:04:28] Jdlrobson: cool - i'll do yours next [21:05:29] !log cjming@deploy2002 cjming: Backport for 
[[gerrit:992541|Update Android Metrics Platform stream configs (T355360)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:05:37] !log cjming@deploy2002 cjming: Continuing with sync [21:05:53] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore[2001-2003].codfw.wmnet: Apply updated JVM — T356648 - eevans@cumin1002 [21:06:01] unless cscott - are you around? you're actually next up in the queue - if not, i'll move onto Jon's [21:06:06] T356648: Restart Cassandra on {restbase,sessionstore,aqs} to apply Java 1.8.0_402 upgrade - https://phabricator.wikimedia.org/T356648 [21:06:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P56275 and previous config saved to /var/cache/conftool/dbconfig/20240205-210650-marostegui.json [21:09:21] (03Merged) 10jenkins-bot: Use decodeURI for comment ID searches as well as heading searches [extensions/DiscussionTools] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/997279 (https://phabricator.wikimedia.org/T356199) (owner: 10DLynch) [21:10:36] cjming i'm here sorry [21:10:57] cscott: great - just in time - i'll do yours next [21:11:01] great! [21:11:58] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:992541|Update Android Metrics Platform stream configs (T355360)]] (duration: 09m 10s) [21:12:02] T355360: [Java] Ensure that missing client data from Android article events is populated - https://phabricator.wikimedia.org/T355360 [21:12:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991039 (https://phabricator.wikimedia.org/T355374) (owner: 10C. Scott Ananian) [21:12:51] cjming thanks [21:13:00] np! 
[21:15:25] (03PS3) 10Andrew Bogott: Remove memcached cruft from codfw1dev cloudservice nodes [puppet] - 10https://gerrit.wikimedia.org/r/997554 (https://phabricator.wikimedia.org/T350995) [21:15:27] (03PS1) 10Andrew Bogott: designate pools.yaml: better distinguish between designate and pdns hosts [puppet] - 10https://gerrit.wikimedia.org/r/997576 (https://phabricator.wikimedia.org/T350995) [21:15:39] (03Merged) 10jenkins-bot: Turn on DT visual enhancements on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991039 (https://phabricator.wikimedia.org/T355374) (owner: 10C. Scott Ananian) [21:16:16] (03PS14) 10BCornwall: Add module for ncmonitor [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) [21:16:17] !log cjming@deploy2002 Started scap: Backport for [[gerrit:991039|Turn on DT visual enhancements on wikitech (T355374)]] [21:16:21] T355374: Use Parsoid for DiscussionTools on wikitech - https://phabricator.wikimedia.org/T355374 [21:16:28] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2317.codfw.wmnet with OS bullseye [21:17:38] !log cjming@deploy2002 cjming and cscott: Backport for [[gerrit:991039|Turn on DT visual enhancements on wikitech (T355374)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:17:42] cscott: can you test? [21:17:59] up, it's on the canary now? [21:18:17] ya - on test servers [21:19:51] huh, the wikimedia debug extension says "unsupported domain" for wikitech.wikimedia.org, which ought to be the only site affected [21:19:56] (03CR) 10Jforrester: [C: 03+1] libraryupgrader: migrate repo to gitlab [puppet] - 10https://gerrit.wikimedia.org/r/997547 (https://phabricator.wikimedia.org/T341417) (owner: 10Majavah) [21:20:01] i'm not sure i actually know how to canary test on wikitech? 
[21:20:11] x-w-d does not support wikitech [21:20:24] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore[1001-1006].eqiad.wmnet: Apply updated JVM — T356648 - eevans@cumin1002 [21:20:28] T356648: Restart Cassandra on {restbase,sessionstore,aqs} to apply Java 1.8.0_402 upgrade - https://phabricator.wikimedia.org/T356648 [21:20:38] (PS2) Andrew Bogott: designate pools.yaml: better distinguish between designate and pdns hosts [puppet] - https://gerrit.wikimedia.org/r/997576 (https://phabricator.wikimedia.org/T350995) [21:20:39] gtk -- should i sync anyway? [21:20:40] (PS4) Andrew Bogott: Remove memcached cruft from codfw1dev cloudservice nodes [puppet] - https://gerrit.wikimedia.org/r/997554 (https://phabricator.wikimedia.org/T350995) [21:20:59] (PS15) BCornwall: Add module for ncmonitor [puppet] - https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) [21:21:01] cjming: i guess so? [21:21:09] k - syncing [21:21:12] !log cjming@deploy2002 cjming and cscott: Continuing with sync [21:21:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P56276 and previous config saved to /var/cache/conftool/dbconfig/20240205-212157-marostegui.json [21:22:18] (CR) CI reject: [V: -1] Add module for ncmonitor [puppet] - https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) (owner: BCornwall) [21:24:02] (CR) Andrew Bogott: [C: +2] designate pools.yaml: better distinguish between designate and pdns hosts [puppet] - https://gerrit.wikimedia.org/r/997576 (https://phabricator.wikimedia.org/T350995) (owner: Andrew Bogott) [21:24:51] (CR) BCornwall: "recheck" [puppet] - https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) (owner: BCornwall) [21:27:30] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:991039|Turn on DT visual
enhancements on wikitech (T355374)]] (duration: 11m 12s) [21:27:33] cscott: should be live - hopefully won't need to revert but lmk [21:27:33] T355374: Use Parsoid for DiscussionTools on wikitech - https://phabricator.wikimedia.org/T355374 [21:27:53] (PS5) Clare Ming: Enable desktop diff HTML on mobile pages for all logged in users [mediawiki-config] - https://gerrit.wikimedia.org/r/994224 (https://phabricator.wikimedia.org/T350181) (owner: Jdlrobson) [21:28:56] cjming: looks good, thanks. [21:29:16] cjming: ready? [21:29:41] Jdlrobson: ready! just rebasing yours [21:30:05] (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/994224 (https://phabricator.wikimedia.org/T350181) (owner: Jdlrobson) [21:30:42] cscott: glad to hear it [21:31:14] (Merged) jenkins-bot: Enable desktop diff HTML on mobile pages for all logged in users [mediawiki-config] - https://gerrit.wikimedia.org/r/994224 (https://phabricator.wikimedia.org/T350181) (owner: Jdlrobson) [21:31:28] !log cjming@deploy2002 Started scap: Backport for [[gerrit:994224|Enable desktop diff HTML on mobile pages for all logged in users (T350181)]] [21:31:32] T350181: Enable desktop diff page on mobile site - https://phabricator.wikimedia.org/T350181 [21:32:49] !log cjming@deploy2002 jdlrobson and cjming: Backport for [[gerrit:994224|Enable desktop diff HTML on mobile pages for all logged in users (T350181)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:32:52] Jdlrobson: can you test? [21:33:09] yep [21:34:18] LGTM cjming !
[21:34:32] cool - syncing [21:34:37] !log cjming@deploy2002 jdlrobson and cjming: Continuing with sync [21:36:13] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2003.codfw.wmnet with OS bullseye [21:37:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T355609)', diff saved to https://phabricator.wikimedia.org/P56277 and previous config saved to /var/cache/conftool/dbconfig/20240205-213703-marostegui.json [21:37:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2192.codfw.wmnet with reason: Maintenance [21:37:07] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [21:37:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2192.codfw.wmnet with reason: Maintenance [21:37:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T355609)', diff saved to https://phabricator.wikimedia.org/P56278 and previous config saved to /var/cache/conftool/dbconfig/20240205-213726-marostegui.json [21:39:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T355609)', diff saved to https://phabricator.wikimedia.org/P56279 and previous config saved to /var/cache/conftool/dbconfig/20240205-213947-marostegui.json [21:40:19] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2003.codfw.wmnet with OS bullseye [21:40:57] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:994224|Enable desktop diff HTML on mobile pages for all logged in users (T350181)]] (duration: 09m 29s) [21:41:03] Jdlrobson: should be live! [21:41:13] T350181: Enable desktop diff page on mobile site - https://phabricator.wikimedia.org/T350181 [21:41:41] Kemayo: doing yours now [21:41:56] thanks cjming ! [21:42:05] np! 
[21:42:29] !log cjming@deploy2002 Started scap: Backport for [[gerrit:997279|Use decodeURI for comment ID searches as well as heading searches (T356199)]] [21:42:44] T356199: Diacritics in talk pages permalinks cause comments to not being found - https://phabricator.wikimedia.org/T356199 [21:43:15] (PS1) Jdlrobson: Enable desktop diff for anonymous users as well [mediawiki-config] - https://gerrit.wikimedia.org/r/997585 (https://phabricator.wikimedia.org/T350181) [21:43:49] !log cjming@deploy2002 cjming and kemayo: Backport for [[gerrit:997279|Use decodeURI for comment ID searches as well as heading searches (T356199)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:44:21] Kemayo: are you able to test? [21:44:48] cjming: Sure, I will look at it now. [21:49:03] cjming: It seems to be working fine [21:49:11] great - syncing [21:49:15] !log cjming@deploy2002 cjming and kemayo: Continuing with sync [21:49:35] (Sorry, took me a minute of forgetting that enwiki didn't have this specific feature enabled before I properly tested it elsewhere.
😅) [21:49:54] no worries :) [21:54:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P56280 and previous config saved to /var/cache/conftool/dbconfig/20240205-215454-marostegui.json [21:55:38] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:997279|Use decodeURI for comment ID searches as well as heading searches (T356199)]] (duration: 13m 08s) [21:55:41] T356199: Diacritics in talk pages permalinks cause comments to not being found - https://phabricator.wikimedia.org/T356199 [21:55:46] Kemayo: should be live [21:56:20] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore[1001-1006].eqiad.wmnet: Apply updated JVM — T356648 - eevans@cumin1002 [21:56:23] and with that - closing the window [21:56:25] cjming: Confirmed, working on live 👍🏻 [21:56:25] T356648: Restart Cassandra on {restbase,sessionstore,aqs} to apply Java 1.8.0_402 upgrade - https://phabricator.wikimedia.org/T356648 [21:56:46] Kemayo: yay! [21:56:56] !log end of UTC late backport window [21:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:51] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: Apply updated JVM — T356648 - eevans@cumin1002 [22:00:04] Reedy, sbassett, Maryum, and manfredi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240205T2200). 
[22:01:37] !log brett@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncmonitor1001.eqiad.wmnet [22:01:39] !log brett@cumin2002 START - Cookbook sre.dns.netbox [22:03:57] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncmonitor1001.eqiad.wmnet - brett@cumin2002" [22:04:51] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncmonitor1001.eqiad.wmnet - brett@cumin2002" [22:04:51] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:04:51] !log brett@cumin2002 START - Cookbook sre.dns.wipe-cache ncmonitor1001.eqiad.wmnet on all recursors [22:04:54] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncmonitor1001.eqiad.wmnet on all recursors [22:05:20] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncmonitor1001.eqiad.wmnet - brett@cumin2002" [22:05:54] !log Decommissioning Cassandra, sessionstore1001 — T353405 [22:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:57] T353405: Decommission sessionstore100[1-3]) - https://phabricator.wikimedia.org/T353405 [22:06:12] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncmonitor1001.eqiad.wmnet - brett@cumin2002" [22:10:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P56281 and previous config saved to /var/cache/conftool/dbconfig/20240205-221001-marostegui.json [22:12:32] PROBLEM - cassandra-a CQL 10.64.0.144:9042 on sessionstore1001 is CRITICAL: connect to address 10.64.0.144 and port 
9042: Connection refused https://phabricator.wikimedia.org/T93886 [22:12:36] PROBLEM - cassandra-a service on sessionstore1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:12:56] PROBLEM - cassandra-a SSL 10.64.0.144:7000 on sessionstore1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:14:34] (PS16) BCornwall: Add module for ncmonitor [puppet] - https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) [22:20:38] (CR) Ryan Kemper: [C: +1] cloudelastic: Begin private IP migration for cloudelastic1009 [puppet] - https://gerrit.wikimedia.org/r/995223 (https://phabricator.wikimedia.org/T355617) (owner: Bking) [22:20:38] PROBLEM - cassandra-a SSL 10.64.32.85:7000 on sessionstore1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:21:04] PROBLEM - cassandra-a CQL 10.64.32.85:9042 on sessionstore1002 is CRITICAL: connect to address 10.64.32.85 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [22:21:48] (CR) Bking: [C: +2] cloudelastic: Begin private IP migration for cloudelastic1009 [puppet] - https://gerrit.wikimedia.org/r/995223 (https://phabricator.wikimedia.org/T355617) (owner: Bking) [22:21:51] that's me ^^^ I forgot to downtime those [22:22:52] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on sessionstore1001.eqiad.wmnet with reason: Decommissioning — T353405 [22:22:56] T353405: Decommission sessionstore100[1-3]) - https://phabricator.wikimedia.org/T353405 [22:23:06] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on sessionstore1001.eqiad.wmnet with reason:
Decommissioning — T353405 [22:23:11] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on sessionstore1002.eqiad.wmnet with reason: Decommissioning — T353405 [22:23:25] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on sessionstore1002.eqiad.wmnet with reason: Decommissioning — T353405 [22:23:30] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on sessionstore1003.eqiad.wmnet with reason: Decommissioning — T353405 [22:23:42] !log bking@cumin2002 conftool action : set/pooled=no; selector: name=cloudelastic1009.wikimedia.org [22:23:45] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on sessionstore1003.eqiad.wmnet with reason: Decommissioning — T353405 [22:24:37] hi, the last deployment window seems to have caused https://phabricator.wikimedia.org/T356711 [22:25:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T355609)', diff saved to https://phabricator.wikimedia.org/P56282 and previous config saved to /var/cache/conftool/dbconfig/20240205-222507-marostegui.json [22:25:15] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cloudelastic1009.wikimedia.org for migrate cloudelastic1009 to private IP - bking@cumin2002 - T355617 [22:25:15] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [22:25:19] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cloudelastic1009.wikimedia.org for migrate cloudelastic1009 to private IP - bking@cumin2002 - T355617 [22:25:20] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [22:25:27] the mobile diff config change seems like the only vaguely related thing [22:25:32] anyone wants to try reverting? [22:27:29] MatmaRex: hey just catching up. 
[22:28:29] (PS5) Andrew Bogott: Remove memcached cruft from codfw1dev cloudservice nodes [puppet] - https://gerrit.wikimedia.org/r/997554 (https://phabricator.wikimedia.org/T350995) [22:28:31] (PS1) Andrew Bogott: Allow pdns to query designate-mdns on private interfaces [puppet] - https://gerrit.wikimedia.org/r/997597 (https://phabricator.wikimedia.org/T350995) [22:28:37] MatmaRex: i see what's happened here [22:28:40] I'll submit a patch [22:28:50] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cloudelastic1009.wikimedia.org with reason: T355617 [22:29:01] thanks [22:29:06] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cloudelastic1009.wikimedia.org with reason: T355617 [22:29:22] Basically a MobileFrontend hook is applying outside mobile view :/ [22:30:15] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply cluster settings before private IP migration - bking@cumin2002 - T355617 [22:31:54] MatmaRex: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/997598 [22:32:25] looks straightforward, i gave it a +2 [22:33:16] cool. Do we have someone who can backport? [22:36:00] (PS3) JHathaway: rsyslog: have rsyslog create its own files [puppet] - https://gerrit.wikimedia.org/r/997555 [22:38:53] i can't. you might have to ping deployers or something [22:40:18] i can deploy if you can test it? [22:40:38] ah lol there is a testing instruction on the task [22:41:58] i need to go, but y'all have my blessing.
thanks :) [22:42:15] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/997555 (owner: JHathaway) [22:42:25] also, you should take credit for the fastest UBN bug fix ever ;) good night [22:42:37] (PS1) Zabe: MobileFrontend hook should not apply outside mobile view [extensions/MobileFrontend] (wmf/1.42.0-wmf.16) - https://gerrit.wikimedia.org/r/997281 (https://phabricator.wikimedia.org/T356711) [22:43:10] (CR) Zabe: [C: +2] MobileFrontend hook should not apply outside mobile view [extensions/MobileFrontend] (wmf/1.42.0-wmf.16) - https://gerrit.wikimedia.org/r/997281 (https://phabricator.wikimedia.org/T356711) (owner: Zabe) [22:43:28] good night :) [22:55:01] thanks zabe [22:55:04] i can test this [22:55:32] cool: [23:02:24] (Merged) jenkins-bot: MobileFrontend hook should not apply outside mobile view [extensions/MobileFrontend] (wmf/1.42.0-wmf.16) - https://gerrit.wikimedia.org/r/997281 (https://phabricator.wikimedia.org/T356711) (owner: Zabe) [23:03:06] !log zabe@deploy2002 Started scap: Backport for [[gerrit:997281|MobileFrontend hook should not apply outside mobile view (T356711)]] [23:03:13] T356711: Preference "Do not show page content below diffs" stopped working on 5 Feb 2024 - https://phabricator.wikimedia.org/T356711 [23:04:29] !log zabe@deploy2002 zabe: Backport for [[gerrit:997281|MobileFrontend hook should not apply outside mobile view (T356711)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:04:32] Jdlrobson: could you test please?
^ :) [23:04:51] yep [23:05:41] zabe: lgtm [23:05:43] please sync [23:06:02] !log zabe@deploy2002 zabe: Continuing with sync [23:12:25] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:997281|MobileFrontend hook should not apply outside mobile view (T356711)]] (duration: 09m 18s) [23:12:31] Jdlrobson: should be live [23:12:35] T356711: Preference "Do not show page content below diffs" stopped working on 5 Feb 2024 - https://phabricator.wikimedia.org/T356711 [23:15:34] zabe: thanks! did you want to reply the VPT to update everyone? [23:17:14] sure, can do [23:18:09] :o [23:19:58] !log eevans@cumin1002 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching A:restbase-eqiad: Apply updated JVM — T356648 - eevans@cumin1002 [23:20:02] T356648: Restart Cassandra on {restbase,sessionstore,aqs} to apply Java 1.8.0_402 upgrade - https://phabricator.wikimedia.org/T356648 [23:26:09] (PS2) Zabe: Update mediawiki/mediawiki-codesniffer to 43.0.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/996404 [23:27:08] (CR) CI reject: [V: -1] Update mediawiki/mediawiki-codesniffer to 43.0.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/996404 (owner: Zabe) [23:27:33] (PS3) Zabe: Update mediawiki/mediawiki-codesniffer to 43.0.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/996404 [23:30:08] (CR) Zabe: "I looked through the stuff again and I think that we can keep the vertical alignment in ProductionServices.php:" [mediawiki-config] - https://gerrit.wikimedia.org/r/996404 (owner: Zabe) [23:31:22] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase[1028-1033].eqiad.wmnet: Apply updated JVM — T356648 - eevans@cumin1002 [23:31:26] T356648: Restart Cassandra on {restbase,sessionstore,aqs} to apply Java 1.8.0_402 upgrade - https://phabricator.wikimedia.org/T356648 [23:31:26] Thanks zabe for your help with that one!
:) [23:31:44] yw :) [23:39:13] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply cluster settings before private IP migration - bking@cumin2002 - T355617 [23:39:16] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617