[00:01:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [00:08:51] (SystemdUnitFailed) firing: (4) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:10:32] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:11:42] RECOVERY - BFD status on cr1-eqiad is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:21:44] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:22:12] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:30:22] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 137, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:31:02] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:31:32] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:35:02] PROBLEM - 
Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 137, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:39:03] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1002420 [00:39:09] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1002420 (owner: 10TrainBranchBot) [01:02:44] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [01:03:50] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T357373 (10phaultfinder) [01:04:07] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1002420 (owner: 10TrainBranchBot) [01:11:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [01:12:28] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [01:15:09] (HelmReleaseBadStatus) firing: Helm release miscweb/research-landing-page on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:32:06] (03CR) 10Ssingh: [C: 03+1] ncmonitor: Add partman config [puppet] - 10https://gerrit.wikimedia.org/r/1002674 (https://phabricator.wikimedia.org/T356710) (owner: 10BCornwall) [02:38:05] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance 
[02:38:18] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [02:38:49] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240213T0300) [03:07:40] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.42.0-wmf.18 [core] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1002421 (https://phabricator.wikimedia.org/T354436) [03:07:44] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.42.0-wmf.18 [core] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1002421 (https://phabricator.wikimedia.org/T354436) (owner: 10TrainBranchBot) [03:13:49] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:18:04] PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [03:19:22] RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.13 ms [03:23:22] PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [03:30:16] RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.96 ms [03:30:42] (03Merged) 10jenkins-bot: Branch commit for wmf/1.42.0-wmf.18 [core] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1002421 (https://phabricator.wikimedia.org/T354436) (owner: 10TrainBranchBot) [04:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and 
vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240213T0400) [04:02:11] !log mwpresync@deploy2002 Pruned MediaWiki: 1.42.0-wmf.15 (duration: 02m 09s) [04:03:29] (03PS1) 10TrainBranchBot: testwikis wikis to 1.42.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1002758 (https://phabricator.wikimedia.org/T354436) [04:03:31] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.42.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1002758 (https://phabricator.wikimedia.org/T354436) (owner: 10TrainBranchBot) [04:04:12] (03Merged) 10jenkins-bot: testwikis wikis to 1.42.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1002758 (https://phabricator.wikimedia.org/T354436) (owner: 10TrainBranchBot) [04:04:39] !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.18 refs T354436 [04:04:43] T354436: 1.42.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T354436 [04:09:47] (SystemdUnitFailed) firing: (4) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:22:41] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T357377 (10phaultfinder) [04:57:15] !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.18 refs T354436 (duration: 52m 36s) [04:57:22] T354436: 1.42.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T354436 [05:15:10] (HelmReleaseBadStatus) firing: Helm release miscweb/research-landing-page on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - 
https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:23:50] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T357379 (10phaultfinder) [05:32:18] PROBLEM - MD RAID on mw2442 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [05:32:19] ACKNOWLEDGEMENT - MD RAID on mw2442 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T357380 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [05:32:24] 10SRE, 10ops-codfw: Degraded RAID on mw2442 - https://phabricator.wikimedia.org/T357380 (10ops-monitoring-bot) [05:54:28] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 1 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:56:48] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:13:30] 10SRE-swift-storage, 10Commons, 10Data-Persistence-Backup, 10MediaWiki-File-management, 10media-backups: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (10Bawolff) https://commons.wikimedia.org/wiki/File:Algol,_Trag%C3%B6die_der_Macht_(1920)_by_Hans_Werckmeister.webm... [06:13:42] (03PS1) 10Brian Wolff: Increase $wgMaxUploadSize to 5 GiB (previously was 4GiB). 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1002813 (https://phabricator.wikimedia.org/T191804) [06:14:37] 10SRE-swift-storage, 10Commons, 10Data-Persistence-Backup, 10MediaWiki-File-management, and 3 others: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (10Bawolff) [06:28:12] PROBLEM - Disk space on vrts1002 is CRITICAL: DISK CRITICAL - free space: /srv/otrs-data 18786 MB (3% inode=61%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=vrts1002&var-datasource=eqiad+prometheus/ops [06:51:18] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:51:48] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:54:18] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:55:14] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51453 bytes in 4.682 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:55:18] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:55:50] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.287 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240213T0700) [07:00:05] kormat, marostegui, Amir1, and arnaudb: OwO what's this, a deployment window?? Primary database switchover. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240213T0700). nyaa~ [07:00:48] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance [07:01:02] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance [07:01:10] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 1 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:02:18] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:40:14] (03CR) 10Stevemunene: [C: 03+1] idp: Register superset and superset-next IDP services [puppet] - 10https://gerrit.wikimedia.org/r/1002462 (https://phabricator.wikimedia.org/T353794) (owner: 10Brouberol) [07:46:14] (03PS1) 10Giuseppe Lavagetto: Remove useless fixtures from services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002917 [07:58:58] (03CR) 10Muehlenhoff: [C: 03+1] "This looks good, but just for the avoidance of doubt since the patch header states "idp": This only enables the service for idp-test.w.o, " [puppet] - 10https://gerrit.wikimedia.org/r/1002462 (https://phabricator.wikimedia.org/T353794) (owner: 10Brouberol) [08:00:00] (03Restored) 10Ammarpad: ruwiki: Add 'edituserjson' right to 'engineers' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992206 (https://phabricator.wikimedia.org/T355499) (owner: 10Ammarpad) [08:00:04] Amir1 and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240213T0800). 
[08:00:05] bawolff: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:11] * bawolff waves [08:00:17] (03PS2) 10Ammarpad: ruwiki: Add 'edituserjson' right to 'engineers' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992206 (https://phabricator.wikimedia.org/T355499) [08:09:47] (SystemdUnitFailed) firing: (4) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:10:14] (03Abandoned) 10Volans: wmflib: add netmask_to_cidr parser function [puppet] - 10https://gerrit.wikimedia.org/r/618765 (owner: 10Volans) [08:13:23] bawolff: is your patch being deployed? [08:13:55] hashar: not yet [08:14:43] who knows what is going to happen in the chain between the user and the Swift storage :D [08:15:20] then I guess the uploaded files are chunked? There is also something on the apache/php side iirc [08:15:56] Yes, this only applies to chunked upload and upload by url. Normal Special:Upload won't be affected by this change and will stay at 100 MB [08:16:16] * hashar loves the misleading name of wgMaxUploadSize [08:16:56] let's roll it [08:16:58] It's not entirely misleading: normal upload takes it into account, it just takes the min of that variable and the apache/php config [08:17:08] ty :) [08:17:14] how did you get a 4GB+ file uploaded to commons? 
;) [08:17:33] importImages.php doesn't listen to $wgMaxUploadSize [08:17:38] ahhh [08:18:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1002813 (https://phabricator.wikimedia.org/T191804) (owner: 10Brian Wolff) [08:18:17] and I guess there is no point in testing it by uploading a 4GB+ file [08:18:17] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: grafana [08:18:27] There is another limit in FileBackend it does listen to. It's amazing how many different upload limits exist [08:18:43] different projects, different times :D [08:18:44] yeah, that'd take a while [08:18:45] (03Merged) 10jenkins-bot: Increase $wgMaxUploadSize to 5 GiB (previously was 4GiB). [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1002813 (https://phabricator.wikimedia.org/T191804) (owner: 10Brian Wolff) [08:18:58] historically we had Special:Upload saving directly on disk [08:19:09] Aaron wrote the File abstraction system and added support for Swift [08:19:19] and we had some upload reboot done via UploadWizard [08:19:29] and at some point chunked uploads got added on top of that [08:19:36] We can test by looking at the limit on https://commons.wikimedia.org/w/api.php?action=query&meta=siteinfo i guess [08:19:38] anyway, you have been there, you know the story :] [08:19:52] Most of it... some of it was before me [08:19:58] !log hashar@deploy2002 Started scap: Backport for [[gerrit:1002813|Increase $wgMaxUploadSize to 5 GiB (previously was 4GiB). (T191804)]] [08:20:03] T191804: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 [08:20:16] there was also a system to mass import files from GLAM [08:20:30] I was definitely there for gwtoolset... 
that was a bit of a mess [08:20:31] which was letting people send a list of files to us and MediaWiki would fetch/import them [08:20:36] yeah gwtoolset :) [08:20:39] (03PS1) 10Muehlenhoff: Switch grafana to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1002918 (https://phabricator.wikimedia.org/T349619) [08:20:40] i liked the idea [08:21:02] I think there were a lot of communication problems between WMF and the team from the chapter that was working on it [08:21:22] sounds familiar :] [08:21:26] lol [08:21:36] it always boils down to communication problems [08:21:41] !log hashar@deploy2002 hashar and bawolff: Backport for [[gerrit:1002813|Increase $wgMaxUploadSize to 5 GiB (previously was 4GiB). (T191804)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:21:53] see visualeditor deployment or super protect [08:21:53] ;) [08:21:58] !log hashar@deploy2002 hashar and bawolff: Continuing with sync [08:22:07] I can see https://commons.wikimedia.org/w/api.php?action=query&meta=siteinfo has the higher limit on mwdebug [08:22:11] \o/ [08:22:21] 10SRE, 10Traffic: Cannot edit wikipedia from my work computer - https://phabricator.wikimedia.org/T356799 (10Rijikk) [08:22:39] I don't know whether it makes any sense to have 4GB+ files uploaded to commons [08:22:44] (03PS1) 10Ayounsi: Set primary_ixp: true on cr2-esams [homer/public] - 10https://gerrit.wikimedia.org/r/1002919 (https://phabricator.wikimedia.org/T322630) [08:22:48] Honestly, i think we should implement an s3 compatible upload endpoint, then people could just use one of the ten billion tools for dealing with s3 instead of reinventing the wheel [08:22:53] (03CR) 10Muehlenhoff: [C: 03+2] Switch grafana to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1002918 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:22:58] then really, that sounds like a product decision and if people ask for it and we can do it ... 
there is no reason not to [08:23:05] and it is great to see old tasks being addressed \o/ [08:23:31] It seems to be mostly people wanting to upload old PD movies that are in 4K [08:23:35] Special:S3 [08:23:58] well I am not sure there is much point in hosting public domain movies for an encyclopedia / knowledge project ;-] [08:24:25] I guess if you're writing an article about the movie in question, its cool to be able to just play it [08:24:31] I wonder whether we have metrics regarding usage of those files, maybe it is just hoarding of public domain stuff [08:24:34] ah [08:24:41] touché, you have a point ;-] [08:25:00] anyway, I am happy to assist [08:25:01] although I'm sure someone out there is probably going to just upload a 50 trillion pixel png file of a fractal [08:25:11] Thanks :) [08:27:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: grafana [08:28:09] 10SRE-swift-storage, 10Commons, 10Data-Persistence-Backup, 10MediaWiki-File-management, and 3 others: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (10hashar) >>! In T191804#9321105, @AlexisJazz wrote: > https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(proposals)#... [08:28:34] 10SRE-swift-storage, 10Commons, 10Data-Persistence-Backup, 10MediaWiki-File-management, and 3 others: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (10hashar) [08:28:53] bawolff: an AI would be able to detect it and flatten it to the underlying algorithm used to generate said fractal :D [08:28:53] Commons always uploads weird things just cause. File:Bible,_King_James_Version.svg is one of my favourite pointless huge uploads [08:28:55] !log hashar@deploy2002 Finished scap: Backport for [[gerrit:1002813|Increase $wgMaxUploadSize to 5 GiB (previously was 4GiB). (T191804)]] (duration: 08m 57s) [08:29:00] T191804: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 [08:29:19] svg? 
[08:29:37] with each letter individually drawn with vectors? :D [08:29:41] yep [08:30:00] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [08:30:21] well I guess that one can be candidate for a speedy deletion [08:30:51] quoting its description "It is not intended for use with anything serious" [08:30:52] :) [08:31:16] Commons does literally have fractals that are just, how high resolution can you go. Like [[File:Panorama in der Mandelbox "Storage repository" 20231030.png]] [08:31:33] https://commons.wikimedia.org/wiki/Commons:Deletion_requests/File:Bible,_King_James_Version.svg "I think this is a useful stress test of Wikimedia's rendering capabilities" [08:31:35] ouch [08:32:40] anyway your change has been deployed :-] [08:32:45] (03PS1) 10Mforns: edit*-analytics: update mediawiki_history snapshot to 2024_01 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002920 [08:32:51] Who needs DoS attacks when you have users? 
[08:32:56] thanks :) [08:33:18] yeah [08:33:43] 10SRE-swift-storage, 10Commons, 10Data-Persistence-Backup, 10MediaWiki-File-management, and 2 others: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (10Bawolff) 05Open→03Resolved a:03Bawolff [08:33:49] then ultra large files that are uploaded just to make a point should probably be garbage collected [08:40:21] (03CR) 10Slyngshede: [C: 03+2] PuppetPendingCertificateRequest linting fails in production [alerts] - 10https://gerrit.wikimedia.org/r/1002453 (owner: 10Slyngshede) [08:41:58] (03Merged) 10jenkins-bot: PuppetPendingCertificateRequest linting fails in production [alerts] - 10https://gerrit.wikimedia.org/r/1002453 (owner: 10Slyngshede) [08:43:15] 10SRE, 10Traffic: Cannot edit wikipedia from my work computer - https://phabricator.wikimedia.org/T356799 (10taavi) Please include the details under "If you report this error to the Wikimedia System Administrators, please include the details below." in the footer. 
[08:46:28] (03CR) 10Slyngshede: [C: 03+2] P::installserver::proxy Blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/994686 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [08:47:41] (03PS3) 10Brouberol: idp_test: Register superset and superset-next IDP services [puppet] - 10https://gerrit.wikimedia.org/r/1002462 (https://phabricator.wikimedia.org/T353794) [08:48:01] (03CR) 10Brouberol: "Thanks Moritz, I updated the commit message header" [puppet] - 10https://gerrit.wikimedia.org/r/1002462 (https://phabricator.wikimedia.org/T353794) (owner: 10Brouberol) [08:48:21] 10SRE, 10ops-codfw: PowerSupplyFailure - mw2389 - https://phabricator.wikimedia.org/T357377 (10Peachey88) [08:48:45] 10SRE, 10ops-codfw: Inbound interface errors - asw-c-codfw - https://phabricator.wikimedia.org/T357373 (10Peachey88) [08:50:56] 10SRE, 10Traffic: Cannot edit wikipedia from my work computer - https://phabricator.wikimedia.org/T356799 (10Rijikk) Sorry, I am not sure where is the footer in the report and what details should I include there. It is my first time on Phabricator... [08:52:43] (03PS1) 10Muehlenhoff: Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/1002922 (https://phabricator.wikimedia.org/T354959) [08:53:46] (03CR) 10Muehlenhoff: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1002462 (https://phabricator.wikimedia.org/T353794) (owner: 10Brouberol) [08:55:50] (03CR) 10Brouberol: [C: 03+2] idp_test: Register superset and superset-next IDP services [puppet] - 10https://gerrit.wikimedia.org/r/1002462 (https://phabricator.wikimedia.org/T353794) (owner: 10Brouberol) [09:04:33] !log brouberol@cumin1002 START - Cookbook sre.hosts.reimage for host apifeatureusage1001.eqiad.wmnet with OS bookworm [09:06:51] (03PS1) 10Muehlenhoff: Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/1002923 (https://phabricator.wikimedia.org/T349619) [09:14:21] (03CR) 10Stevemunene: [C: 03+1] "lgtm!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1002920 (owner: 10Mforns) [09:15:25] (HelmReleaseBadStatus) firing: Helm release miscweb/research-landing-page on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:15:31] (03PS1) 10STran: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002924 (https://phabricator.wikimedia.org/T356736) [09:16:14] !log brouberol@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on apifeatureusage1001.eqiad.wmnet with reason: host reimage [09:16:28] 10SRE, 10Infrastructure-Foundations: Clarify status of various directories in /srv on apt1001 - https://phabricator.wikimedia.org/T357306 (10MoritzMuehlenhoff) [09:18:02] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002924 (https://phabricator.wikimedia.org/T356736) (owner: 10STran) [09:18:40] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apifeatureusage1001.eqiad.wmnet with reason: host reimage [09:18:54] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002924 (https://phabricator.wikimedia.org/T356736) (owner: 10STran) [09:20:03] 10SRE, 10Infrastructure-Foundations: Clarify status of various directories in /srv on apt1001 - https://phabricator.wikimedia.org/T357306 (10MoritzMuehlenhoff) /srv/ipmi contains an old release of ipmitool from 2015 along with some legacy shared libs. I don't see it used anywhere anymore and we have integrated... 
[09:20:13] !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [09:20:46] !log brouberol@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host apifeatureusage1001.eqiad.wmnet with OS bookworm [09:20:58] !log stran@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [09:21:21] !log stran@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [09:21:25] 10SRE, 10Infrastructure-Foundations: Clarify status of various directories in /srv on apt1001 - https://phabricator.wikimedia.org/T357306 (10MoritzMuehlenhoff) [09:21:43] (03CR) 10Ayounsi: [C: 03+2] Set primary_ixp: true on cr2-esams [homer/public] - 10https://gerrit.wikimedia.org/r/1002919 (https://phabricator.wikimedia.org/T322630) (owner: 10Ayounsi) [09:22:03] !log delete sessionstore pod to force rescheduling [09:22:05] !log stran@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [09:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:21] !log stran@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply [09:22:21] (03Merged) 10jenkins-bot: Set primary_ixp: true on cr2-esams [homer/public] - 10https://gerrit.wikimedia.org/r/1002919 (https://phabricator.wikimedia.org/T322630) (owner: 10Ayounsi) [09:23:02] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10Volans) FYI the host is up and running with the old OS but new puppet role and puppet disabled since 26 days, it has disappeared from puppetdb (because of the puppet disabled... 
[09:23:06] !log stran@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [09:32:21] 10SRE, 10Infrastructure-Foundations: Clarify status of various directories in /srv on apt1001 - https://phabricator.wikimedia.org/T357306 (10MoritzMuehlenhoff) /srv/firmware was used in the past to push firmware updates to, before we had the cookbooks to handle this (also confirmed by Rob). All the remaining f... [09:34:18] (03PS1) 10MVernon: Move sessionstore-k8s-scheduling alert to serviceops [alerts] - 10https://gerrit.wikimedia.org/r/1002925 [09:39:03] (03CR) 10Alexandros Kosiaris: [C: 03+1] Move sessionstore-k8s-scheduling alert to serviceops [alerts] - 10https://gerrit.wikimedia.org/r/1002925 (owner: 10MVernon) [09:39:41] (03CR) 10MVernon: [C: 03+2] Move sessionstore-k8s-scheduling alert to serviceops [alerts] - 10https://gerrit.wikimedia.org/r/1002925 (owner: 10MVernon) [09:41:46] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:42:08] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:42:58] ACKNOWLEDGEMENT - Disk space on vrts1002 is CRITICAL: DISK CRITICAL - free space: /srv/otrs-data 0 MB (0% inode=60%): Jcrespo test host, ran out of space - The acknowledgement expires at: 2024-02-14 09:39:43. 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=vrts1002&var-datasource=eqiad+prometheus/ops [09:46:03] 10SRE, 10Infrastructure-Foundations: Clarify status of various directories in /srv on apt1001 - https://phabricator.wikimedia.org/T357306 (10MoritzMuehlenhoff) [09:46:15] 10SRE, 10Infrastructure-Foundations: Migrate apt repository to bookworm - https://phabricator.wikimedia.org/T331613 (10MoritzMuehlenhoff) [09:46:52] 10SRE, 10Infrastructure-Foundations: Clarify status of various directories in /srv on apt1001 - https://phabricator.wikimedia.org/T357306 (10MoritzMuehlenhoff) 05Open→03Resolved All done [09:47:13] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2194 - https://phabricator.wikimedia.org/T357015 (10ABran-WMF) 05Open→03Resolved everything's back to normal: ` $ sudo /usr/local/lib/nagios/plugins/get-raid-status-perccli communication: 0 OK | controller: 0 OK | physical_disk: 0 OK | virtual_disk: 0 OK | b... [09:57:18] !log brouberol@cumin1002 START - Cookbook sre.hosts.reimage for host apifeatureusage1001.eqiad.wmnet with OS bookworm [09:58:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host clouddb2002-dev.codfw.wmnet [10:04:44] 10SRE, 10Traffic, 10Patch-For-Review: Lower geodns TTLs from 600 (10min) to 300 (5min) - https://phabricator.wikimedia.org/T140365 (10akosiaris) Speaking with a appservers/wikikube clusters hat on, we don't see any problems with the lowering of the `dyna.wikimedia.org` from 10 minutes to 5 minutes. With an... 
[10:05:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb2002-dev.codfw.wmnet [10:06:19] !log brouberol@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on apifeatureusage1001.eqiad.wmnet with reason: host reimage [10:09:14] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apifeatureusage1001.eqiad.wmnet with reason: host reimage [10:12:13] (03CR) 10Slyngshede: [C: 03+2] Use the ManifestStaticFilesStorage in production [software/bitu] - 10https://gerrit.wikimedia.org/r/998426 (owner: 10Slyngshede) [10:13:28] (03Merged) 10jenkins-bot: Use the ManifestStaticFilesStorage in production [software/bitu] - 10https://gerrit.wikimedia.org/r/998426 (owner: 10Slyngshede) [10:15:43] 10sre-alert-triage, 10Infrastructure-Foundations, 10netops: Alert in need of triage: BGP status (instance cr1-drmrs) - https://phabricator.wikimedia.org/T357389 (10LSobanski) [10:16:35] (03CR) 10Alexandros Kosiaris: [C: 04-1] mediawiki: Extend /portals max-age from 24h to 1 year (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817409 (https://phabricator.wikimedia.org/T313881) (owner: 10Krinkle) [10:18:01] (03PS1) 10Muehlenhoff: grafana::loki: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1002929 [10:20:48] (03PS1) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002930 (https://phabricator.wikimedia.org/T351430) [10:20:57] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002930 (https://phabricator.wikimedia.org/T351430) (owner: 10Kosta Harlan) [10:21:54] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002930 (https://phabricator.wikimedia.org/T351430) (owner: 10Kosta Harlan) [10:21:58] (03CR) 10David Caro: "It was yes :), sorry for the confusion" [puppet] - 10https://gerrit.wikimedia.org/r/998659 
(https://phabricator.wikimedia.org/T351178) (owner: 10Raymond Ndibe) [10:22:05] !log kharlan@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [10:22:18] (03CR) 10Slyngshede: [C: 03+2] D:uwsgi::app Allow disabling of monitoring. [puppet] - 10https://gerrit.wikimedia.org/r/994735 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [10:22:31] !log kharlan@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [10:22:37] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [10:22:58] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [10:23:02] !log kharlan@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply [10:23:22] !log kharlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [10:23:32] !log brouberol@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host apifeatureusage1001.eqiad.wmnet with OS bookworm [10:25:47] !log brouberol@cumin1002 START - Cookbook sre.hosts.reimage for host apifeatureusage1001.eqiad.wmnet with OS bookworm [10:25:53] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1002385 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [10:26:22] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1002929 (owner: 10Muehlenhoff) [10:32:22] (03CR) 10Filippo Giunchedi: [C: 03+1] grafana::loki: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1002929 (owner: 10Muehlenhoff) [10:36:22] !log brouberol@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on apifeatureusage1001.eqiad.wmnet with reason: host reimage [10:38:57] 10SRE, 10ops-codfw: mr1-codfw down - https://phabricator.wikimedia.org/T357291 (10jcrespo) This is now fixed, right? Or do you prefer to keep it up for followup? 
[10:39:20] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apifeatureusage1001.eqiad.wmnet with reason: host reimage [10:39:49] (03CR) 10Filippo Giunchedi: multirootca: depend on cfssl when generating CRLs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1002384 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [10:41:34] !log brouberol@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host apifeatureusage1001.eqiad.wmnet with OS bookworm [10:43:49] (03PS1) 10Hashar: Let Gerrit manage light/dark theme [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/1002931 (https://phabricator.wikimedia.org/T354886) [10:43:53] (03PS1) 10Hashar: Support Gerrit 3.8 CSS styling API [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/1002932 (https://phabricator.wikimedia.org/T354886) [10:44:19] (03CR) 10CI reject: [V: 04-1] Support Gerrit 3.8 CSS styling API [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/1002932 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [10:44:49] (03CR) 10Filippo Giunchedi: puppetserver: add Puppet CA custom name and SANs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1002385 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [10:45:05] (03PS2) 10Filippo Giunchedi: multirootca: depend on cfssl when generating CRLs [puppet] - 10https://gerrit.wikimedia.org/r/1002384 (https://phabricator.wikimedia.org/T352640) [10:45:07] (03PS2) 10Filippo Giunchedi: puppetserver: add Puppet CA custom name and SANs [puppet] - 10https://gerrit.wikimedia.org/r/1002385 (https://phabricator.wikimedia.org/T352640) [10:45:09] (03PS2) 10Filippo Giunchedi: puppetdb: allow both secret() and source for site key material [puppet] - 10https://gerrit.wikimedia.org/r/1002386 (https://phabricator.wikimedia.org/T352640) [10:45:11] (03PS2) 10Filippo Giunchedi: postgresql: 
install configuration before starting the server [puppet] - 10https://gerrit.wikimedia.org/r/1002387 (https://phabricator.wikimedia.org/T352640) [10:45:13] (03PS2) 10Filippo Giunchedi: postgresql: use 'systemd reload' for pgreload [puppet] - 10https://gerrit.wikimedia.org/r/1002388 (https://phabricator.wikimedia.org/T352640) [10:46:38] (03PS2) 10Hashar: Support Gerrit 3.8 CSS styling API [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/1002932 (https://phabricator.wikimedia.org/T354886) [10:49:28] !log brouberol@cumin1002 START - Cookbook sre.hosts.reimage for host apifeatureusage1001.eqiad.wmnet with OS bullseye [10:52:31] (03CR) 10Filippo Giunchedi: puppetdb: allow both secret() and source for site key material (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1002386 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [10:52:42] (03PS3) 10Filippo Giunchedi: puppetdb: allow both secret() and source for site key material [puppet] - 10https://gerrit.wikimedia.org/r/1002386 (https://phabricator.wikimedia.org/T352640) [10:52:44] (03PS3) 10Filippo Giunchedi: postgresql: install configuration before starting the server [puppet] - 10https://gerrit.wikimedia.org/r/1002387 (https://phabricator.wikimedia.org/T352640) [10:52:46] (03PS3) 10Filippo Giunchedi: postgresql: use 'systemd reload' for pgreload [puppet] - 10https://gerrit.wikimedia.org/r/1002388 (https://phabricator.wikimedia.org/T352640) [10:54:02] (03CR) 10CI reject: [V: 04-1] puppetdb: allow both secret() and source for site key material [puppet] - 10https://gerrit.wikimedia.org/r/1002386 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [10:56:25] (03PS1) 10Brouberol: superset: add extra FQDNs for the ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002934 (https://phabricator.wikimedia.org/T356482) [10:57:23] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2282.codfw.wmnet with OS bullseye 
[10:57:31] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240213T1100) [11:01:04] !log cgoubert@cumin2002 START - Cookbook sre.hosts.remove-downtime for mw2388.codfw.wmnet [11:01:06] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2388.codfw.wmnet [11:01:27] !log brouberol@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on apifeatureusage1001.eqiad.wmnet with reason: host reimage [11:01:54] (03PS1) 10Stevemunene: modify airflow clean logs script [puppet] - 10https://gerrit.wikimedia.org/r/1002935 (https://phabricator.wikimedia.org/T339015) [11:04:23] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apifeatureusage1001.eqiad.wmnet with reason: host reimage [11:05:19] (03CR) 10Gmodena: [C: 03+2] Eventstreams: 12 feb 2024 redaction list update [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002597 (owner: 10Htriedman) [11:06:21] (03Merged) 10jenkins-bot: Eventstreams: 12 feb 2024 redaction list update [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002597 (owner: 10Htriedman) [11:10:21] !log gmodena@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [11:10:21] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1002384 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [11:10:33] !log gmodena@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [11:11:50] !log gmodena@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [11:12:23] !log gmodena@deploy2002 helmfile [eqiad] DONE 
helmfile.d/services/eventstreams: apply [11:12:44] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Create parsoid mediawiki deployment - https://phabricator.wikimedia.org/T357392 (10akosiaris) [11:13:01] !log hnowlan@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mw2282.codfw.wmnet with OS bullseye [11:13:08] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**) - Removed from Pup... [11:14:08] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2282.codfw.wmnet with OS bullseye [11:14:15] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye [11:14:28] !log gmodena@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [11:14:56] !log gmodena@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [11:15:36] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1002385 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [11:20:03] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apifeatureusage1001.eqiad.wmnet with OS bullseye [11:20:20] (03CR) 10Krinkle: mediawiki: Extend /portals max-age from 24h to 1 year (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817409 (https://phabricator.wikimedia.org/T313881) (owner: 10Krinkle) [11:21:27] !log brouberol@cumin1002 START - Cookbook sre.hosts.reimage for host apifeatureusage2001.codfw.wmnet with OS bullseye [11:21:59] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: 
Allow setting deployment strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002399 (owner: 10Clément Goubert) [11:23:34] (03Merged) 10jenkins-bot: mediawiki: Allow setting deployment strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002399 (owner: 10Clément Goubert) [11:24:15] !log Change default maxUnavailable for mw-on-k8s to 10% [11:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:17] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [11:24:29] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:24:31] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [11:24:32] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:25:00] (03PS1) 10Hashar: gerrit: remove html based commentlinks [puppet] - 10https://gerrit.wikimedia.org/r/1002938 (https://phabricator.wikimedia.org/T354886) [11:26:27] (03PS4) 10Filippo Giunchedi: puppetdb: allow both secret() and source for site key material [puppet] - 10https://gerrit.wikimedia.org/r/1002386 (https://phabricator.wikimedia.org/T352640) [11:26:29] (03PS4) 10Filippo Giunchedi: postgresql: install configuration before starting the server [puppet] - 10https://gerrit.wikimedia.org/r/1002387 (https://phabricator.wikimedia.org/T352640) [11:26:31] (03PS4) 10Filippo Giunchedi: postgresql: use 'systemd reload' for pgreload [puppet] - 10https://gerrit.wikimedia.org/r/1002388 (https://phabricator.wikimedia.org/T352640) [11:26:37] (03PS1) 10Majavah: Failover dumps to clouddumps1002 [dns] - 10https://gerrit.wikimedia.org/r/1002940 (https://phabricator.wikimedia.org/T321313) [11:27:00] (03CR) 10Hashar: "The commentlink entries for PipelineLib comes from 2019 https://gerrit.wikimedia.org/r/c/operations/puppet/+/490640 which predate our upgr" 
[puppet] - 10https://gerrit.wikimedia.org/r/1002938 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [11:27:20] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10hnowlan) >>! In T355333#9529467, @Jhancock.wm wrote: > I reseated the NIC and it connected. when I rebooted it went down again and didn't come up. swapped it out and rebooted... [11:27:26] !log hnowlan@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mw2282.codfw.wmnet with OS bullseye [11:28:02] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**)... [11:28:05] (03PS1) 10Majavah: hieradata: Failover dumps to clouddumps1002 [puppet] - 10https://gerrit.wikimedia.org/r/1002941 (https://phabricator.wikimedia.org/T321313) [11:28:29] (03PS2) 10Majavah: hieradata: Failover dumps web to clouddumps1002 [puppet] - 10https://gerrit.wikimedia.org/r/1002941 (https://phabricator.wikimedia.org/T321313) [11:31:30] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:31:32] !log brouberol@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on apifeatureusage2001.codfw.wmnet with reason: host reimage [11:32:04] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:32:48] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:33:11] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:34:15] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apifeatureusage2001.codfw.wmnet with reason: host reimage [11:34:21] !log cgoubert@deploy2002 Started scap: Change default 
maxUnavailable for mw-on-k8s to 10% [11:37:39] !log cgoubert@deploy2002 Finished scap: Change default maxUnavailable for mw-on-k8s to 10% (duration: 03m 17s) [11:45:17] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: T350458 [11:45:22] T350458: Decommission db11[26-49] - https://phabricator.wikimedia.org/T350458 [11:45:31] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: T350458 [11:46:41] (03CR) 10Clément Goubert: [C: 03+1] service mesh: Listen unconditionally on IPv6/IPv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/999867 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [11:47:33] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor, and 3 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914 (10ovasileva) Another step in the process here would be to look at how the change would visually affect current layouts on enwiki and across wikis, as wel... 
[11:47:59] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 1 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:48:59] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 61 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:50:05] (03CR) 10Clément Goubert: [C: 03+1] termbox: Bump module dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/999882 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [11:50:41] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apifeatureusage2001.codfw.wmnet with OS bullseye [11:56:50] (03PS2) 10Clément Goubert: prometheus-php-fpm-exporter, prometheus-apache-exporter: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/997896 (https://phabricator.wikimedia.org/T283861) [11:59:44] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance [12:00:10] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance [12:00:11] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:00:28] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:00:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T352010)', diff saved to https://phabricator.wikimedia.org/P56689 and previous config saved to /var/cache/conftool/dbconfig/20240213-120035-ladsgroup.json [12:01:07] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 
[12:03:53] (03CR) 10Muehlenhoff: puppetdb: allow both secret() and source for site key material (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1002386 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [12:08:19] (03CR) 10Slyngshede: "Not sure if this is the best solution. I suppose we could remove the labels, but we also want to be able to route the alerts correctly." [alerts] - 10https://gerrit.wikimedia.org/r/999802 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [12:09:47] (SystemdUnitFailed) firing: (3) dump_cloud_ip_ranges.service on puppetmaster2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:11:52] 10SRE: Page: cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80% - https://phabricator.wikimedia.org/T357198 (10hnowlan) 05Open→03Resolved a:03hnowlan >>! In T357198#9534354, @Eevans wrote: >>>! In T357198#9533155, @hnowlan wrote: >> Could `Too many eqiad mediawiki originals uploads` be a... 
[12:12:50] volans: I restarted the dump_cloud_ip_ranges.service [12:13:16] I am concerned about the increase of alert spam from systemd (including systemd timers) [12:13:50] (SystemdUnitFailed) firing: (3) dump_cloud_ip_ranges.service on puppetmaster2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:16:54] (03CR) 10Hnowlan: [C: 03+2] kubernetes: move 5 mw hosts to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/998996 (https://phabricator.wikimedia.org/T351074) (owner: 10Hnowlan) [12:17:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T352010)', diff saved to https://phabricator.wikimedia.org/P56690 and previous config saved to /var/cache/conftool/dbconfig/20240213-121736-ladsgroup.json [12:17:41] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [12:19:53] (03PS2) 10Samtar: IS: Enable edit recovery on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992203 (https://phabricator.wikimedia.org/T355548) [12:21:13] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 1 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:22:57] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T357400 (10phaultfinder) [12:23:31] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:26:35] jynus: ack, why was it failing? is it even running there? [12:28:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1002941 (https://phabricator.wikimedia.org/T321313) (owner: 10Majavah) [12:28:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." [dns] - 10https://gerrit.wikimedia.org/r/1002940 (https://phabricator.wikimedia.org/T321313) (owner: 10Majavah) [12:28:50] (03CR) 10Majavah: [C: 03+2] Failover dumps to clouddumps1002 [dns] - 10https://gerrit.wikimedia.org/r/1002940 (https://phabricator.wikimedia.org/T321313) (owner: 10Majavah) [12:29:15] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10Scap, 10serviceops: Scap should check errors coming from mw-on-k8s canaries during deployments - https://phabricator.wikimedia.org/T357402 (10Clement_Goubert) [12:29:25] (03CR) 10Majavah: [C: 03+2] hieradata: Failover dumps web to clouddumps1002 [puppet] - 10https://gerrit.wikimedia.org/r/1002941 (https://phabricator.wikimedia.org/T321313) (owner: 10Majavah) [12:32:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P56691 and previous config saved to /var/cache/conftool/dbconfig/20240213-123242-ladsgroup.json [12:36:50] (03PS1) 10Hnowlan: kubernetes: fix typo in host glob [puppet] - 10https://gerrit.wikimedia.org/r/1002969 [12:39:15] (03CR) 10Kamila Součková: [C: 03+1] kubernetes: fix typo in host glob [puppet] - 10https://gerrit.wikimedia.org/r/1002969 (owner: 10Hnowlan) [12:39:42] (03CR) 10Hnowlan: [C: 03+2] kubernetes: fix typo in host glob [puppet] - 10https://gerrit.wikimedia.org/r/1002969 (owner: 10Hnowlan) [12:39:47] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host restbase1035.eqiad.wmnet with OS bullseye [12:39:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host restbase1035.eqiad.wmnet with OS bullseye [12:42:39] (03CR) 10JMeybohm: Add helmfile for 
running MediaWiki one-off jobs. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/988850 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [12:42:57] (03CR) 10Paladox: [C: 03+1] "We use checkers to display this now so +1." [puppet] - 10https://gerrit.wikimedia.org/r/1002938 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [12:45:25] (SystemdUnitFailed) firing: ferm.service on kubernetes1035:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:45:56] (03CR) 10Paladox: [C: 03+1] Let Gerrit manage light/dark theme [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/1002931 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [12:46:06] (03PS1) 10Clément Goubert: mw-on-k8s: Raise the number of canary replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002973 (https://phabricator.wikimedia.org/T357402) [12:47:00] (03CR) 10Paladox: [C: 03+1] Support Gerrit 3.8 CSS styling API [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/1002932 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [12:47:46] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1431.eqiad.wmnet with OS bullseye [12:47:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P56692 and previous config saved to /var/cache/conftool/dbconfig/20240213-124748-ladsgroup.json [12:48:12] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1430.eqiad.wmnet with OS bullseye [12:48:22] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1434.eqiad.wmnet with OS bullseye [12:48:23] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1453.eqiad.wmnet with OS bullseye [12:48:25] !log 
hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1385.eqiad.wmnet with OS bullseye [12:49:24] (03CR) 10JMeybohm: [C: 03+1] mediawiki: Support one-off jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/988849 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [12:50:26] (SystemdUnitFailed) resolved: ferm.service on kubernetes1035:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:52:08] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor, and 4 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914 (10RHo) [12:53:00] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor, and 4 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914 (10RHo) @DTorsani-WMF and @JScherer-WMF - per @Ladsgroup suggestions, wondering if you both can offer any ux recommendations for recommended thumbnail siz... [12:54:45] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1035.eqiad.wmnet with reason: host reimage [12:54:49] !log restarting envoy on baremetal mediawiki appservers [12:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:16] !log zabe@mwmaint2002:/tmp/uploads$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user="Jeff G." . # T357403 [12:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:21] T357403: Server side upload for Jeff G. - https://phabricator.wikimedia.org/T357403 [12:56:23] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1006.eqiad.wmnet [12:57:31] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1035.eqiad.wmnet with reason: host reimage [12:58:06] (03CR) 10MVernon: [C: 03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1002663 (https://phabricator.wikimedia.org/T356828) (owner: 10Eevans) [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240213T1300) [13:00:44] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1431.eqiad.wmnet with reason: host reimage [13:01:11] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1434.eqiad.wmnet with reason: host reimage [13:01:26] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1430.eqiad.wmnet with reason: host reimage [13:01:41] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1453.eqiad.wmnet with reason: host reimage [13:02:05] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1385.eqiad.wmnet with reason: host reimage [13:02:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T352010)', diff saved to https://phabricator.wikimedia.org/P56693 and previous config saved to /var/cache/conftool/dbconfig/20240213-130255-ladsgroup.json [13:02:57] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [13:03:11] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [13:03:12] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:03:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2105 (T352010)', diff saved to https://phabricator.wikimedia.org/P56694 and previous config saved to /var/cache/conftool/dbconfig/20240213-130316-ladsgroup.json [13:03:38] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1431.eqiad.wmnet with reason: host reimage [13:04:11] !log taavi@cumin1002 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cloudcontrol1006.eqiad.wmnet [13:06:01] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1385.eqiad.wmnet with reason: host reimage [13:07:06] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:08:18] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1434.eqiad.wmnet with reason: host reimage [13:08:46] (03PS1) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002981 (https://phabricator.wikimedia.org/T351430) [13:08:52] 10SRE, 10ops-codfw: PowerSupplyFailure - mw2389 - https://phabricator.wikimedia.org/T357377 (10fgiunchedi) [13:08:55] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:08:57] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002981 (https://phabricator.wikimedia.org/T351430) (owner: 10Kosta Harlan) [13:09:19] 10SRE, 10ops-codfw: PowerSupplyFailure - mw2389 - https://phabricator.wikimedia.org/T357377 (10fgiunchedi) @Peachey88 thank you for your help on this, however please don't retitle @phaultfinder tasks as the title is used as a search key and a new task will be created (T357400) [13:09:23] 10SRE, 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T357400 (10fgiunchedi) [13:09:50] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): volatile: We need to configure the volatile endpoint on puppetserveres - 
https://phabricator.wikimedia.org/T341056 (10MoritzMuehlenhoff) During the preparation of the apt server migration I noticed th... [13:09:53] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002981 (https://phabricator.wikimedia.org/T351430) (owner: 10Kosta Harlan) [13:09:54] 10SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T357373 (10fgiunchedi) [13:10:27] !log kharlan@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [13:10:42] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1453.eqiad.wmnet with reason: host reimage [13:10:52] !log kharlan@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [13:10:55] !log kharlan@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply [13:11:15] !log kharlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [13:11:24] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [13:11:46] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [13:12:09] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1007.eqiad.wmnet [13:13:58] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1430.eqiad.wmnet with reason: host reimage [13:15:25] (HelmReleaseBadStatus) firing: Helm release miscweb/research-landing-page on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:21:08] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1431.eqiad.wmnet with OS bullseye [13:22:05] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host cloudcontrol1007.eqiad.wmnet [13:23:50] (SystemdUnitFailed) firing: (2) dump_cloud_ip_ranges.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:24:16] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1385.eqiad.wmnet with OS bullseye [13:25:40] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 1 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:26:27] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1434.eqiad.wmnet with OS bullseye [13:26:44] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 61 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:26:59] (03CR) 10Hashar: [C: 03+2] Let Gerrit manage light/dark theme [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/1002931 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [13:27:33] (03Merged) 10jenkins-bot: Let Gerrit manage light/dark theme [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/1002931 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [13:28:04] !log hashar@deploy2002 Started deploy [gerrit/gerrit@b02c97e]: Let Gerrit manage light/dark theme [13:28:12] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@b02c97e]: Let Gerrit manage light/dark theme (duration: 00m 07s) [13:28:39] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1360/console" [puppet] - 10https://gerrit.wikimedia.org/r/1002384 (https://phabricator.wikimedia.org/T352640) (owner: 
10Filippo Giunchedi) [13:28:51] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1453.eqiad.wmnet with OS bullseye [13:29:40] (03PS1) 10Majavah: team-wmcs: haproxy: take backup servers in account in calculations [alerts] - 10https://gerrit.wikimedia.org/r/1002983 (https://phabricator.wikimedia.org/T357406) [13:30:15] 10SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T357373 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact [13:30:46] (03CR) 10CI reject: [V: 04-1] team-wmcs: haproxy: take backup servers in account in calculations [alerts] - 10https://gerrit.wikimedia.org/r/1002983 (https://phabricator.wikimedia.org/T357406) (owner: 10Majavah) [13:31:19] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1361/co" [puppet] - 10https://gerrit.wikimedia.org/r/1002384 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [13:31:51] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1430.eqiad.wmnet with OS bullseye [13:31:52] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] multirootca: depend on cfssl when generating CRLs [puppet] - 10https://gerrit.wikimedia.org/r/1002384 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [13:31:54] (03PS1) 10Slyngshede: C:external_cloud_vendors add owner to systemd timer. 
[puppet] - 10https://gerrit.wikimedia.org/r/1002985 [13:32:37] (03CR) 10Hashar: [C: 03+2] Support Gerrit 3.8 CSS styling API [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/1002932 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [13:33:10] (03Merged) 10jenkins-bot: Support Gerrit 3.8 CSS styling API [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/1002932 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [13:33:14] (03CR) 10CI reject: [V: 04-1] C:external_cloud_vendors add owner to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/1002985 (owner: 10Slyngshede) [13:33:48] !log hashar@deploy2002 Started deploy [gerrit/gerrit@7dd9a27]: Support Gerrit 3.8 CSS styling API - T354886 [13:33:53] T354886: Upgrade to Gerrit 3.8 - https://phabricator.wikimedia.org/T354886 [13:33:55] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@7dd9a27]: Support Gerrit 3.8 CSS styling API - T354886 (duration: 00m 07s) [13:33:58] (03PS2) 10Majavah: team-wmcs: haproxy: take backup servers in account in calculations [alerts] - 10https://gerrit.wikimedia.org/r/1002983 (https://phabricator.wikimedia.org/T357406) [13:34:47] (03CR) 10Filippo Giunchedi: [C: 03+2] postgresql: use 'systemd reload' for pgreload [puppet] - 10https://gerrit.wikimedia.org/r/1002388 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [13:34:50] (03PS2) 10Slyngshede: C:external_cloud_vendors add owner to systemd timer. 
[puppet] - 10https://gerrit.wikimedia.org/r/1002985 [13:34:57] (03PS5) 10Filippo Giunchedi: postgresql: use 'systemd reload' for pgreload [puppet] - 10https://gerrit.wikimedia.org/r/1002388 (https://phabricator.wikimedia.org/T352640) [13:37:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T352010)', diff saved to https://phabricator.wikimedia.org/P56695 and previous config saved to /var/cache/conftool/dbconfig/20240213-133709-ladsgroup.json [13:37:14] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:38:01] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] postgresql: use 'systemd reload' for pgreload [puppet] - 10https://gerrit.wikimedia.org/r/1002388 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [13:38:30] (03CR) 10Hashar: [C: 03+2] "I have tested it and it works! :)" [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/1002932 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [13:38:53] (03PS3) 10Filippo Giunchedi: puppetserver: add Puppet CA custom name and SANs [puppet] - 10https://gerrit.wikimedia.org/r/1002385 (https://phabricator.wikimedia.org/T352640) [13:38:55] (03PS5) 10Filippo Giunchedi: puppetdb: allow both secret() and source for site key material [puppet] - 10https://gerrit.wikimedia.org/r/1002386 (https://phabricator.wikimedia.org/T352640) [13:38:57] (03PS5) 10Filippo Giunchedi: postgresql: install configuration before starting the server [puppet] - 10https://gerrit.wikimedia.org/r/1002387 (https://phabricator.wikimedia.org/T352640) [13:39:02] (03CR) 10Filippo Giunchedi: puppetdb: allow both secret() and source for site key material (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1002386 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [13:40:04] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1005.eqiad.wmnet [13:40:10] 
(HelmReleaseBadStatus) resolved: Helm release miscweb/research-landing-page on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:40:38] !log cmooney@cumin1002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet [13:40:51] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1362/co" [puppet] - 10https://gerrit.wikimedia.org/r/1002385 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [13:42:29] (03CR) 10Hashar: [C: 03+2] Gerrit 3.8 no more set redundant real_author [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/999928 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [13:42:44] !log cmooney@cumin1002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet [13:42:47] (03PS3) 10Hashar: wm-checks-api: Gerrit 3.8 no more sets redundant real_author [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/999928 (https://phabricator.wikimedia.org/T354886) [13:43:23] (03CR) 10Hashar: [C: 03+2] wm-checks-api: Gerrit 3.8 no more sets redundant real_author [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/999928 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [13:44:01] (03PS6) 10Filippo Giunchedi: postgresql: install configuration before starting the server [puppet] - 10https://gerrit.wikimedia.org/r/1002387 (https://phabricator.wikimedia.org/T352640) [13:44:02] (03PS6) 10Filippo Giunchedi: puppetdb: allow both secret() and source for site key material [puppet] - 
10https://gerrit.wikimedia.org/r/1002386 (https://phabricator.wikimedia.org/T352640) [13:44:04] (03PS4) 10Filippo Giunchedi: puppetserver: add Puppet CA custom name and SANs [puppet] - 10https://gerrit.wikimedia.org/r/1002385 (https://phabricator.wikimedia.org/T352640) [13:44:06] (03Merged) 10jenkins-bot: wm-checks-api: Gerrit 3.8 no more sets redundant real_author [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/999928 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [13:44:10] (03CR) 10CI reject: [V: 04-1] puppetdb: allow both secret() and source for site key material [puppet] - 10https://gerrit.wikimedia.org/r/1002386 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [13:44:18] (03CR) 10CI reject: [V: 04-1] postgresql: install configuration before starting the server [puppet] - 10https://gerrit.wikimedia.org/r/1002387 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [13:45:11] !log hashar@deploy2002 Started deploy [gerrit/gerrit@737c475]: wm-checks-api: Gerrit 3.8 no more sets redundant real_author [13:45:17] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@737c475]: wm-checks-api: Gerrit 3.8 no more sets redundant real_author (duration: 00m 07s) [13:45:40] (03PS1) 10Slyngshede: Add logging to default uwsgi [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1002989 [13:47:05] (03CR) 10Filippo Giunchedi: [C: 03+1] C:external_cloud_vendors add owner to systemd timer. 
[puppet] - 10https://gerrit.wikimedia.org/r/1002985 (owner: 10Slyngshede) [13:47:47] (03PS2) 10Slyngshede: Add logging to default uwsgi [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1002989 [13:48:12] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol1005.eqiad.wmnet [13:48:30] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1002387 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [13:49:17] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] puppetserver: add Puppet CA custom name and SANs [puppet] - 10https://gerrit.wikimedia.org/r/1002385 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [13:49:23] (03PS5) 10Filippo Giunchedi: puppetserver: add Puppet CA custom name and SANs [puppet] - 10https://gerrit.wikimedia.org/r/1002385 (https://phabricator.wikimedia.org/T352640) [13:49:51] (03CR) 10Muehlenhoff: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1002386 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [13:49:58] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] puppetserver: add Puppet CA custom name and SANs [puppet] - 10https://gerrit.wikimedia.org/r/1002385 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [13:50:45] (03CR) 10Volans: [C: 04-1] "In order to simplify replies I've put my main comment with my main concerns at the top of the cookbook file." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/976709 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [13:52:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P56696 and previous config saved to /var/cache/conftool/dbconfig/20240213-135215-ladsgroup.json [13:56:18] (03PS7) 10Filippo Giunchedi: postgresql: install configuration before starting the server [puppet] - 10https://gerrit.wikimedia.org/r/1002387 (https://phabricator.wikimedia.org/T352640) [13:56:20] (03PS7) 10Filippo Giunchedi: puppetdb: allow both secret() and source for site key material [puppet] - 10https://gerrit.wikimedia.org/r/1002386 (https://phabricator.wikimedia.org/T352640) [13:56:22] (03CR) 10Filippo Giunchedi: puppetdb: allow both secret() and source for site key material (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1002386 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [13:57:54] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1002386 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [13:59:40] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 1 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:59:45] (03PS4) 10Jaime Nuche: support Zuul v2 on bullseye contint hosts [puppet] - 10https://gerrit.wikimedia.org/r/1002461 [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240213T1400). [14:00:05] No Gerrit patches in the queue for this window AFAICS. 
[14:00:50] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 61 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:02:23] (03CR) 10Hashar: [C: 03+1] support Zuul v2 on bullseye contint hosts [puppet] - 10https://gerrit.wikimedia.org/r/1002461 (owner: 10Jaime Nuche) [14:03:55] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:06:51] (03PS3) 10Slyngshede: Add logging to default uwsgi [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1002989 [14:07:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P56697 and previous config saved to /var/cache/conftool/dbconfig/20240213-140722-ladsgroup.json [14:07:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:08:12] !log restarting envoy on baremetal mediawiki api servers [14:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:00] (03CR) 10Filippo Giunchedi: "Yeah I'm also not sure why not all PKI certs since those are the intermediates, meaning if they expire there's sth funky going on with cfs" [alerts] - 10https://gerrit.wikimedia.org/r/999802 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [14:11:26] !log import etherpad-lite 1.9.7-2 on apt host into bookworm-wikimedia - T316421 [14:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:42] T316421: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 
[14:13:31] jouncebot: next [14:13:31] In 1 hour(s) and 46 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240213T1600) [14:13:41] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1363/co" [puppet] - 10https://gerrit.wikimedia.org/r/1002386 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [14:17:56] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 1 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:18:24] (03CR) 10Eevans: [C: 03+2] sessionstore: decommission sessionstore200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/1002663 (https://phabricator.wikimedia.org/T356828) (owner: 10Eevans) [14:18:26] !log bounce puppetserver on puppetserver1003 to test noop config change - T352640 [14:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:30] T352640: Fix Pontoon to bootstrap from Bookworm and Puppetserver - https://phabricator.wikimedia.org/T352640 [14:19:04] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 61 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:20:05] !log eevans@cumin1002 START - Cookbook sre.hosts.decommission for hosts sessionstore[2001-2003].codfw.wmnet [14:21:22] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services, 10Patch-For-Review: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 (10Jelto) `etherpad-lite` `1.9.7` is available now also for bookworm. Thanks again for @MoritzMuehlenhoff and @akosiaris for troubleshooting issues... 
[14:21:34] (03PS1) 10Mhorsey: Load WikimediaCampaignEvents if CampaignEvents is loaded [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1002993 (https://phabricator.wikimedia.org/T347909) [14:22:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T352010)', diff saved to https://phabricator.wikimedia.org/P56698 and previous config saved to /var/cache/conftool/dbconfig/20240213-142228-ladsgroup.json [14:22:31] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [14:22:35] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:22:44] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [14:22:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2109 (T352010)', diff saved to https://phabricator.wikimedia.org/P56699 and previous config saved to /var/cache/conftool/dbconfig/20240213-142250-ladsgroup.json [14:23:45] (03PS1) 10Mhorsey: Remove explicit load of WikimediaCampaignevents extension from beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1002994 (https://phabricator.wikimedia.org/T347909) [14:25:06] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1364/co" [puppet] - 10https://gerrit.wikimedia.org/r/1002386 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [14:26:40] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2061*,elastic2062*,elastic2089* for switch maintenance - bking@cumin2002 - T355863 [14:26:44] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic2061*,elastic2062*,elastic2089* for switch maintenance - bking@cumin2002 - T355863 [14:26:45] T355863: Migrate servers 
in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 [14:27:40] Hey folks! We would like to deploy a new extension (T347909) this week, and according to the documentation, we need a dedicated deployment window for that. Is there any deployer willing to help, and could you please give me your availability for the week? TIA! [14:27:41] T347909: Deploy the WikimediaCampaignEvents extension to production - https://phabricator.wikimedia.org/T347909 [14:28:49] (03PS1) 10Eevans: site.pp: remove sessionstore200[1-3] (decommissioned) [puppet] - 10https://gerrit.wikimedia.org/r/1002996 (https://phabricator.wikimedia.org/T356828) [14:30:34] !log eevans@cumin1002 START - Cookbook sre.dns.netbox [14:33:26] !log imported openssl 1.1.1w-0+deb11u1+wmf1 to component/haproxy26 T352744 [14:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:31] T352744: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 [14:35:17] !log eevans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sessionstore[2001-2003].codfw.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1002" [14:36:25] !log eevans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sessionstore[2001-2003].codfw.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1002" [14:36:25] !log eevans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:36:25] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts sessionstore[2001-2003].codfw.wmnet [14:36:51] (03CR) 10Eevans: [C: 03+2] site.pp: remove sessionstore200[1-3] (decommissioned) [puppet] - 10https://gerrit.wikimedia.org/r/1002996 (https://phabricator.wikimedia.org/T356828) (owner: 10Eevans) [14:37:27] 10SRE, 
10ops-codfw: mr1-codfw down - https://phabricator.wikimedia.org/T357291 (10ayounsi) 05Open→03Resolved a:03ayounsi [14:38:22] (03PS1) 10Hashar: python-build: default to run as nobody from /deploy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1002997 (https://phabricator.wikimedia.org/T259611) [14:38:49] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:07] 10ops-codfw, 10Cassandra, 10decommission-hardware: Decommission sessionstore200[1-3] - https://phabricator.wikimedia.org/T357356 (10Eevans) [14:42:11] (03PS1) 10Effie Mouzeli: services_proxy: set keepalive for ipoid listener [puppet] - 10https://gerrit.wikimedia.org/r/1003000 (https://phabricator.wikimedia.org/T356766) [14:43:15] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 1 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:43:35] (03CR) 10Effie Mouzeli: "]" [puppet] - 10https://gerrit.wikimedia.org/r/1003000 (https://phabricator.wikimedia.org/T356766) (owner: 10Effie Mouzeli) [14:44:15] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 61 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:45:14] (03PS1) 10Effie Mouzeli: services_proxy: set keepalive for ipoid listener [puppet] - 10https://gerrit.wikimedia.org/r/1003001 (https://phabricator.wikimedia.org/T356766) [14:46:17] !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: dc=codfw,name=cp202(7|8).codfw.wmnet [14:47:58] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on cp[2027-2028].codfw.wmnet with reason: T355863 [14:48:02] T355863: Migrate servers in codfw rack A4 from 
asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 [14:48:16] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp[2027-2028].codfw.wmnet with reason: T355863 [14:52:26] (03PS2) 10Hashar: python-build: default to run as nobody from /deploy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1002997 (https://phabricator.wikimedia.org/T259611) [14:52:28] (03PS1) 10Hashar: python-build: add make and virtualenv [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1003002 (https://phabricator.wikimedia.org/T259611) [14:54:44] (03CR) 10Hashar: "I'd like to have some better default than running as `root` and having the working directory set to `/`, even though they are not going to" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1002997 (https://phabricator.wikimedia.org/T259611) (owner: 10Hashar) [14:55:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T352010)', diff saved to https://phabricator.wikimedia.org/P56700 and previous config saved to /var/cache/conftool/dbconfig/20240213-145541-ladsgroup.json [14:55:46] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:56:00] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1002461 (owner: 10Jaime Nuche) [14:56:42] (03CR) 10Hashar: "The deploy repositories (such as `integration/zuul/deploy`) have a makefile to build the wheels and another one used when deploying on the" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1003002 (https://phabricator.wikimedia.org/T259611) (owner: 10Hashar) [14:58:49] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable 
[15:01:39] (03CR) 10BCornwall: [C: 03+2] ncmonitor: Add partman config [puppet] - 10https://gerrit.wikimedia.org/r/1002674 (https://phabricator.wikimedia.org/T356710) (owner: 10BCornwall) [15:03:17] (03PS1) 10Muehlenhoff: puppetserver: Also install the tool to update netboot images on puppet servers [puppet] - 10https://gerrit.wikimedia.org/r/1003004 (https://phabricator.wikimedia.org/T341056) [15:03:44] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [15:06:03] !log ayounsi@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2005.codfw.wmnet with OS bookworm [15:08:23] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [15:10:32] (03CR) 10Stevemunene: [C: 03+2] edit*-analytics: update mediawiki_history snapshot to 2024_01 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002920 (owner: 10Mforns) [15:10:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P56701 and previous config saved to /var/cache/conftool/dbconfig/20240213-151047-ladsgroup.json [15:11:25] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db1149.eqiad.wmnet [15:11:39] (03Merged) 10jenkins-bot: edit*-analytics: update mediawiki_history snapshot to 2024_01 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002920 (owner: 10Mforns) [15:13:15] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415 (10calbon) p:05Medium→03High [15:13:32] !log stevemunene@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [15:13:52] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10Jhancock.wm) idk if this would help, but can we run the provisioning script with the 
--no-dhcp and --no-user tags. to catch any bios settings that might have changed? [15:14:11] !log stevemunene@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [15:14:31] !log running `homer 'cr*eqiad*' commit 'T351074' [15:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:38] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [15:14:47] !log stevemunene@deploy2002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [15:15:18] !log stevemunene@deploy2002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply [15:15:57] !log stevemunene@deploy2002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [15:16:21] (03PS1) 10Jelto: Release 0.7 prometheus-etherpad-exporter [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/1003007 (https://phabricator.wikimedia.org/T316421) [15:16:23] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [15:16:27] !log stevemunene@deploy2002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [15:17:49] !log stevemunene@deploy2002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [15:18:02] (03CR) 10Jaime Nuche: [C: 03+1] "LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1003002 (https://phabricator.wikimedia.org/T259611) (owner: 10Hashar) [15:18:22] !log stevemunene@deploy2002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [15:19:02] !log stevemunene@deploy2002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply [15:19:25] !log stevemunene@deploy2002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply [15:19:47] (03CR) 10Alexandros Kosiaris: [C: 03+1] Release 0.7 prometheus-etherpad-exporter [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/1003007 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [15:20:01] !log 
stevemunene@deploy2002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply [15:20:13] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1149.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [15:20:24] !log stevemunene@deploy2002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [15:21:30] !log hnowlan@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(mw1431.eqiad.wmnet|mw1430.eqiad.wmnet|mw1434.eqiad.wmnet|mw1453.eqiad.wmnet|mw1385.eqiad.wmnet),cluster=kubernetes,service=kubesvc [15:21:32] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] puppetdb: allow both secret() and source for site key material [puppet] - 10https://gerrit.wikimedia.org/r/1002386 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [15:21:37] (03PS8) 10Filippo Giunchedi: puppetdb: allow both secret() and source for site key material [puppet] - 10https://gerrit.wikimedia.org/r/1002386 (https://phabricator.wikimedia.org/T352640) [15:22:16] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] puppetdb: allow both secret() and source for site key material [puppet] - 10https://gerrit.wikimedia.org/r/1002386 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [15:24:35] (03CR) 10Hnowlan: [C: 03+1] mw-on-k8s: Raise the number of canary replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002973 (https://phabricator.wikimedia.org/T357402) (owner: 10Clément Goubert) [15:25:37] (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Raise the number of canary replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002973 (https://phabricator.wikimedia.org/T357402) (owner: 10Clément Goubert) [15:25:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P56702 and previous config saved to 
/var/cache/conftool/dbconfig/20240213-152554-ladsgroup.json [15:26:41] (03Merged) 10jenkins-bot: mw-on-k8s: Raise the number of canary replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002973 (https://phabricator.wikimedia.org/T357402) (owner: 10Clément Goubert) [15:26:45] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1365/co" [puppet] - 10https://gerrit.wikimedia.org/r/1002387 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [15:26:56] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncmonitor1001.eqiad.wmnet with OS bookworm [15:27:06] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10vm-requests, 10Patch-For-Review: eqiad: 1 VM request for ncmonitor - https://phabricator.wikimedia.org/T356710 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncmonitor1001.eqiad.wmnet with OS bookworm [15:27:32] (03PS5) 10Alexandros Kosiaris: service mesh: Listen unconditionally on IPv6/IPv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/999867 (https://phabricator.wikimedia.org/T255568) [15:28:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1149.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [15:28:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:28:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1149.eqiad.wmnet [15:29:14] !log cgoubert@deploy2002 Started scap: mw-on-k8s: Raise the number of canary replicas - T357402 [15:29:17] T357402: Scap should check errors coming from mw-on-k8s canaries during deployments - https://phabricator.wikimedia.org/T357402 [15:29:39] (CirrusSearchNodeIndexingNotIncreasing) 
firing: Elasticsearch instance elastic2061-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [15:30:33] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[2061-2062,2089].codfw.wmnet with reason: T355863 [15:30:40] T355863: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 [15:30:51] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[2061-2062,2089].codfw.wmnet with reason: T355863 [15:32:07] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:32:12] !log cgoubert@deploy2002 Finished scap: mw-on-k8s: Raise the number of canary replicas - T357402 (duration: 02m 58s) [15:32:57] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.292 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:33:09] 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission db1149.eqiad.wmnet - https://phabricator.wikimedia.org/T357293 (10ABran-WMF) a:05ABran-WMF→03None [15:34:21] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db1133.eqiad.wmnet [15:34:55] (03PS1) 10Filippo Giunchedi: alertmanager: re-notify for SystemdUnitFailed after 24h [puppet] - 10https://gerrit.wikimedia.org/r/1003009 (https://phabricator.wikimedia.org/T357333) [15:35:01] 10SRE, 10Content-Transform-Team, 10MW-on-K8s, 10Traffic, and 2 others: Create parsoid mediawiki deployment - https://phabricator.wikimedia.org/T357392 (10Jdforrester-WMF) [15:37:00] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 
1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [15:37:13] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [15:37:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1160 (T352010)', diff saved to https://phabricator.wikimedia.org/P56703 and previous config saved to /var/cache/conftool/dbconfig/20240213-153720-ladsgroup.json [15:37:24] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [15:39:41] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [15:40:42] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncmonitor1001.eqiad.wmnet with OS bookworm [15:40:46] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10vm-requests: eqiad: 1 VM request for ncmonitor - https://phabricator.wikimedia.org/T356710 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncmonitor1001.eqiad.wmnet with OS bookworm executed with errors: - ncm... [15:41:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T352010)', diff saved to https://phabricator.wikimedia.org/P56704 and previous config saved to /var/cache/conftool/dbconfig/20240213-154100-ladsgroup.json [15:41:03] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [15:41:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [15:41:34] (03CR) 10Alexandros Kosiaris: "Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/999867 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [15:41:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] service mesh: Listen unconditionally on IPv6/IPv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/999867 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [15:42:31] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1133.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [15:42:41] (03Merged) 10jenkins-bot: service mesh: Listen unconditionally on IPv6/IPv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/999867 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [15:42:47] (03PS5) 10Alexandros Kosiaris: termbox: Bump module dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/999882 (https://phabricator.wikimedia.org/T255568) [15:43:32] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1133.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [15:43:32] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:43:33] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1133.eqiad.wmnet [15:44:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db1135.eqiad.wmnet [15:44:10] !log moving netbox links and pre-configuring lsw1-a4-codfw for servers before network move T355863 [15:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:14] T355863: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 [15:44:53] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: 
decommission db1133.eqiad.wmnet - https://phabricator.wikimedia.org/T357273 (10ABran-WMF) a:05ABran-WMF→03None [15:45:17] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1149.eqiad.wmnet - https://phabricator.wikimedia.org/T357293 (10ABran-WMF) [15:46:02] (03CR) 10Alexandros Kosiaris: [C: 03+2] termbox: Bump module dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/999882 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [15:47:04] (03Merged) 10jenkins-bot: termbox: Bump module dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/999882 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [15:50:21] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [15:50:30] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply [15:50:44] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply [15:52:00] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/termbox: apply [15:52:19] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1135.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [15:53:01] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply [15:53:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1135.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [15:53:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:53:30] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1135.eqiad.wmnet [15:53:39] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1135.eqiad.wmnet - https://phabricator.wikimedia.org/T357275 
(10ABran-WMF) [15:53:52] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1135.eqiad.wmnet - https://phabricator.wikimedia.org/T357275 (10ABran-WMF) a:05ABran-WMF→03None [15:54:21] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db1139.eqiad.wmnet [15:55:32] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/termbox: apply [15:56:15] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/termbox: apply [15:57:17] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2005.codfw.wmnet with OS bookworm [15:59:00] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10klausman) [15:59:08] 10SRE, 10ops-codfw, 10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10klausman) [15:59:44] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [16:00:05] eoghan, jelto, and arnoldokoth: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) SRE Collaboration Services office hours deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240213T1600). 
[16:02:04] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1139.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [16:02:07] (03CR) 10Jaime Nuche: "The entrypoint is already handling the deploy dir: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/docker-images/production-imag" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1002997 (https://phabricator.wikimedia.org/T259611) (owner: 10Hashar) [16:02:36] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA, and 2 others: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (10Jhancock.wm) Forgot to update earlier. Rack is physically ready [16:03:10] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1139.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [16:03:10] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:03:10] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1139.eqiad.wmnet [16:03:40] !log cmooney@cumin1002 START - Cookbook sre.hosts.remove-downtime for asw-a-codfw,cr[1-2]-codfw,lsw1-a4-codfw.mgmt [16:03:42] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for asw-a-codfw,cr[1-2]-codfw,lsw1-a4-codfw.mgmt [16:04:02] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1139.eqiad.wmnet - https://phabricator.wikimedia.org/T357287 (10ABran-WMF) a:05ABran-WMF→03None [16:04:12] !log cmooney@cumin1002 START - Cookbook sre.hosts.remove-downtime for 23 hosts [16:04:21] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 23 hosts [16:04:46] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host 
ncmonitor1001.eqiad.wmnet with OS bookworm [16:04:47] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db1140.eqiad.wmnet [16:05:31] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 23 hosts with reason: Migrating servers in codfw rack A4 to lsw1-a4-codfw [16:05:53] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 23 hosts with reason: Migrating servers in codfw rack A4 to lsw1-a4-codfw [16:06:00] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA, and 2 others: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=349240a0-30c3-4371-9418-7f1f46072237) set by cmooney@cumin1002 fo... [16:08:06] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance [16:08:20] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance [16:08:25] !log moving codfw rack a4 server links T355863 [16:08:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T352010)', diff saved to https://phabricator.wikimedia.org/P56707 and previous config saved to /var/cache/conftool/dbconfig/20240213-160826-ladsgroup.json [16:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:43] T355863: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 [16:08:54] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [16:09:26] 10SRE, 10Domains: Request donatewiki redirect - https://phabricator.wikimedia.org/T357436 (10AKanji-WMF) [16:09:28] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 1 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT 
+0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:10:06] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [16:10:30] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:11:26] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [16:11:28] (03CR) 10Ryan Kemper: [C: 03+1] Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/1002922 (https://phabricator.wikimedia.org/T354959) (owner: 10Muehlenhoff) [16:12:06] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1140.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [16:13:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1140.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [16:13:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:13:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1140.eqiad.wmnet [16:13:34] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 1 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:13:53] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1140.eqiad.wmnet - https://phabricator.wikimedia.org/T357288 (10ABran-WMF) a:05ABran-WMF→03None [16:14:16] jouncebot: now [16:14:16] For the next 0 hour(s) and 45 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240213T1600) [16:14:36] RECOVERY - HTTPS on lists1001 is 
OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 61 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:14:46] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db1144.eqiad.wmnet [16:14:47] (JobUnavailable) firing: Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:18:15] (03PS1) 10JHathaway: nginx: change reload to reload-or-restart [puppet] - 10https://gerrit.wikimedia.org/r/1003013 (https://phabricator.wikimedia.org/T342784) [16:18:17] (03PS1) 10JHathaway: puppetdb: Use the nginx certs [puppet] - 10https://gerrit.wikimedia.org/r/1003014 (https://phabricator.wikimedia.org/T342784) [16:18:53] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1003013 (https://phabricator.wikimedia.org/T342784) (owner: 10JHathaway) [16:19:00] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1003014 (https://phabricator.wikimedia.org/T342784) (owner: 10JHathaway) [16:21:17] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [16:21:37] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (10Jhancock.wm) rack is physically ready for tomorrow. [16:22:24] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1002929 (owner: 10Muehlenhoff) [16:23:04] (03CR) 10Andrea Denisse: [C: 03+2] grafana: Enable stunnel for Loki data transfer [puppet] - 10https://gerrit.wikimedia.org/r/994999 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [16:23:14] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1144.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [16:23:49] (JobUnavailable) resolved: Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:24:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1144.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [16:24:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:24:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1144.eqiad.wmnet [16:24:47] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db1145.eqiad.wmnet [16:25:12] (03PS2) 10Ssingh: templates: lower TTLs for dyna.wm.org and upload.wm.org to 300s [dns] - 10https://gerrit.wikimedia.org/r/1002585 (https://phabricator.wikimedia.org/T140365) [16:25:32] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1144.eqiad.wmnet - https://phabricator.wikimedia.org/T357289 (10ABran-WMF) a:05ABran-WMF→03None [16:26:43] (03PS2) 10Andrea Denisse: grafana: Enable stunnel for Loki data transfer [puppet] - 10https://gerrit.wikimedia.org/r/994999 (https://phabricator.wikimedia.org/T352665) [16:30:15] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox 
[16:30:21] (03PS1) 10Ssingh: wikimedia.org: add digicert to CAA record for email.wm.org [dns] - 10https://gerrit.wikimedia.org/r/1003017 (https://phabricator.wikimedia.org/T346394) [16:31:26] (03PS2) 10Slyngshede: Monitoring of PKI infrastructure certs. [alerts] - 10https://gerrit.wikimedia.org/r/999802 (https://phabricator.wikimedia.org/T350694) [16:32:21] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1145.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [16:32:28] (03CR) 10Andrea Denisse: [C: 03+2] grafana: Enable stunnel for Loki data transfer [puppet] - 10https://gerrit.wikimedia.org/r/994999 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [16:33:21] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1145.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [16:33:21] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:33:22] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1145.eqiad.wmnet [16:34:23] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1145.eqiad.wmnet - https://phabricator.wikimedia.org/T357290 (10ABran-WMF) a:05ABran-WMF→03None [16:34:28] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db1146.eqiad.wmnet [16:34:30] (03CR) 10Fabfur: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1003017 (https://phabricator.wikimedia.org/T346394) (owner: 10Ssingh) [16:34:34] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/1003014 (https://phabricator.wikimedia.org/T342784) (owner: 10JHathaway) [16:34:43] (03PS2) 10Ssingh: wikimedia.org: add digicert to CAA record for email.wm.org [dns] - 
10https://gerrit.wikimedia.org/r/1003017 (https://phabricator.wikimedia.org/T346394) [16:35:13] (03CR) 10Slyngshede: "The updated patch enables all, but now the team isn't totally correct, as different certificates may belong to different teams. Seeing as " [alerts] - 10https://gerrit.wikimedia.org/r/999802 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [16:35:22] (03CR) 10Ssingh: "comment added, no code change" [dns] - 10https://gerrit.wikimedia.org/r/1003017 (https://phabricator.wikimedia.org/T346394) (owner: 10Ssingh) [16:35:37] jhathaway, rzl: if there are no changes in the puppet request window yet, is it okay if I deploy some backports around that time? [16:35:51] (they’re currently going through gate-and-submit on master, I haven’t created the wmf cherry-picks yet) [16:35:55] (03CR) 10Ssingh: [C: 03+2] wikimedia.org: add digicert to CAA record for email.wm.org [dns] - 10https://gerrit.wikimedia.org/r/1003017 (https://phabricator.wikimedia.org/T346394) (owner: 10Ssingh) [16:36:03] Lucas_WMDE: sure go for it [16:36:11] Lucas_WMDE: yep have at it, if anything shows up at the last minute we'll coordinate with you [16:36:15] !log running authdns-update for CR 1003017: T346394 [16:36:15] ok thanks! 
[16:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:47] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/1003013 (https://phabricator.wikimedia.org/T342784) (owner: 10JHathaway) [16:39:19] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [16:40:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T352010)', diff saved to https://phabricator.wikimedia.org/P56709 and previous config saved to /var/cache/conftool/dbconfig/20240213-164021-ladsgroup.json [16:40:26] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [16:41:18] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1146.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [16:41:34] (03PS1) 10Lucas Werkmeister (WMDE): Use EditEntity for ItemMergeInteractor [extensions/Wikibase] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1002950 (https://phabricator.wikimedia.org/T356149) [16:41:40] (03PS1) 10Lucas Werkmeister (WMDE): Use EditEntity for MergeLexemesInteractor [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1002951 (https://phabricator.wikimedia.org/T356149) [16:41:46] (03PS1) 10Lucas Werkmeister (WMDE): Use EditEntity for ItemMergeInteractor [extensions/Wikibase] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1002952 (https://phabricator.wikimedia.org/T356149) [16:41:50] (03PS1) 10Lucas Werkmeister (WMDE): Use EditEntity for MergeLexemesInteractor [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1002953 (https://phabricator.wikimedia.org/T356149) [16:41:59] ^ the backports I’ll deploy soon :) [16:42:23] (03CR) 10BBlack: [C: 03+1] "Awesome!" 
[dns] - 10https://gerrit.wikimedia.org/r/1002585 (https://phabricator.wikimedia.org/T140365) (owner: 10Ssingh) [16:42:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1146.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [16:42:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:42:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1146.eqiad.wmnet [16:44:38] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1146.eqiad.wmnet - https://phabricator.wikimedia.org/T357292 (10ABran-WMF) a:05ABran-WMF→03None [16:45:23] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1002950 (https://phabricator.wikimedia.org/T356149) (owner: 10Lucas Werkmeister (WMDE)) [16:45:27] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1002951 (https://phabricator.wikimedia.org/T356149) (owner: 10Lucas Werkmeister (WMDE)) [16:45:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1002952 (https://phabricator.wikimedia.org/T356149) (owner: 10Lucas Werkmeister (WMDE)) [16:45:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1002953 (https://phabricator.wikimedia.org/T356149) (owner: 10Lucas Werkmeister (WMDE)) [16:48:40] 10SRE, 10serviceops: Container Image policy for non-k8s uses - https://phabricator.wikimedia.org/T357441 
(10MatthewVernon) [16:49:10] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA, and 2 others: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (10cmooney) All work completed, no issues to report :) [16:49:27] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ncmonitor1001.eqiad.wmnet with OS bookworm [16:52:14] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA, and 2 others: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (10MatthewVernon) Swift looks happy, thanks :) [16:55:12] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor, and 5 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914 (10Ladsgroup) @RHo Thank you! Something to consider is also looking at thumbnail sizes in other areas. Mostly notably: - In categories and galleries wher... [16:55:18] !log volans@cumin1002 START - Cookbook sre.hosts.downtime for 0:05:00 on sretest1001.eqiad.wmnet with reason: training [16:55:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P56712 and previous config saved to /var/cache/conftool/dbconfig/20240213-165527-ladsgroup.json [16:55:30] jouncebot nowandnext [16:55:30] For the next 0 hour(s) and 4 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240213T1600) [16:55:30] In 0 hour(s) and 4 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240213T1700) [16:55:31] !log volans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on sretest1001.eqiad.wmnet with reason: training [16:56:25] brennen: nothing planned for the puppet request window but Lucas_WMDE already called dibs for some backports, coordinate amongst yourselves for the prod conch :) [16:57:12] I 
also realized I’m in a meeting in a few minutes, silly me [16:57:21] I think I’ll still be able to `y` my scap backport [16:57:25] (there’s not much to test anyways) [16:58:02] rzl: thank you. Lucas_WMDE, if you would be so good as to let me know when you're finished? [16:58:11] brennen: will do [16:58:16] thx! [16:58:28] brennen: I can also cancel my scap backport and let you go first? [16:58:32] and then do my backports afterwards [16:59:06] (I’d still let the gate-and-submit go through, to not wait for that again, and then just sync it out a bit later) [16:59:40] Lucas_WMDE: if you don't mind, i'll go ahead. just re-running scap train-presync. hopefully won't take too long this go around. [16:59:53] ok, Ctrl+Ced my scap backport [17:00:05] jhathaway and rzl: Time to snap out of that daydream and deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240213T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:00:49] !log brennen@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.18 refs T354436 [17:01:18] T354436: 1.42.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T354436 [17:03:40] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 1 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:04:14] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2005.codfw.wmnet with OS bookworm [17:06:38] (03CR) 10CI reject: [V: 04-1] Use EditEntity for ItemMergeInteractor [extensions/Wikibase] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1002952 (https://phabricator.wikimedia.org/T356149) (owner: 10Lucas Werkmeister (WMDE)) [17:08:30] (03Merged) 10jenkins-bot: Use EditEntity for ItemMergeInteractor [extensions/Wikibase] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1002950 (https://phabricator.wikimedia.org/T356149) (owner: 10Lucas 
Werkmeister (WMDE)) [17:08:44] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 61 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:08:51] (03Merged) 10jenkins-bot: Use EditEntity for MergeLexemesInteractor [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1002951 (https://phabricator.wikimedia.org/T356149) (owner: 10Lucas Werkmeister (WMDE)) [17:09:06] 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T357445 (10phaultfinder) [17:09:37] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA, and 2 others: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (10cmooney) >>! In T355863#9538876, @MatthewVernon wrote: > Swift looks happy, thanks :) great, thanks for the update! [17:09:57] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services, 10Patch-For-Review: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 (10Dzahn) test instance etherpad-bookworm.devtools now has etherpad-lite 1.9.7-2 installed by puppet [17:10:01] (03Merged) 10jenkins-bot: Use EditEntity for ItemMergeInteractor [extensions/Wikibase] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1002952 (https://phabricator.wikimedia.org/T356149) (owner: 10Lucas Werkmeister (WMDE)) [17:10:29] (03Merged) 10jenkins-bot: Use EditEntity for MergeLexemesInteractor [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1002953 (https://phabricator.wikimedia.org/T356149) (owner: 10Lucas Werkmeister (WMDE)) [17:10:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P56713 and previous config saved to /var/cache/conftool/dbconfig/20240213-171034-ladsgroup.json [17:13:24] (03PS1) 10Andrea Denisse: Revert "grafana: temp disable rsync stunnel for 
puppet7 migration" [puppet] - 10https://gerrit.wikimedia.org/r/1002954 [17:13:52] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T357446 (10phaultfinder) [17:13:54] 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T357445 (10phaultfinder) [17:14:48] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 1 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:15:44] 10SRE, 10serviceops: Container Image policy for non-k8s uses - https://phabricator.wikimedia.org/T357441 (10akosiaris) I 'd argue that the policy already covers this, even if it isn't scoped (on purpose) outside of kubernetes production realms. The biggest issue isn't the non-Debian base but rather the fact t... [17:15:50] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 61 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:18:16] (03CR) 10Andrea Denisse: [C: 03+2] Revert "grafana: temp disable rsync stunnel for puppet7 migration" [puppet] - 10https://gerrit.wikimedia.org/r/1002954 (owner: 10Andrea Denisse) [17:19:16] (03CR) 10Ssingh: [C: 03+2] templates: lower TTLs for dyna.wm.org and upload.wm.org to 300s [dns] - 10https://gerrit.wikimedia.org/r/1002585 (https://phabricator.wikimedia.org/T140365) (owner: 10Ssingh) [17:19:37] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host restbase1035.mgmt.eqiad.wmnet with reboot policy FORCED [17:19:38] (03PS3) 10Ssingh: templates: lower TTLs for dyna.wm.org and upload.wm.org to 300s [dns] - 10https://gerrit.wikimedia.org/r/1002585 (https://phabricator.wikimedia.org/T140365) [17:19:51] (03CR) 10Ssingh: "rebased, no zone file change" [dns] - 10https://gerrit.wikimedia.org/r/1002585 (https://phabricator.wikimedia.org/T140365) (owner: 10Ssingh) [17:21:06] (03CR) 10Ssingh: [C: 03+2] "recheck" [dns] - 
10https://gerrit.wikimedia.org/r/1002585 (https://phabricator.wikimedia.org/T140365) (owner: 10Ssingh) [17:22:09] 10SRE, 10serviceops: Container Image policy for non-k8s uses - https://phabricator.wikimedia.org/T357441 (10MatthewVernon) Thanks for your comment. >>! In T357441#9539023, @akosiaris wrote: > The process to build images out of those isn't trivial, but it isn't difficult either. I was obviously unclear in wh... [17:23:33] !log running authdns-update to lower dyna TTLs: T140365 [17:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:38] T140365: Lower geodns TTLs from 600 (10min) to 300 (5min) - https://phabricator.wikimedia.org/T140365 [17:23:50] 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T357445 (10phaultfinder) [17:24:10] * Lucas_WMDE back [17:24:17] (03PS4) 10Jcrespo: mediabackups: Add newly setup storage host backup1011 [puppet] - 10https://gerrit.wikimedia.org/r/995188 (https://phabricator.wikimedia.org/T334069) [17:24:19] (03PS4) 10Jcrespo: mediabackups: Add newly setup storage host backup2011 [puppet] - 10https://gerrit.wikimedia.org/r/995189 (https://phabricator.wikimedia.org/T334069) [17:24:21] (03PS1) 10Jcrespo: mariadb: Reenable notifications on backup sources [puppet] - 10https://gerrit.wikimedia.org/r/1003050 (https://phabricator.wikimedia.org/T344036) [17:24:47] (SystemdUnitFailed) firing: dump_cloud_ip_ranges.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:25:14] (03PS2) 10Jcrespo: mariadb: Reenable notifications on backup sources [puppet] - 10https://gerrit.wikimedia.org/r/1003050 (https://phabricator.wikimedia.org/T344036) [17:25:29] !log brennen@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.18 refs T354436 (duration: 24m 39s) [17:25:36] T354436: 1.42.0-wmf.18 
deployment blockers - https://phabricator.wikimedia.org/T354436 [17:25:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T352010)', diff saved to https://phabricator.wikimedia.org/P56714 and previous config saved to /var/cache/conftool/dbconfig/20240213-172542-ladsgroup.json [17:25:45] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [17:25:46] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [17:25:51] ah, I was just wondering why I didn’t see the log in /var/lock/scap* anymore ^^ [17:25:59] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [17:26:01] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [17:26:05] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase1035.mgmt.eqiad.wmnet with reboot policy FORCED [17:26:14] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [17:26:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2156 (T352010)', diff saved to https://phabricator.wikimedia.org/P56715 and previous config saved to /var/cache/conftool/dbconfig/20240213-172620-ladsgroup.json [17:26:36] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host restbase1035.eqiad.wmnet with OS bullseye [17:26:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host restbase1035.eqiad.wmnet with OS bullseye [17:27:46] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 
2024-02-15 02:11:55 +0000 (expires in 1 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:29:03] !log brouberol@cumin1002 START - Cookbook sre.hosts.reimage for host an-test-ui1001.eqiad.wmnet with OS bullseye [17:30:02] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 61 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:32:33] brennen: are you still deploying? [17:36:29] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,name=cp202(7|8).codfw.wmnet [17:36:44] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 1 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:37:10] !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for cp[2027-2028].codfw.wmnet [17:37:12] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp[2027-2028].codfw.wmnet [17:37:52] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:38:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (10Jclark-ctr) [17:38:25] Lucas_WMDE: you're good to go [17:38:36] alright, thanks! [17:38:42] (apologies for delay.) 
[17:38:48] np [17:39:08] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 1 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:39:09] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:1002950|Use EditEntity for ItemMergeInteractor (T356149 T356764)]], [[gerrit:1002951|Use EditEntity for MergeLexemesInteractor (T356149 T356764)]], [[gerrit:1002952|Use EditEntity for ItemMergeInteractor (T356149 T356764)]], [[gerrit:1002953|Use EditEntity for MergeLexemesInteractor (T356149 T356764)]] [17:39:15] T356149: Adjust Item and Property Special Pages to not leak IPs when editing and IP masking is enabled - https://phabricator.wikimedia.org/T356149 [17:40:16] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 61 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:40:28] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1003009 (https://phabricator.wikimedia.org/T357333) (owner: 10Filippo Giunchedi) [17:40:40] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1002950|Use EditEntity for ItemMergeInteractor (T356149 T356764)]], [[gerrit:1002951|Use EditEntity for MergeLexemesInteractor (T356149 T356764)]], [[gerrit:1002952|Use EditEntity for ItemMergeInteractor (T356149 T356764)]], [[gerrit:1002953|Use EditEntity for MergeLexemesInteractor (T356149 T356764)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:41:03] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10BCornwall) [17:41:08] (03Abandoned) 10Andrea Denisse: grafana: Create the grafana sysuser with a reserved UID/GID [puppet] - 10https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [17:41:24] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10BCornwall) [17:41:32] (testing…) [17:41:35] !log brouberol@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-ui1001.eqiad.wmnet with reason: host reimage [17:41:49] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1035.eqiad.wmnet with reason: host reimage [17:41:59] (03CR) 10Ladsgroup: [C: 03+1] mariadb: Reenable notifications on backup sources [puppet] - 10https://gerrit.wikimedia.org/r/1003050 (https://phabricator.wikimedia.org/T344036) (owner: 10Jcrespo) [17:42:21] no obvious breakage, let’s go [17:42:23] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [17:43:54] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-ui1001.eqiad.wmnet with reason: host reimage [17:46:27] !log 
jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1035.eqiad.wmnet with reason: host reimage [17:47:46] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10ssingh) One more data point: note that `gnt-instance console FQDN` is broken because of T309724 so we don't know the exact failure. [17:49:21] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:1002950|Use EditEntity for ItemMergeInteractor (T356149 T356764)]], [[gerrit:1002951|Use EditEntity for MergeLexemesInteractor (T356149 T356764)]], [[gerrit:1002952|Use EditEntity for ItemMergeInteractor (T356149 T356764)]], [[gerrit:1002953|Use EditEntity for MergeLexemesInteractor (T356149 T356764)]] (duration: 10m 11s) [17:49:26] T356149: Adjust Item and Property Special Pages to not leak IPs when editing and IP masking is enabled - https://phabricator.wikimedia.org/T356149 [17:50:09] jouncebot: nowandnext [17:50:09] For the next 0 hour(s) and 9 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240213T1700) [17:50:10] In 0 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240213T1800) [17:50:26] I’m watching logstash for a bit but otherwise done deploying [17:50:34] (03CR) 10Ladsgroup: [C: 03+2] ruwiki: Add 'edituserjson' right to 'engineers' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992206 (https://phabricator.wikimedia.org/T355499) (owner: 10Ammarpad) [17:51:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992206 (https://phabricator.wikimedia.org/T355499) (owner: 10Ammarpad) [17:51:36] (03Merged) 10jenkins-bot: ruwiki: Add 'edituserjson' right to 'engineers' group [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/992206 (https://phabricator.wikimedia.org/T355499) (owner: 10Ammarpad) [17:52:02] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:992206|ruwiki: Add 'edituserjson' right to 'engineers' group (T355499)]] [17:52:07] T355499: Add "edituserjson" permission to ruwiki's "engineer" usergroup - https://phabricator.wikimedia.org/T355499 [17:53:28] !log ladsgroup@deploy2002 ammarpad and ladsgroup: Backport for [[gerrit:992206|ruwiki: Add 'edituserjson' right to 'engineers' group (T355499)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:53:31] !log ladsgroup@deploy2002 ammarpad and ladsgroup: Continuing with sync [17:54:00] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10Volans) The cookbook doesn't reboot the host once in the Debian Installer, it's the Debian Installer that reboots the hosts once the base installation is c... 
[17:56:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T352010)', diff saved to https://phabricator.wikimedia.org/P56716 and previous config saved to /var/cache/conftool/dbconfig/20240213-175617-ladsgroup.json [17:56:23] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240213T1800) [18:00:31] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:992206|ruwiki: Add 'edituserjson' right to 'engineers' group (T355499)]] (duration: 08m 28s) [18:00:47] T355499: Add "edituserjson" permission to ruwiki's "engineer" usergroup - https://phabricator.wikimedia.org/T355499 [18:01:47] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:04:09] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10ssingh) >>! In T357449#9539221, @Volans wrote: > The cookbook doesn't reboot the host once in the Debian Installer, it's the Debian Installer that reboots... [18:06:20] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 1 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:07:28] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:09:47] (ProbeDown) firing: Service mw-web:4450 has failed probes (http_mw-web_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-web:4450 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:11:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P56717 and previous config saved to /var/cache/conftool/dbconfig/20240213-181124-ladsgroup.json [18:13:49] (ProbeDown) resolved: Service mw-web:4450 has failed probes (http_mw-web_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-web:4450 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:17:42] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 1 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:19:58] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:26:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P56718 and previous config saved to /var/cache/conftool/dbconfig/20240213-182630-ladsgroup.json [18:30:02] (03PS1) 10Brouberol: hue: uninstall hue from an-test-ui as it's being deprecated [puppet] - 10https://gerrit.wikimedia.org/r/1003067 (https://phabricator.wikimedia.org/T357448) [18:31:35] (03CR) 10Brouberol: [C: 03+2] hue: uninstall hue from an-test-ui as it's being deprecated [puppet] - 10https://gerrit.wikimedia.org/r/1003067 (https://phabricator.wikimedia.org/T357448) (owner: 10Brouberol) [18:36:59] 10SRE, 10Domains, 10Fundraising-Backlog: Request donatewiki redirect - https://phabricator.wikimedia.org/T357436 (10ssingh) [18:38:16] (03PS1) 10Brouberol: hue: remove any hue-related hiera for an-test-ui [puppet] - 10https://gerrit.wikimedia.org/r/1003069 (https://phabricator.wikimedia.org/T357448) [18:39:04] (03PS2) 10Brouberol: hue: remove any hue-related hiera for an-test-ui [puppet] - 10https://gerrit.wikimedia.org/r/1003069 (https://phabricator.wikimedia.org/T357448) [18:40:53] (03CR) 10Bking: [C: 03+1] hue: remove any hue-related hiera for an-test-ui [puppet] - 10https://gerrit.wikimedia.org/r/1003069 (https://phabricator.wikimedia.org/T357448) (owner: 10Brouberol) [18:41:19] (03CR) 10Brouberol: [C: 03+2] hue: remove any hue-related hiera for an-test-ui [puppet] - 10https://gerrit.wikimedia.org/r/1003069 (https://phabricator.wikimedia.org/T357448) (owner: 10Brouberol) [18:41:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T352010)', diff saved to https://phabricator.wikimedia.org/P56720 and previous config saved to /var/cache/conftool/dbconfig/20240213-184137-ladsgroup.json [18:41:39] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance 
[18:41:42] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [18:41:53] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance [18:41:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2177 (T352010)', diff saved to https://phabricator.wikimedia.org/P56721 and previous config saved to /var/cache/conftool/dbconfig/20240213-184159-ladsgroup.json [18:42:48] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 1 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:43:21] (03PS1) 10Dzahn: phabricator: re-activate public dump job [puppet] - 10https://gerrit.wikimedia.org/r/1003070 (https://phabricator.wikimedia.org/T355502) [18:43:56] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:43:59] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-test-ui1001.eqiad.wmnet with OS bullseye [18:45:07] (03PS2) 10Dzahn: phabricator: re-activate public dump job [puppet] - 10https://gerrit.wikimedia.org/r/1003070 (https://phabricator.wikimedia.org/T355502) [18:46:26] (03PS2) 10Dzahn: site: add etherpad role to etherpad1004 [puppet] - 10https://gerrit.wikimedia.org/r/999973 (https://phabricator.wikimedia.org/T316421) [18:47:17] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10BCornwall) Thanks for the response, @Volans >>! In T357449#9539221, @Volans wrote: > The cookbook doesn't reboot the host once in the Debian Installer, i... 
[18:47:30] (03PS1) 10CDobbins: P:dns::recursor dns-recursor: small change to experiment with Gerrit [puppet] - 10https://gerrit.wikimedia.org/r/1003071 [18:48:27] (03PS2) 10Ssingh: P:dns::recursor: small change to experiment with Gerrit [puppet] - 10https://gerrit.wikimedia.org/r/1003071 (owner: 10CDobbins) [18:50:05] (03PS1) 10Dzahn: site: apply etherpad role on both eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/1003073 (https://phabricator.wikimedia.org/T316421) [18:50:22] (03CR) 10Dzahn: [C: 03+1] P:dns::recursor: small change to experiment with Gerrit [puppet] - 10https://gerrit.wikimedia.org/r/1003071 (owner: 10CDobbins) [18:51:49] (03CR) 10CDobbins: [V: 03+2 C: 03+2] P:dns::recursor: small change to experiment with Gerrit [puppet] - 10https://gerrit.wikimedia.org/r/1003071 (owner: 10CDobbins) [18:53:04] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 1 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:53:28] (03PS1) 10Dzahn: site: remove etherpad on bullseye machine [puppet] - 10https://gerrit.wikimedia.org/r/1003075 (https://phabricator.wikimedia.org/T316421) [18:55:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 41.34% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:55:20] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:00:04] jeena and brennen: MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240213T1900). Please do the needful. 
[19:00:17] o/ [19:01:38] !log train 1.42.0-wmf.18 (T354436): no current blockers, rolling to group0. [19:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:54] T354436: 1.42.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T354436 [19:02:20] (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003082 (https://phabricator.wikimedia.org/T354436) [19:02:22] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003082 (https://phabricator.wikimedia.org/T354436) (owner: 10TrainBranchBot) [19:02:32] (03PS1) 10Krinkle: Profiler: Silence "RedisException: Connection timed out" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003083 (https://phabricator.wikimedia.org/T348756) [19:03:24] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003082 (https://phabricator.wikimedia.org/T354436) (owner: 10TrainBranchBot) [19:05:50] (03CR) 10Krinkle: "MW devs keep finding these from time to time at https://logstash.wikimedia.org/app/dashboards#/view/mediawiki-errors, so maybe this makes " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003083 (https://phabricator.wikimedia.org/T348756) (owner: 10Krinkle) [19:07:27] (03PS1) 10Dzahn: nagios_common/planet: remove check_lastmod check, script and config [puppet] - 10https://gerrit.wikimedia.org/r/1003084 (https://phabricator.wikimedia.org/T353298) [19:10:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 45.81% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:11:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db2177 (T352010)', diff saved to https://phabricator.wikimedia.org/P56722 and previous config saved to /var/cache/conftool/dbconfig/20240213-191142-ladsgroup.json [19:11:45] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.18 refs T354436 [19:11:50] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [19:11:56] T354436: 1.42.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T354436 [19:15:59] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 1 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:17:01] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:20:11] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 1 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:20:21] PROBLEM - Dell PowerEdge RAID Controller on an-worker1173 is CRITICAL: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [19:20:22] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-worker1173 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T357460 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [19:20:26] 10SRE, 10ops-eqiad: Degraded RAID on an-worker1173 - https://phabricator.wikimedia.org/T357460 (10ops-monitoring-bot) [19:21:17] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:26:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P56723 and previous config saved to /var/cache/conftool/dbconfig/20240213-192648-ladsgroup.json [19:38:03] 10SRE, 10Domains, 10Fundraising-Backlog, 10serviceops: Request donatewiki redirect - https://phabricator.wikimedia.org/T357436 (10RLazarus) Hi from Service Ops SRE! @AKanji-WMF How long would you like the redirect to stay active? Adding @Jgreen and @Dwisehaupt from FR-Tech SRE, as I'm not sure where we'v... [19:41:27] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 1 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:41:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P56724 and previous config saved to /var/cache/conftool/dbconfig/20240213-194155-ladsgroup.json [19:42:37] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:43:36] (03CR) 10Jforrester: [C: 03+1] Profiler: Silence "RedisException: Connection timed out" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003083 (https://phabricator.wikimedia.org/T348756) (owner: 10Krinkle) [19:44:09] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 1 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:44:39] (03CR) 10Ladsgroup: [C: 03+1] mariadb: disable systematic wiping of /srv on db2194 [puppet] - 10https://gerrit.wikimedia.org/r/1002410 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [19:46:27] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 61 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:53:36] (03PS6) 10Gmodena: add webrequest.frontend stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983905 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [19:57:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T352010)', diff saved to https://phabricator.wikimedia.org/P56725 and previous config saved to /var/cache/conftool/dbconfig/20240213-195701-ladsgroup.json [19:57:04] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2190.codfw.wmnet with reason: Maintenance [19:57:06] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [19:57:18] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2190.codfw.wmnet with reason: Maintenance [19:57:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T352010)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240213-195724-ladsgroup.json [20:01:17] PROBLEM - HTTPS on lists1001 
is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 1 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:02:25] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 61 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:10:06] (03PS1) 10Dzahn: icinga: delete monitoring class for planet [puppet] - 10https://gerrit.wikimedia.org/r/1003098 [20:11:00] (03CR) 10Dzahn: "the question was still if anything else monitors expiry of the cert and letsencrypt certs in general" [puppet] - 10https://gerrit.wikimedia.org/r/1003098 (owner: 10Dzahn) [20:13:48] jouncebot nowandnext [20:13:48] For the next 0 hour(s) and 46 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240213T1900) [20:13:48] In 0 hour(s) and 46 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240213T2100) [20:14:10] going to roll out a small phabricator/phorge deploy [20:20:00] !log brennen@deploy2002 Started deploy [phabricator/deployment@f4a7f50]: test deploy to phab2002 for T357464 [20:20:13] T357464: Deploy Phabricator/Phorge week of 2024-02-12 - https://phabricator.wikimedia.org/T357464 [20:20:29] !log brennen@deploy2002 Finished deploy [phabricator/deployment@f4a7f50]: test deploy to phab2002 for T357464 (duration: 00m 29s) [20:21:18] !log brennen@deploy2002 Started deploy [phabricator/deployment@f4a7f50]: deploy to phab1004 for T357464 [20:22:07] !log brennen@deploy2002 Finished deploy [phabricator/deployment@f4a7f50]: deploy to phab1004 for T357464 (duration: 00m 48s) [20:22:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T352010)', diff saved to https://phabricator.wikimedia.org/P56727 and previous config saved to /var/cache/conftool/dbconfig/20240213-202254-ladsgroup.json [20:23:00] T352010: 
Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [20:23:54] !log phab1004 - running public_task_dump.py T355502 [20:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:58] T355502: phabricator_task_dump.service Failed on phab1004 - https://phabricator.wikimedia.org/T355502 [20:26:01] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 1 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:27:09] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:27:11] 10SRE, 10Traffic: Lower geodns TTLs from 600 (10min) to 300 (5min) - https://phabricator.wikimedia.org/T140365 (10ssingh) 05Open→03Resolved We have rolled this out today. For a complete list of domains affected, see the commit above. 
[20:38:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P56728 and previous config saved to /var/cache/conftool/dbconfig/20240213-203800-ladsgroup.json [20:38:10] (03PS6) 10BCornwall: fifo-log-demux: Decouple service from nginx/ats [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) [20:38:22] (03CR) 10BCornwall: fifo-log-demux: Decouple service from nginx/ats (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: 10BCornwall) [20:52:46] (03PS3) 10Bking: cloudelastic: Begin private IP migration for cloudelastic1007 [puppet] - 10https://gerrit.wikimedia.org/r/999088 (https://phabricator.wikimedia.org/T355617) [20:53:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P56729 and previous config saved to /var/cache/conftool/dbconfig/20240213-205307-ladsgroup.json [20:53:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P56730 and previous config saved to /var/cache/conftool/dbconfig/20240213-205308-ladsgroup.json [20:57:42] (03PS1) 10Bking: cloudelastic: Add already-migrated hosts as master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1003100 (https://phabricator.wikimedia.org/T355617) [20:59:35] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_codfw [20:59:37] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_codfw [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240213T2100). [21:00:05] No Gerrit patches in the queue for this window AFAICS. 
[21:06:37] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 1 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:07:45] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 61 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:08:12] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1003100 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [21:08:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T352010)', diff saved to https://phabricator.wikimedia.org/P56731 and previous config saved to /var/cache/conftool/dbconfig/20240213-210813-ladsgroup.json [21:08:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P56732 and previous config saved to /var/cache/conftool/dbconfig/20240213-210814-ladsgroup.json [21:08:20] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [21:23:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T352010)', diff saved to https://phabricator.wikimedia.org/P56733 and previous config saved to /var/cache/conftool/dbconfig/20240213-212321-ladsgroup.json [21:23:23] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [21:23:26] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [21:23:37] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [21:23:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1190 (T352010)', diff saved to https://phabricator.wikimedia.org/P56734 and previous config saved to 
/var/cache/conftool/dbconfig/20240213-212343-ladsgroup.json [21:23:45] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 1 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:24:47] (SystemdUnitFailed) firing: dump_cloud_ip_ranges.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:24:55] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 61 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:26:41] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 1 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:27:51] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:28:25] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 1 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:29:33] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 61 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:43:15] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 1 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:45:33] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 61 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:53:01] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 1 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:54:09] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:56:37] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10Volans) I think there is some confusion, let me clarify some things: 1) BusyBox is the environment available during debian installer. That's totally norma... [21:57:34] 10SRE, 10Ganeti, 10Infrastructure-Foundations: SSH host key verification failures in Ganeti intra node SSH calls after Bullseye update - https://phabricator.wikimedia.org/T309724 (10Volans) Should we have `/var/lib/ganeti/known_hosts` be managed by Puppet? 
[22:05:32] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10Volans) I think there is some confusion, let me clarify some things: 1) BusyBox is the environment available during debian installer. That's totally norma... [22:08:26] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10Volans) I think there is some confusion, let me clarify some things: 1) BusyBox is the environment available during debian installer. That's totally norma... [22:16:50] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10Volans) I think there is some confusion, let me clarify some things: 1) BusyBox is the environment available during debian installer. That's totally norma... [22:19:07] weird, is wikibugs sending the same message multiple times? [22:37:27] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 1 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:38:37] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:45:27] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 1 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:46:35] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:56:31] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:56:49] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:57:40] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10Dzahn) >>! In T357449#9540286, @Volans wrote: > 2) If you run `sudo gnt-instance console --show-cmd ncmonitor1001.eqiad.wmnet` it's very easy to see the co... [22:58:11] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:01:42] jouncebot: now [23:01:42] No deployments scheduled for the next 7 hour(s) and 58 minute(s) [23:04:30] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10Dzahn) > just a normal d-i partman configuration issue The code change from https://gerrit.wikimedia.org/r/c/operations/puppet/+/1002674/3/modules/profil... [23:04:47] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51451 bytes in 0.112 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:04:59] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:05:37] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.261 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:10:32] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10Volans) @Dzahn you can get a working console either setting the known hosts files to /dev/null and the strict checking to no in the ssh command running it... [23:11:56] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): volatile: We need to configure the volatile endpoint on puppetserveres - https://phabricator.wikimedia.org/T341056 (10jhathaway) @Muehlenhoff I think that makes sense, are the updates run manually whe... [23:24:16] 10SRE, 10SRE-Access-Requests: Requesting access to user information table for rkhan - https://phabricator.wikimedia.org/T357483 (10Himejijo) [23:42:47] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10Volans) @Dzahn you can get a working console either setting the known hosts files to /dev/null and the strict checking to no in the ssh command running it... [23:44:25] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10Dzahn) @Volans Ah, yes, i can get a console when running `sudo /usr/lib/ganeti/tools/kvm-console-wrapper /usr/bin/socat ncmonitor1001.eqiad.wmnet /var/run/... [23:48:10] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10Dzahn) @BCornwall .. but then after installing the base system it fails at installing grub in /dev/sda.. which is not expected. 
[23:55:12] !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host ncmonitor1001.eqiad.wmnet with OS bookworm