[00:02:25] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:03:13] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 127, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:08:45] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[00:08:58] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[00:32:07] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:38:33] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/986833
[00:38:39] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/986833 (owner: TrainBranchBot)
[00:42:08] (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:55:09] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[00:55:25] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[00:59:38] (CR) Krinkle: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (6 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: Ottomata)
[01:03:06] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/986833 (owner: TrainBranchBot)
[01:13:45] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[01:13:48] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[01:15:21] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:15:33] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:15:51] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:16:32] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[01:16:36] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[01:16:55] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51305 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:17:13] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.298 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:37:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:43:46] (PS1) Samwilson: Edit Recovery: Add config.json to special page and postEdit [core] (wmf/1.42.0-wmf.12) - https://gerrit.wikimedia.org/r/986657 (https://phabricator.wikimedia.org/T354167)
[02:04:45] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-wikifunctions_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:39] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:37:08] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:02:15] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:10:57] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:12:08]
(JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:37:42] (CR) Samwilson: [C: +2] Edit Recovery: Add config.json to special page and postEdit [core] (wmf/1.42.0-wmf.12) - https://gerrit.wikimedia.org/r/986657 (https://phabricator.wikimedia.org/T354167) (owner: Samwilson)
[04:54:48] (Merged) jenkins-bot: Edit Recovery: Add config.json to special page and postEdit [core] (wmf/1.42.0-wmf.12) - https://gerrit.wikimedia.org/r/986657 (https://phabricator.wikimedia.org/T354167) (owner: Samwilson)
[04:59:49] (PS5) KartikMistry: Update cxserver to 2023-12-04-083437-production [deployment-charts] - https://gerrit.wikimedia.org/r/977983 (https://phabricator.wikimedia.org/T344982)
[05:00:41] (CR) CI reject: [V: -1] Update cxserver to 2023-12-04-083437-production [deployment-charts] - https://gerrit.wikimedia.org/r/977983 (https://phabricator.wikimedia.org/T344982) (owner: KartikMistry)
[05:13:18] (PS6) KartikMistry: Update cxserver to 2023-12-04-083437-production [deployment-charts] - https://gerrit.wikimedia.org/r/977983 (https://phabricator.wikimedia.org/T344982)
[05:18:49] PROBLEM - OpenSearch health check for shards on 9200 on logstash2033 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out.
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:24:51] PROBLEM - MD RAID on logstash2033 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[05:32:22] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on logstash2033:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[05:56:02] SRE, Bitu, Infrastructure-Foundations: Automatic detection of inactive LDAP account - https://phabricator.wikimedia.org/T335478 (Aklapper) @SLyngshede-WMF: Is this a solution in search of a problem maybe? If there is a problem, is there data how big the problem is?
[06:11:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:13:39] (CR) Legoktm: [C: +1] "Awesome" [puppet] - https://gerrit.wikimedia.org/r/987120 (https://phabricator.wikimedia.org/T354069) (owner: Hashar)
[06:15:52] SRE, Stewards-Onboarding-Tool, Wikimedia-Mailing-lists: stewards1001 / stewards2001: Enable API access for Mailman3 - https://phabricator.wikimedia.org/T351202 (Legoktm)
[06:15:55] SRE, Wikimedia-Mailing-lists: Expose mailman3 internal REST API inside Wikimedia production network - https://phabricator.wikimedia.org/T279023 (Legoktm)
[06:19:46] SRE, Stewards-Onboarding-Tool, Wikimedia-Mailing-lists: stewards1001 / stewards2001: Enable API access for Mailman3 - https://phabricator.wikimedia.org/T351202 (Legoktm) Sorry for the late response. What @Peachey88 linked to is basically still up to date on the lack of API access inside the prod netw...
[06:26:49] (PS1) Legoktm: lists: Allow images from upload.wikimedia.org in CSP [puppet] - https://gerrit.wikimedia.org/r/987317 (https://phabricator.wikimedia.org/T353755)
[06:28:45] SRE-Sprint-Week-Sustainability-March2023, SRE-swift-storage, Data-Persistence, Sustainability (Incident Followup): 2022-08-24 swift incident (tracking) - https://phabricator.wikimedia.org/T317358 (Aklapper) 16 months later, can we realistically agree that no further work will happen and resolve t...
[06:30:31] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:30:59] SRE, Wikimedia-Mailing-lists, ContentSecurityPolicy, Patch-For-Review: Icon of daily-image-l broken by CSP - https://phabricator.wikimedia.org/T353755 (Legoktm) a:Legoktm
[06:32:25] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:46:41] (CR) Marostegui: [C: +1] Bring dbstore1009 into service [puppet] - https://gerrit.wikimedia.org/r/987157 (https://phabricator.wikimedia.org/T351924) (owner: Btullis)
[06:51:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240103T0700)
[07:04:29] (PS1) Marostegui: pc2015: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/987357
[07:05:34] (CR) Marostegui: [C: +2] pc2015: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/987357 (owner: Marostegui)
[07:14:46] (CR) Hashar:
[C: +1] "Thank you for the verification!" [puppet] - https://gerrit.wikimedia.org/r/984800 (owner: Muehlenhoff)
[07:19:17] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:26:19] (PS1) Marostegui: db1151: Migrate to 10.6 [puppet] - https://gerrit.wikimedia.org/r/987359
[07:27:12] (CR) Marostegui: [C: +2] db1151: Migrate to 10.6 [puppet] - https://gerrit.wikimedia.org/r/987359 (owner: Marostegui)
[07:28:31] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:33:49] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:44:09] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:51:30] SRE, Bitu, Infrastructure-Foundations: Automatic detection of inactive LDAP account - https://phabricator.wikimedia.org/T335478 (MoritzMuehlenhoff) >>! In T335478#9417062, @bd808 wrote: >> Define what it means for an account to be inactive. > > What is the thinking that leads to an assumption that "...
[08:00:04] Amir1 and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240103T0800).
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:00:15] (Abandoned) Giuseppe Lavagetto: mobileapps: move 20% of replicas to k8s [deployment-charts] - https://gerrit.wikimedia.org/r/973184 (https://phabricator.wikimedia.org/T350846) (owner: Giuseppe Lavagetto)
[08:18:25] (CR) Peter Fischer: [C: +2] Search update pipeline: enable kafka partition discovery [deployment-charts] - https://gerrit.wikimedia.org/r/987160 (https://phabricator.wikimedia.org/T354064) (owner: Peter Fischer)
[08:19:26] (Merged) jenkins-bot: Search update pipeline: enable kafka partition discovery [deployment-charts] - https://gerrit.wikimedia.org/r/987160 (https://phabricator.wikimedia.org/T354064) (owner: Peter Fischer)
[08:20:57] SRE, Data-Platform-SRE, serviceops, Discovery-Search (Current work), Patch-For-Review: SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (CodeReviewBot) pfischer merged https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_req...
[08:28:24] (PS1) Muehlenhoff: Remove access for nickifeajika [puppet] - https://gerrit.wikimedia.org/r/987394
[08:33:01] (CR) Majavah: [C: -1] lists: Allow images from upload.wikimedia.org in CSP (1 comment) [puppet] - https://gerrit.wikimedia.org/r/987317 (https://phabricator.wikimedia.org/T353755) (owner: Legoktm)
[08:35:25] (CR) Muehlenhoff: [C: +2] Remove access for nickifeajika [puppet] - https://gerrit.wikimedia.org/r/987394 (owner: Muehlenhoff)
[08:37:50] (PS1) Peter Fischer: Search update pipeline: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/987395 (https://phabricator.wikimedia.org/T354064)
[08:38:25] (CR) Peter Fischer: [C: +2] Search update pipeline: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/987395 (https://phabricator.wikimedia.org/T354064) (owner: Peter Fischer)
[08:39:26] (Merged) jenkins-bot: Search update pipeline: bump version [deployment-charts] -
https://gerrit.wikimedia.org/r/987395 (https://phabricator.wikimedia.org/T354064) (owner: Peter Fischer)
[08:40:14] SRE, MW-on-K8s, WMF-JobQueue, serviceops: Moving jobs to MW-on-k8s decreased their timeout from 1200s to 200s - https://phabricator.wikimedia.org/T354229 (Joe) Yes, your understanding is correct; I had a patch fixing this that never got merged, I should just make a new version of that.
[08:44:09] (PS1) Slyngshede: Add warning for OOM killer. [alerts] - https://gerrit.wikimedia.org/r/987398 (https://phabricator.wikimedia.org/T350694)
[08:47:00] (PS1) Muehlenhoff: Remove LDAP access for kassiameq [puppet] - https://gerrit.wikimedia.org/r/987399
[08:48:53] (CR) Muehlenhoff: [C: +2] Remove LDAP access for kassiameq [puppet] - https://gerrit.wikimedia.org/r/987399 (owner: Muehlenhoff)
[08:54:08] (PS4) Giuseppe Lavagetto: Disable things that don't work on k8s when on k8s [mediawiki-config] - https://gerrit.wikimedia.org/r/951049
[08:54:10] (PS1) Giuseppe Lavagetto: Fix timeouts detection on mw on k8s jobrunners [mediawiki-config] - https://gerrit.wikimedia.org/r/987400 (https://phabricator.wikimedia.org/T354229)
[08:54:16] <_joe_> jouncebot: nowandnext
[08:54:16] For the next 0 hour(s) and 5 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240103T0800)
[08:54:16] In 2 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240103T1100)
[08:54:41] <_joe_> urbanecm: around?
I have a fix for T354229 and I'd like to deploy it asap
[08:54:42] T354229: Moving jobs to MW-on-k8s decreased their timeout from 1200s to 200s - https://phabricator.wikimedia.org/T354229
[08:54:46] <_joe_> but I'd use a review
[08:55:50] (CR) Slyngshede: "I've avoided alerting on Kubernetes for now, as the CODFW is running the OOM killer constantly" [alerts] - https://gerrit.wikimedia.org/r/987398 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:56:46] (PS1) MVernon: Correct typo in cumin1001 warning message [puppet] - https://gerrit.wikimedia.org/r/987401 (https://phabricator.wikimedia.org/T353419)
[08:58:10] (CR) Majavah: [C: -1] Disable things that don't work on k8s when on k8s (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/951049 (owner: Giuseppe Lavagetto)
[08:58:16] (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/987401 (https://phabricator.wikimedia.org/T353419) (owner: MVernon)
[08:59:01] (CR) Marostegui: [C: +1] Correct typo in cumin1001 warning message [puppet] - https://gerrit.wikimedia.org/r/987401 (https://phabricator.wikimedia.org/T353419) (owner: MVernon)
[09:02:19] (CR) Majavah: [C: +1] "One question, otherwise LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/987400 (https://phabricator.wikimedia.org/T354229) (owner: Giuseppe Lavagetto)
[09:03:14] _joe_: not Martin, but hopefully my reviews are of use too
[09:03:32] <_joe_> taavi: absolutely, I went to him because he opened the task
[09:03:56] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[09:05:05] (CR) MVernon: [C: +2] Correct typo in cumin1001 warning message [puppet] - https://gerrit.wikimedia.org/r/987401 (https://phabricator.wikimedia.org/T353419) (owner: MVernon)
[09:07:23] (PS5) Giuseppe Lavagetto: Disable things that don't work on k8s when on k8s [mediawiki-config] -
https://gerrit.wikimedia.org/r/951049
[09:07:25] (PS2) Giuseppe Lavagetto: Fix timeouts detection on mw on k8s jobrunners [mediawiki-config] - https://gerrit.wikimedia.org/r/987400 (https://phabricator.wikimedia.org/T354229)
[09:07:46] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[09:08:00] (CR) Giuseppe Lavagetto: Disable things that don't work on k8s when on k8s (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/951049 (owner: Giuseppe Lavagetto)
[09:08:30] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:10:28] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[09:10:32] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:12:50] (CR) Majavah: [C: +1] Disable things that don't work on k8s when on k8s [mediawiki-config] - https://gerrit.wikimedia.org/r/951049 (owner: Giuseppe Lavagetto)
[09:13:12] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[09:16:01] (CR) Volans: [C: +2] Use setuptools_scm to set the version [software/debmonitor-client] - https://gerrit.wikimedia.org/r/987155 (owner: Volans)
[09:16:56] (CR) Volans: "recheck" [puppet] - https://gerrit.wikimedia.org/r/979098 (https://phabricator.wikimedia.org/T351059) (owner: Volans)
[09:17:11] (CR) CI reject: [V: -1] CI: test apt_repo failures [puppet] - https://gerrit.wikimedia.org/r/979098 (https://phabricator.wikimedia.org/T351059) (owner: Volans)
[09:17:37] (Merged) jenkins-bot: Use setuptools_scm to set the version [software/debmonitor-client] - https://gerrit.wikimedia.org/r/987155 (owner: Volans)
[09:21:00] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:21:05] !log pfischer@deploy2002 helmfile [staging]
START helmfile.d/services/cirrus-streaming-updater: apply
[09:21:06] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:21:06] (Abandoned) Volans: CI: test apt_repo failures [puppet] - https://gerrit.wikimedia.org/r/979098 (https://phabricator.wikimedia.org/T351059) (owner: Volans)
[09:21:14] (CR) Muehlenhoff: [C: +2] vtrs: Switch rsync service to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/987149 (owner: Muehlenhoff)
[09:21:43] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[09:21:44] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:26:36] (PS1) Muehlenhoff: peopleweb: Switch rsync service to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/987402
[09:30:21] (PS1) Muehlenhoff: rancid/librenms: Switch rsync service to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/987403
[09:30:55] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/987402 (owner: Muehlenhoff)
[09:31:02] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[09:31:03] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:32:02] (CR) Clément Goubert: [C: +1] Fix timeouts detection on mw on k8s jobrunners [mediawiki-config] - https://gerrit.wikimedia.org/r/987400 (https://phabricator.wikimedia.org/T354229) (owner: Giuseppe Lavagetto)
[09:32:30] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[09:32:31] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:32:37] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on logstash2033:9100 -
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[09:33:57] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[09:33:59] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:35:26] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[09:35:27] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:36:38] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[09:36:39] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:37:31] (PS2) Muehlenhoff: peopleweb: Switch rsync service to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/987402
[09:37:44] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/987402 (owner: Muehlenhoff)
[09:38:01] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/987403 (owner: Muehlenhoff)
[09:39:03] (CR) Alexandros Kosiaris: [C: +1] Remove throttle exception [mediawiki-config] - https://gerrit.wikimedia.org/r/987031 (https://phabricator.wikimedia.org/T352569) (owner: Giuseppe Lavagetto)
[09:39:59] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[09:40:00] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:44:16] (PS1) Muehlenhoff: doc: Switch rsync service to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/987406
[09:44:26] (PS2) Muehlenhoff: doc: Switch rsync service to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/987406
[09:44:59] I'm seeing a few
ongoing alerts of the kind "No Puppet resources found on instance X on project Y". Are they new-ish? "instance X" seems to be indeed often defunct and maybe has some leftover that need to be cleaned up, but I'm not sure where to look for them. Does someone have some pointers?
[09:48:35] (PS1) Muehlenhoff: grafana: Switch rsync service to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/987407
[09:49:24] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/987406 (owner: Muehlenhoff)
[09:50:20] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/987407 (owner: Muehlenhoff)
[09:57:10] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[10:01:03] PROBLEM - Query Service HTTP Port on wdqs1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[10:08:16] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[10:09:07] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:10:05] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[10:11:06] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:11:08] (CR) Alexandros Kosiaris: [C: +1] Always process media files via shellbox on k8s [mediawiki-config] - https://gerrit.wikimedia.org/r/987033 (https://phabricator.wikimedia.org/T352515) (owner: Giuseppe Lavagetto)
[10:11:38] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[10:13:40] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[10:15:21] !log pfischer@deploy2002 helmfile [staging] START
helmfile.d/services/cirrus-streaming-updater: apply
[10:16:25] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[10:19:04] ACKNOWLEDGEMENT - MD RAID on logstash2033 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T354249 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[10:19:08] SRE, ops-codfw: Degraded RAID on logstash2033 - https://phabricator.wikimedia.org/T354249 (ops-monitoring-bot)
[10:23:29] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on dumpsdata1006 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T354250 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[10:23:29] SRE, ops-eqiad: Degraded RAID on dumpsdata1006 - https://phabricator.wikimedia.org/T354250 (ops-monitoring-bot)
[10:23:50] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[10:24:19] SRE, ops-eqiad: Degraded RAID on dumpsdata1006 - https://phabricator.wikimedia.org/T354143 (Volans)
[10:24:34] SRE, ops-eqiad: Degraded RAID on dumpsdata1006 - https://phabricator.wikimedia.org/T354250 (Volans)
[10:25:13] (CR) Filippo Giunchedi: [C: +1] grafana: Switch rsync service to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/987407 (owner: Muehlenhoff)
[10:26:34] ACKNOWLEDGEMENT - MD RAID on ganeti1031 is CRITICAL: CRITICAL: State: degraded, Active: 9, Working: 9, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T354251 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[10:26:35] SRE, ops-eqiad: Degraded RAID on ganeti1031 - https://phabricator.wikimedia.org/T354251 (ops-monitoring-bot)
[10:31:23]
RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:33:33] (PS1) Volans: raid handler: fix broken cases [puppet] - https://gerrit.wikimedia.org/r/987409
[10:34:20] (CR) CI reject: [V: -1] raid handler: fix broken cases [puppet] - https://gerrit.wikimedia.org/r/987409 (owner: Volans)
[10:35:11] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[10:35:12] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:37:22] (PS2) Volans: raid handler: fix broken cases [puppet] - https://gerrit.wikimedia.org/r/987409
[10:39:40] (CR) Clément Goubert: [C: +1] Add warning for OOM killer. (1 comment) [alerts] - https://gerrit.wikimedia.org/r/987398 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[10:40:01] (CR) Giuseppe Lavagetto: Fix timeouts detection on mw on k8s jobrunners (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/987400 (https://phabricator.wikimedia.org/T354229) (owner: Giuseppe Lavagetto)
[10:44:29] (CR) Giuseppe Lavagetto: [C: -1] "thanks for the patch, but I think for now it can be limited to the patch to __init__.py" [software/conftool] - https://gerrit.wikimedia.org/r/987167 (https://phabricator.wikimedia.org/T354209) (owner: Majavah)
[10:46:24] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[10:46:25] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:47:06] (PS3) Majavah: cli: Fix IRC logging [software/conftool] - https://gerrit.wikimedia.org/r/987167 (https://phabricator.wikimedia.org/T354209)
[10:47:08] (PS2) Majavah: tox: show black diff on failure [software/conftool] - https://gerrit.wikimedia.org/r/987170
[10:47:26] (CR) Majavah:
cli: Fix IRC logging (1 comment) [software/conftool] - https://gerrit.wikimedia.org/r/987167 (https://phabricator.wikimedia.org/T354209) (owner: Majavah)
[10:48:07] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[10:48:08] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:51:41] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[10:51:42] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:53:33] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[10:56:08] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[10:58:16] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/987400 (https://phabricator.wikimedia.org/T354229) (owner: Giuseppe Lavagetto)
[10:59:02] _joe_: hi, sorry, i just saw your ping. patch looks good to me; thanks for getting to the task so quickly!
:)
[10:59:18] <_joe_> urbanecm: ack
[10:59:25] <_joe_> I'll deploy in a few
[10:59:32] ack, sounds good
[11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240103T1100)
[11:01:44] (CR) Alexandros Kosiaris: [C: +1] Fix timeouts detection on mw on k8s jobrunners [mediawiki-config] - https://gerrit.wikimedia.org/r/987400 (https://phabricator.wikimedia.org/T354229) (owner: Giuseppe Lavagetto)
[11:04:55] SRE, ops-eqiad: Degraded RAID on db1168 - https://phabricator.wikimedia.org/T354253 (ops-monitoring-bot)
[11:05:04] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[11:05:42] SRE, ops-eqiad: Degraded RAID on db1168 - https://phabricator.wikimedia.org/T354253 (Volans) Open→Invalid This was a test to ensure the raid handler fix is working. Sorry for the noise.
[11:05:52] <_joe_> ok I guess the time is right to deploy my patches
[11:06:45] SRE, Infrastructure-Foundations, Observability-Alerting: get-raid-status-perccli not working as expected - https://phabricator.wikimedia.org/T354254 (Volans) p:Triage→Medium
[11:07:42] (CR) TrainBranchBot: [C: +2] "Approved by oblivian@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/951049 (owner: Giuseppe Lavagetto)
[11:08:21] (CR) Btullis: [V: +1 C: +2] Bring dbstore1009 into service [puppet] - https://gerrit.wikimedia.org/r/987157 (https://phabricator.wikimedia.org/T351924) (owner: Btullis)
[11:08:25] (Merged) jenkins-bot: Disable things that don't work on k8s when on k8s [mediawiki-config] - https://gerrit.wikimedia.org/r/951049 (owner: Giuseppe Lavagetto)
[11:08:35] (PS3) Volans: raid handler: fix broken cases [puppet] - https://gerrit.wikimedia.org/r/987409
[11:10:43] (CR) Alexandros Kosiaris: [C: -1] "Nice idea. Thanks for that. Couple of comments inline."
[alerts] - 10https://gerrit.wikimedia.org/r/987398 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [11:10:46] <_joe_> urbanecm: uhm I might need help [11:11:04] <_joe_> scap backport tells me it found Iaa0f3fb3cb798ae7a4f9e0f6259fbe6fdc3d30cd still needs to be merged [11:11:23] <_joe_> "11:09:35 The following are unexpected commits pulled from origin for /srv/mediawiki-staging/php-1.42.0-wmf.12" [11:14:11] hmmm [11:15:28] <_joe_> taavi: the backport patch was merged this morning at 6 am our time and I guess never deployed [11:15:49] yeah, and samwilson is not here, let's just revert it [11:15:59] (03PS1) 10Majavah: Revert "Edit Recovery: Add config.json to special page and postEdit" [core] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/986661 [11:16:21] !log oblivian@deploy2002 Started scap: Backport for [[gerrit:951049|Disable things that don't work on k8s when on k8s]] [11:16:33] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/986661/ [11:18:11] !log oblivian@deploy2002 oblivian: Backport for [[gerrit:951049|Disable things that don't work on k8s when on k8s]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:18:42] (03Abandoned) 10Majavah: Revert "Edit Recovery: Add config.json to special page and postEdit" [core] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/986661 (owner: 10Majavah) [11:20:32] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [11:20:35] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:21:22] (03PS1) 10Muehlenhoff: mwlog: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987415 [11:21:43] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [11:21:45] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:22:57] !log pfischer@deploy2002 
helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [11:22:59] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:23:36] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [11:23:38] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:23:53] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [11:23:55] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:24:17] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [11:24:19] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:24:35] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [11:24:37] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:25:08] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [11:25:25] !log oblivian@deploy2002 oblivian: Continuing with sync [11:28:03] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/987409 (owner: 10Volans) [11:28:17] (03PS1) 10Stang: ganwiki: Add transwiki import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987417 (https://phabricator.wikimedia.org/T354000) [11:29:04] (03CR) 10Muehlenhoff: [C: 03+2] grafana: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987407 (owner: 10Muehlenhoff) [11:29:21] (03CR) 10CI reject: [V: 04-1] ganwiki: Add transwiki import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987417 (https://phabricator.wikimedia.org/T354000) (owner: 10Stang) [11:30:48] (03PS2) 10Stang: ganwiki: Add transwiki import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987417 (https://phabricator.wikimedia.org/T354000) [11:31:24] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/987415 (owner: 10Muehlenhoff) [11:31:51] !log oblivian@deploy2002 Finished scap: Backport for [[gerrit:951049|Disable things that don't work on k8s when on k8s]] (duration: 15m 29s) [11:33:43] `Disable things that don't work on k8s when on k8s` is beautifully named, thank you :p [11:34:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by oblivian@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987400 (https://phabricator.wikimedia.org/T354229) (owner: 10Giuseppe Lavagetto) [11:35:41] (03Merged) 10jenkins-bot: Fix timeouts detection on mw on k8s jobrunners [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987400 (https://phabricator.wikimedia.org/T354229) (owner: 10Giuseppe Lavagetto) [11:36:02] 10sre-alert-triage, 10SRE Observability: Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T354255 (10LSobanski) [11:36:05] !log oblivian@deploy2002 Started scap: Backport for [[gerrit:987400|Fix timeouts detection on mw on k8s jobrunners (T354229)]] [11:36:09] T354229: Moving jobs to MW-on-k8s decreased their 
timeout from 1200s to 200s - https://phabricator.wikimedia.org/T354229 [11:36:33] (03PS1) 10Btullis: Enable monitoring for dbstore1008 [puppet] - 10https://gerrit.wikimedia.org/r/987419 (https://phabricator.wikimedia.org/T351921) [11:37:10] (03CR) 10Muehlenhoff: [C: 03+2] Switch testreduce to nftables [puppet] - 10https://gerrit.wikimedia.org/r/984159 (owner: 10Muehlenhoff) [11:37:48] !log oblivian@deploy2002 oblivian: Backport for [[gerrit:987400|Fix timeouts detection on mw on k8s jobrunners (T354229)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:37:49] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1015/console" [puppet] - 10https://gerrit.wikimedia.org/r/987419 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis) [11:39:46] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [11:40:08] (03PS2) 10Muehlenhoff: piwik: Configure tlsproxy::envoy::firewall_srange [puppet] - 10https://gerrit.wikimedia.org/r/984163 [11:40:53] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [11:41:36] !log oblivian@deploy2002 oblivian: Continuing with sync [11:41:52] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:44:11] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [11:44:21] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:47:43] !log oblivian@deploy2002 Finished scap: Backport for [[gerrit:987400|Fix timeouts detection on mw on k8s jobrunners (T354229)]] (duration: 11m 38s) [11:47:47] T354229: Moving jobs to MW-on-k8s decreased their timeout from 1200s to 200s - https://phabricator.wikimedia.org/T354229 [11:52:45] (03PS1) 10Btullis: Disable monitoring on dbstore1003 
[puppet] - 10https://gerrit.wikimedia.org/r/987420 (https://phabricator.wikimedia.org/T351921) [12:01:54] !log installing gnutls28 security updates on buster [12:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:08] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [12:02:31] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:04:46] 10SRE: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 (10MatthewVernon) [12:04:58] 10SRE: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 (10MatthewVernon) [removing swift-storage tag as none of the relevant swift nodes are still in production] [12:06:45] (03CR) 10Alexandros Kosiaris: [C: 04-1] jaeger: add oauth2-proxy sidecar (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/984143 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [12:08:55] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-codfw [12:13:44] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [12:14:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-codfw [12:14:03] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:18:24] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-eqiad [12:23:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad [12:24:15] (03PS2) 10Slyngshede: Add warning for OOM killer. 
[alerts] - 10https://gerrit.wikimedia.org/r/987398 (https://phabricator.wikimedia.org/T350694) [12:26:30] (03CR) 10Slyngshede: Add warning for OOM killer. (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/987398 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [12:28:17] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot-master rolling restart_daemons on A:maps-master [12:29:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot-master (exit_code=0) rolling restart_daemons on A:maps-master [12:32:56] 10SRE, 10MW-on-K8s, 10WMF-JobQueue, 10serviceops: Moving jobs to MW-on-k8s decreased their timeout from 1200s to 200s - https://phabricator.wikimedia.org/T354229 (10Urbanecm_WMF) 05Open→03Resolved a:03Joe [12:34:00] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [12:34:35] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:04:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:09:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:25:59] 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting: get-raid-status-perccli not working as expected - https://phabricator.wikimedia.org/T354254 (10SLyngshede-WMF) 
a:03SLyngshede-WMF I believe I wrote this, so I'll fix it :-) [13:26:03] 10SRE, 10Data-Platform-SRE, 10serviceops, 10Discovery-Search (Current work): SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (10brouberol) ` brouberol@kafka-test1010:~$ kafka topics --topic codfw.cirrussearch.update_pipeline.update.rc0 --alter --partitions 5 kafka-top... [13:26:59] (PuppetFailure) firing: Puppet has failed on testreduce1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:27:02] 10SRE, 10Data-Platform-SRE, 10serviceops, 10Discovery-Search (Current work): SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (10brouberol) [13:28:09] (03CR) 10Filippo Giunchedi: [C: 03+1] mwlog: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987415 (owner: 10Muehlenhoff) [13:28:30] (03CR) 10Filippo Giunchedi: [C: 03+1] raid handler: fix broken cases [puppet] - 10https://gerrit.wikimedia.org/r/987409 (owner: 10Volans) [13:29:04] !log installing Java 8/11 security updates [13:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:32] PROBLEM - Check systemd state on an-worker1103 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:17] (03PS1) 10Btullis: Switch s7-analytics-replica to dbstore1008 [dns] - 10https://gerrit.wikimedia.org/r/987425 (https://phabricator.wikimedia.org/T351921) [13:31:18] PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:31:19] (03PS1) 10Btullis: Switch 
s5-analytics-replica to dbstore1008 [dns] - 10https://gerrit.wikimedia.org/r/987426 (https://phabricator.wikimedia.org/T351921) [13:31:21] (03PS1) 10Btullis: Switch s1-analytics-replica to dbstore1008 [dns] - 10https://gerrit.wikimedia.org/r/987427 (https://phabricator.wikimedia.org/T351921) [13:31:30] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Nick Ifeajika out of all services on: 2220 hosts [13:32:16] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Nick Ifeajika out of all services on: 2220 hosts [13:32:37] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on logstash2033:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [13:33:49] (03CR) 10Marostegui: [C: 03+1] Enable monitoring for dbstore1008 [puppet] - 10https://gerrit.wikimedia.org/r/987419 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis) [13:34:04] (03CR) 10Btullis: [V: 03+1 C: 03+2] Enable monitoring for dbstore1008 [puppet] - 10https://gerrit.wikimedia.org/r/987419 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis) [13:34:50] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: user-runtime-dir@43623.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:35:12] RECOVERY - Check systemd state on an-worker1103 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:35:58] RECOVERY - Hadoop NodeManager on an-worker1103 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:36:12] (03CR) 10Marostegui: [C: 03+1] Switch s7-analytics-replica to dbstore1008 [dns] - 
10https://gerrit.wikimedia.org/r/987425 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis) [13:36:39] (03CR) 10Marostegui: [C: 03+1] Switch s1-analytics-replica to dbstore1008 [dns] - 10https://gerrit.wikimedia.org/r/987427 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis) [13:36:51] (03CR) 10Marostegui: [C: 03+1] Switch s5-analytics-replica to dbstore1008 [dns] - 10https://gerrit.wikimedia.org/r/987426 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis) [13:43:53] jouncebot: nowandnext [13:43:53] No deployments scheduled for the next 0 hour(s) and 16 minute(s) [13:43:53] In 0 hour(s) and 16 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240103T1400) [13:45:28] (03PS1) 10Samtar: Edit Recovery: fix typo in expiry field name [core] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/986662 (https://phabricator.wikimedia.org/T347673) [13:55:04] (03PS3) 10Samtar: ganwiki: Add transwiki import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987417 (https://phabricator.wikimedia.org/T354000) (owner: 10Stang) [13:55:10] (03PS2) 10Samtar: zhwikivoyage: Enable block feature for abusefilter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985376 (https://phabricator.wikimedia.org/T353604) (owner: 10Stang) [13:56:55] koi: ready a little early? [13:59:10] yep :) [13:59:31] going to run both your config patches together [13:59:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985376 (https://phabricator.wikimedia.org/T353604) (owner: 10Stang) [13:59:40] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987417 (https://phabricator.wikimedia.org/T354000) (owner: 10Stang) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: May I have your attention please! 
UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240103T1400) [14:00:05] koi and TheresNoTime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:07] * TheresNoTime is deploying [14:00:17] go ahead :) [14:00:28] (03Merged) 10jenkins-bot: zhwikivoyage: Enable block feature for abusefilter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985376 (https://phabricator.wikimedia.org/T353604) (owner: 10Stang) [14:00:32] (03Merged) 10jenkins-bot: ganwiki: Add transwiki import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987417 (https://phabricator.wikimedia.org/T354000) (owner: 10Stang) [14:00:40] * urbanecm is being disliked by Okta and got kicked out [14:01:00] !log samtar@deploy2002 Started scap: Backport for [[gerrit:985376|zhwikivoyage: Enable block feature for abusefilter (T353604)]], [[gerrit:987417|ganwiki: Add transwiki import sources (T354000)]] [14:01:06] T353604: Enable blocking feature of abuse filter in zhwikivoyage - https://phabricator.wikimedia.org/T353604 [14:01:07] T354000: Add transwiki import sources for gan.wikipedia - https://phabricator.wikimedia.org/T354000 [14:01:20] Okta — security by just always logging you out(tm) [14:01:43] (03PS7) 10Majavah: admin: POC: allow using security key backed SSH keys [puppet] - 10https://gerrit.wikimedia.org/r/981418 [14:01:45] (03PS5) 10Majavah: admin: add security key based keys for taavi [puppet] - 10https://gerrit.wikimedia.org/r/983430 [14:02:11] (03PS8) 10Majavah: admin: allow using security key backed SSH keys [puppet] - 10https://gerrit.wikimedia.org/r/981418 [14:02:15] (03PS6) 10Majavah: admin: add security key based keys for taavi [puppet] - 10https://gerrit.wikimedia.org/r/983430 [14:02:28] !log samtar@deploy2002 samtar and stang: Backport for [[gerrit:985376|zhwikivoyage: Enable block 
feature for abusefilter (T353604)]], [[gerrit:987417|ganwiki: Add transwiki import sources (T354000)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:02:36] koi: both ready for testing [14:02:42] looking [14:03:48] !log installing qemu security updates [14:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:22] (03PS11) 10Brouberol: external-services: define a chart referencing external kafka/zookeeper clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) [14:04:53] TheresNoTime, both tested and lgtm [14:05:00] !log samtar@deploy2002 samtar and stang: Continuing with sync [14:05:33] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:06:14] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:06:33] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:06:47] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:06:59] (PuppetFailure) resolved: Puppet has failed on testreduce1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:07:24] (03CR) 10Samtar: [C: 03+2] "prep for deploy" [core] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/986662 (https://phabricator.wikimedia.org/T347673) (owner: 10Samtar) [14:10:58] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:985376|zhwikivoyage: Enable block feature for abusefilter (T353604)]], [[gerrit:987417|ganwiki: Add transwiki import sources (T354000)]] (duration: 09m 58s) [14:11:06] T353604: Enable blocking feature of abuse filter in zhwikivoyage - https://phabricator.wikimedia.org/T353604 [14:11:06] koi: live on prod :) [14:11:06] T354000: Add 
transwiki import sources for gan.wikipedia - https://phabricator.wikimedia.org/T354000 [14:11:13] ty [14:11:24] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add warning for OOM killer. [alerts] - 10https://gerrit.wikimedia.org/r/987398 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [14:11:33] * TheresNoTime now waits for https://gerrit.wikimedia.org/r/c/986662/ to merge [14:15:37] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.3/12.4 point update - https://phabricator.wikimedia.org/T353057 (10MoritzMuehlenhoff) [14:15:40] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T354276 (10Dima) [14:16:51] (03CR) 10Alexandros Kosiaris: [C: 03+1] Use shellbox for djvu handling on kubernetes (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987032 (https://phabricator.wikimedia.org/T352515) (owner: 10Giuseppe Lavagetto) [14:17:31] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:17:45] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:18:02] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:18:11] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:20:01] TheresNoTime: I have 2 patches for logo change, is it possible to deploy [14:20:09] 10SRE, 10Infrastructure-Foundations: Processing of config file includes broken in Buster / nftables 0.9.0 - https://phabricator.wikimedia.org/T354279 (10MoritzMuehlenhoff) [14:20:42] anzx: can you add them to the calendar and I can do them after my core backport [14:20:51] Ok [14:20:59] (03PS5) 10Anzx: aswikiquote: change wordmark and update logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985389 (https://phabricator.wikimedia.org/T353934) [14:21:07] (03PS4) 10Anzx: zhwikinews: update wordmark [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/986658 (https://phabricator.wikimedia.org/T353792) [14:21:50] TheresNoTime: added to calendar [14:22:01] thanks, will get to them shortly! [14:26:23] (03Merged) 10jenkins-bot: Edit Recovery: fix typo in expiry field name [core] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/986662 (https://phabricator.wikimedia.org/T347673) (owner: 10Samtar) [14:27:02] !log samtar@deploy2002 Started scap: Backport for [[gerrit:986662|Edit Recovery: fix typo in expiry field name (T347673)]] [14:27:06] T347673: Create special page to list all recovery data - https://phabricator.wikimedia.org/T347673 [14:27:10] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:28:30] !log samtar@deploy2002 samtar: Backport for [[gerrit:986662|Edit Recovery: fix typo in expiry field name (T347673)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:28:33] * TheresNoTime testing [14:28:55] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:29:01] !log samtar@deploy2002 samtar: Continuing with sync [14:31:23] not related to this deploy, but there's a lot of noise in the mwdebug logstash dashboard [14:32:09] (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:34:39] anzx: ready for your patches? 
[14:34:46] Yes [14:34:48] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:986662|Edit Recovery: fix typo in expiry field name (T347673)]] (duration: 07m 46s) [14:34:52] T347673: Create special page to list all recovery data - https://phabricator.wikimedia.org/T347673 [14:34:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985389 (https://phabricator.wikimedia.org/T353934) (owner: 10Anzx) [14:35:40] (03Merged) 10jenkins-bot: aswikiquote: change wordmark and update logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985389 (https://phabricator.wikimedia.org/T353934) (owner: 10Anzx) [14:36:05] !log samtar@deploy2002 Started scap: Backport for [[gerrit:985389|aswikiquote: change wordmark and update logo (T353934)]] [14:36:14] T353934: Fix Assamese Wikiquote Logo - https://phabricator.wikimedia.org/T353934 [14:36:39] anzx: that one is ready for testing [14:37:03] (03PS5) 10Samtar: zhwikinews: update wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986658 (https://phabricator.wikimedia.org/T353792) (owner: 10Anzx) [14:37:09] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:31] Ok, testing [14:37:36] !log samtar@deploy2002 samtar and anzx: Backport for [[gerrit:985389|aswikiquote: change wordmark and update logo (T353934)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:38:11] TheresNoTime: looks good [14:38:16] !log samtar@deploy2002 samtar and anzx: Continuing with sync [14:42:40] (03PS1) 10Slyngshede: P:puppet::client_bucket Start moving monitoring to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/987431 (https://phabricator.wikimedia.org/T350694) [14:43:57] !log 
samtar@deploy2002 Finished scap: Backport for [[gerrit:985389|aswikiquote: change wordmark and update logo (T353934)]] (duration: 07m 51s) [14:44:05] T353934: Fix Assamese Wikiquote Logo - https://phabricator.wikimedia.org/T353934 [14:44:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986658 (https://phabricator.wikimedia.org/T353792) (owner: 10Anzx) [14:44:31] anzx: one done, moving onto the next [14:44:37] Ok [14:44:54] (03PS1) 10Giuseppe Lavagetto: Explicitly disable all local imagescaling on k8s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987432 (https://phabricator.wikimedia.org/T352515) [14:44:56] (03Merged) 10jenkins-bot: zhwikinews: update wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986658 (https://phabricator.wikimedia.org/T353792) (owner: 10Anzx) [14:45:06] (03CR) 10CI reject: [V: 04-1] Explicitly disable all local imagescaling on k8s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987432 (https://phabricator.wikimedia.org/T352515) (owner: 10Giuseppe Lavagetto) [14:45:18] !log samtar@deploy2002 Started scap: Backport for [[gerrit:986658|zhwikinews: update wordmark (T353792)]] [14:45:22] T353792: Logo of zh WikiNews has background color instead of alpha channel (visible in Minerva) - https://phabricator.wikimedia.org/T353792 [14:46:53] !log samtar@deploy2002 anzx and samtar: Backport for [[gerrit:986658|zhwikinews: update wordmark (T353792)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:47:01] Checking [14:47:06] :) [14:48:31] TheresNoTime: looks good [14:48:36] !log samtar@deploy2002 anzx and samtar: Continuing with sync [14:49:10] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1022/co" [puppet] - 10https://gerrit.wikimedia.org/r/987431 (https://phabricator.wikimedia.org/T350694) 
(owner: 10Slyngshede) [14:49:38] 10SRE, 10ops-eqiad: Degraded RAID on ganeti1031 - https://phabricator.wikimedia.org/T354251 (10Jclark-ctr) a:03Jclark-ctr [14:54:30] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:986658|zhwikinews: update wordmark (T353792)]] (duration: 09m 11s) [14:54:34] T353792: Logo of zh WikiNews has background color instead of alpha channel (visible in Minerva) - https://phabricator.wikimedia.org/T353792 [14:54:49] anzx: both should be live now, I've cleared some caches [14:55:25] (03CR) 10Muehlenhoff: [C: 03+2] mwlog: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987415 (owner: 10Muehlenhoff) [14:55:38] !log UTC afternoon backport window done [14:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:10] TheresNoTime: Thank you [14:57:09] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:31] (03PS1) 10Hashar: Remove archiva.wikimedia.org [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/987434 (https://phabricator.wikimedia.org/T333465) [15:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240103T1500) [15:01:52] (03CR) 10Hashar: [C: 03+2] Remove archiva.wikimedia.org [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/987434 (https://phabricator.wikimedia.org/T333465) (owner: 10Hashar) [15:02:26] (03Merged) 10jenkins-bot: Remove archiva.wikimedia.org [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/987434 (https://phabricator.wikimedia.org/T333465) (owner: 10Hashar) [15:09:02] (03PS1) 10Muehlenhoff: releases: Switch rsync service to use firewall::service [puppet] - 
10https://gerrit.wikimedia.org/r/987436 [15:20:48] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/987436 (owner: 10Muehlenhoff) [15:22:17] (03PS1) 10Stevemunene: edit*-analytics: update mediawiki_history snapshot version [deployment-charts] - 10https://gerrit.wikimedia.org/r/987437 [15:24:50] (03CR) 10Btullis: [C: 03+1] "Looks good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/987437 (owner: 10Stevemunene) [15:26:39] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:28:01] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.324 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:35:35] (03CR) 10Volans: "couple of suggestions inline" [puppet] - 10https://gerrit.wikimedia.org/r/987431 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [15:39:36] !log rebuild md RAIDs after disk swap T353324 [15:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:39] T353324: SMART errors on ganeti1031 - https://phabricator.wikimedia.org/T353324 [15:53:10] (03PS1) 10Hashar: Merge tag 'v3.6.8' into wmf/stable-3.6 [software/gerrit] (wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/987438 (https://phabricator.wikimedia.org/T309870) [15:58:14] (03PS1) 10Muehlenhoff: nftables: On Buster install nftables and libnftnl from backports [puppet] - 10https://gerrit.wikimedia.org/r/987439 (https://phabricator.wikimedia.org/T354279) [15:58:41] (03CR) 10CI reject: [V: 04-1] nftables: On Buster install nftables and libnftnl from backports [puppet] - 10https://gerrit.wikimedia.org/r/987439 (https://phabricator.wikimedia.org/T354279) (owner: 10Muehlenhoff) [15:59:25] (03CR) 10CI reject: [V: 04-1] Merge tag 'v3.6.8' into wmf/stable-3.6 [software/gerrit] (wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/987438 
(https://phabricator.wikimedia.org/T309870) (owner: 10Hashar) [15:59:55] (03CR) 10Milimetric: [C: 03+2] edit*-analytics: update mediawiki_history snapshot version [deployment-charts] - 10https://gerrit.wikimedia.org/r/987437 (owner: 10Stevemunene) [16:01:02] (03Merged) 10jenkins-bot: edit*-analytics: update mediawiki_history snapshot version [deployment-charts] - 10https://gerrit.wikimedia.org/r/987437 (owner: 10Stevemunene) [16:01:58] 10SRE, 10LDAP-Access-Requests: Grant Access to wmde, nda for Dima Koushha - https://phabricator.wikimedia.org/T354276 (10WMDE-leszek) [16:02:13] (03PS2) 10Muehlenhoff: nftables: On Buster install nftables and libnftnl from backports [puppet] - 10https://gerrit.wikimedia.org/r/987439 (https://phabricator.wikimedia.org/T354279) [16:09:40] (03PS2) 10Hashar: Merge tag 'v3.6.8' into wmf/stable-3.6 [software/gerrit] (wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/987438 (https://phabricator.wikimedia.org/T309870) [16:10:01] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp3066.esams.wmnet} and A:cp [16:10:32] (03PS1) 10Clément Goubert: prometheus-php-fpm-exporter [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/987440 [16:11:43] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp3066.esams.wmnet} and A:cp [16:12:55] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/987439 (https://phabricator.wikimedia.org/T354279) (owner: 10Muehlenhoff) [16:16:38] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_ulsfo and not P{cp4050.ulsfo.wmnet,cp4051.ulsfo.wmnet} and A:cp [16:16:39] (03CR) 10CI reject: [V: 04-1] Merge tag 'v3.6.8' into wmf/stable-3.6 [software/gerrit] (wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/987438 (https://phabricator.wikimedia.org/T309870) (owner: 10Hashar) [16:19:40] 
(03Abandoned) 10Andrea Denisse: quickdatacopy: Add support to open files with O_NOATIME [puppet] - 10https://gerrit.wikimedia.org/r/889294 (https://phabricator.wikimedia.org/T329695) (owner: 10Andrea Denisse) [16:20:11] 10SRE, 10Patch-For-Review: Rsync quickdatacopy copies files with atime creating a huge number of iops and a slow sync - https://phabricator.wikimedia.org/T329695 (10andrea.denisse) 05Open→03Declined [16:20:58] (03PS2) 10Clément Goubert: prometheus-php-fpm-exporter [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/987440 [16:21:34] (03CR) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (0310 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [16:22:45] (03PS3) 10Ebernhardson: team-search-platform: Update job queue alerts to use histogram [alerts] - 10https://gerrit.wikimedia.org/r/987206 [16:22:50] (03PS1) 10Clément Goubert: prometheus-apache-exporter: Update to bullseye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/987443 (https://phabricator.wikimedia.org/T283861) [16:22:51] !log stevemunene@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [16:23:21] !log stevemunene@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [16:23:21] (03PS8) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) [16:23:46] 10SRE-OnFire, 10Observability-Alerting, 10SRE Observability (FY2023/2024-Q2), 10Sustainability (Incident Followup): Add a default rsyslog destination for all sites - https://phabricator.wikimedia.org/T336448 (10andrea.denisse) 05In progress→03Resolved [16:24:02] (03CR) 10CI reject: [V: 04-1] Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [16:24:07] !log stevemunene@deploy2002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [16:24:31] (03PS9) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) [16:24:32] !log stevemunene@deploy2002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply [16:25:01] !log stevemunene@deploy2002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [16:25:24] !log stevemunene@deploy2002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [16:26:11] !log stevemunene@deploy2002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [16:26:30] !log stevemunene@deploy2002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [16:26:36] (03CR) 10CI reject: [V: 04-1] Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [16:26:51] !log stevemunene@deploy2002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply [16:27:09] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:27:11] !log stevemunene@deploy2002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply [16:27:31] !log stevemunene@deploy2002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply [16:27:52] !log stevemunene@deploy2002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [16:28:41] (03CR) 10Ottomata: Create 
eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [16:29:09] (03PS10) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) [16:30:24] (03CR) 10CI reject: [V: 04-1] Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [16:30:45] 10SRE-swift-storage: Q3 ms backend refresh work - https://phabricator.wikimedia.org/T353149 (10MatthewVernon) Prep work (make sure all fs' mounted correctly) done on ms-be10[76-82], three nodes had an FS unhappy from the install. [16:30:54] (03PS11) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) [16:32:09] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:33:18] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_ulsfo and not P{cp4050.ulsfo.wmnet,cp4051.ulsfo.wmnet} and A:cp [16:34:37] (03PS1) 10Marostegui: update_zarcillo: Push to the repo [software] - 10https://gerrit.wikimedia.org/r/987445 [16:35:17] (03CR) 10Marostegui: "I have been using this for ages, but I just realise I never sent it to the repo" [software] - 10https://gerrit.wikimedia.org/r/987445 (owner: 10Marostegui) [16:42:39] 
10SRE-swift-storage: Q3 ms backend refresh work - https://phabricator.wikimedia.org/T353149 (10MatthewVernon) Similarly, 3 unhappy nodes in codfw from the install, all done now. [16:45:23] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_ulsfo and not P{cp4044.ulsfo.wmnet} and A:cp [16:48:56] (03PS2) 10Clément Goubert: prometheus-apache-exporter: Update to bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/987443 (https://phabricator.wikimedia.org/T283861) [16:54:28] (03PS1) 10MVernon: swift: add new storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/987448 (https://phabricator.wikimedia.org/T353149) [16:56:15] 10SRE, 10ops-codfw: Inbound interface errors - ge-6/0/22 - db2099 - https://phabricator.wikimedia.org/T354155 (10Papaul) 05Open→03Resolved a:03Papaul We know about this [16:59:27] (03CR) 10Hashar: "That fails with:" [software/gerrit] (wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/987438 (https://phabricator.wikimedia.org/T309870) (owner: 10Hashar) [17:01:58] 10SRE, 10ops-codfw: Degraded RAID on logstash2033 - https://phabricator.wikimedia.org/T354249 (10colewhite) p:05Triage→03High The cluster will remain in a degraded state until replacements are installed. Please replace the failed disks as soon as possible. Thanks! 
[17:05:49] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:25] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:06:59] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_ulsfo and not P{cp4044.ulsfo.wmnet} and A:cp [17:12:02] (03PS3) 10Hashar: Merge tag 'v3.6.8' into wmf/stable-3.6 [software/gerrit] (wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/987438 (https://phabricator.wikimedia.org/T309870) [17:14:14] 10SRE, 10ops-codfw: Degraded RAID on logstash2033 - https://phabricator.wikimedia.org/T354249 (10Papaul) @colewhite unfortunately this server is out of warranty since 2023-11-18. You have 2 options 1- See if we have some 1.92 TB SSD's from decom nodes that we can use 2- Purchase 1.92TB SSD's [17:16:17] (03CR) 10Marostegui: [C: 03+1] swift: add new storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/987448 (https://phabricator.wikimedia.org/T353149) (owner: 10MVernon) [17:17:01] (03PS1) 10Hashar: [WMF] make wmf-build.py unbuffered [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/987453 [17:23:48] 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10Papaul) Multi-bit memory errors detected on a memory device at location(s) DIMM_B1. Sun 31 Dec 2023 19:43:14 Multi-bit memory errors detected on a memory device at location(s) DIMM_B1. Sun 31 Dec 2023 19:... [17:25:05] 10SRE, 10ops-codfw: Degraded RAID on logstash2033 - https://phabricator.wikimedia.org/T354249 (10colewhite) I'm ok with either of those options. If you happen to have one available, let's use it. 
[17:26:18] (03PS2) 10Hashar: [WMF] make wmf-build.py unbuffered [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/987453 [17:28:37] RECOVERY - Host mw2394 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms [17:32:03] PROBLEM - Host mw2394 is DOWN: PING CRITICAL - Packet loss = 100% [17:32:37] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on logstash2033:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [17:32:48] (03CR) 10Hashar: [C: 03+2] [WMF] make wmf-build.py unbuffered [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/987453 (owner: 10Hashar) [17:35:48] (03CR) 10BCornwall: [C: 03+2] wmf-debci: Also create man1 dir [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/984297 (https://phabricator.wikimedia.org/T352003) (owner: 10BCornwall) [17:35:50] (03CR) 10BCornwall: [V: 03+2 C: 03+2] wmf-debci: Also create man1 dir [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/984297 (https://phabricator.wikimedia.org/T352003) (owner: 10BCornwall) [17:37:24] (03CR) 10Urbanecm: [C: 03+1] Add "patroller" user group to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986200 (https://phabricator.wikimedia.org/T354063) (owner: 10Novem Linguae) [17:38:34] (03Merged) 10jenkins-bot: [WMF] make wmf-build.py unbuffered [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/987453 (owner: 10Hashar) [17:45:11] Hi, I can't login on MediaWiki.org, although I'm supposed to be logged in automatically, because I'm logged in on Serbian Wikipedia. 
[17:45:17] I'm having this error [17:45:18] [528758a3-5ec2-470f-92fb-a70423728575] 2024-01-03 17:44:16: Fatal exception of type "Wikimedia\Assert\PreconditionException" [17:49:19] Wikimedia\Assert\PreconditionException: Expected MediaWiki\Block\AbstractBlock to belong to the local wiki, but it belongs to 'enwiki' [17:49:54] Presumably T353620 [17:49:55] T353620: Wikimedia\Assert\PreconditionException: Expected MediaWiki\Block\AbstractBlock to belong to the local wiki, but it belongs to 'commonswiki' - https://phabricator.wikimedia.org/T353620 [17:51:47] I tried to login to Wikidata, and it worked. And I'm logged in on MediaWiki.org now as well. [17:52:04] Reedy: seems unrelated and possibly caused by https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/974718 [17:55:16] Kizule: please file a task [17:55:44] same stacktrace as https://phabricator.wikimedia.org/T353620 [17:55:46] seems rollback worthy to me [17:55:56] seems like they're happening more since yesterday, mostly on k8s hosts [17:56:47] thcipriani: I see a different one, coming from CentralAuthUser->localUserData [17:57:23] taavi: oh, you're right [17:57:30] I don't think that I need to fill the another one, which will be closed as a duplicate eventually. T353620 it is. 
[17:57:31] T353620: Wikimedia\Assert\PreconditionException: Expected MediaWiki\Block\AbstractBlock to belong to the local wiki, but it belongs to 'commonswiki' - https://phabricator.wikimedia.org/T353620 [17:57:50] lemme file it with phatality [17:58:30] There's two stack traces for that has [17:58:53] hash [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240103T1800) [18:00:49] yeah, seeing as it's only affecting wmf.12, i will rollback [18:01:37] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:02:15] yeah I think we have two separate issues with similar error messages. this one is new and a blocker and probably caused by the CA patch I just linked [18:02:41] (03PS1) 10TrainBranchBot: testwikis wikis to 1.42.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987457 (https://phabricator.wikimedia.org/T350088) [18:02:43] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.42.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987457 (https://phabricator.wikimedia.org/T350088) (owner: 10TrainBranchBot) [18:03:26] (03Merged) 10jenkins-bot: testwikis wikis to 1.42.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987457 (https://phabricator.wikimedia.org/T350088) (owner: 10TrainBranchBot) [18:03:46] !log dduvall@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.12 refs T350088 [18:03:50] T350088: 1.42.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T350088 [18:04:36] alright, filed one for CentralAuth: https://phabricator.wikimedia.org/T354298 [18:05:53] should https://phabricator.wikimedia.org/T353620 be a blocker as well? [18:06:25] that does not seem to be new in this train, can it block it? [18:07:21] those are two different issue, but they have the same underlying cause [18:07:27] ah i see. 
wmf.9 i won't block on it then [18:07:46] i guess making BlockUtils wiki-aware should do the trick, i can try to write something [18:07:49] I'd stick with the centralauth one for now, that seems to be the cause of the increased error rates. If they're both closed with the fix for the blocker: that's good. [18:09:39] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:11:10] !log dduvall@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.12 refs T350088 (duration: 07m 23s) [18:11:14] T350088: 1.42.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T350088 [18:12:22] ok. wmf.12 is back on testwikis only. someone let me know if testwikis should be rolled back as well [18:12:30] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [18:12:40] 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10Papaul) After swapping the CPU and DIMM now i am getting ` CPU 2 MEM012 VPP PG voltage is outside of range. Wed 03 Jan 2024 17:43:07 CPU 1 MEM012 VPP PG voltage is outside of range. ` and the server is n... [18:13:06] thanks dduvall , happy new year <3 [18:19:52] :D thanks for wrangling [18:20:08] and thanks Kizule for the report [18:21:03] 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10Papaul) ` Create Dispatch: Success You have successfully submitted request SR182660280. 
` [18:26:16] (03PS1) 10Dzahn: contint: use php7.4 on bullseye just like on buster [puppet] - 10https://gerrit.wikimedia.org/r/987458 (https://phabricator.wikimedia.org/T334517) [18:26:44] (03CR) 10CI reject: [V: 04-1] contint: use php7.4 on bullseye just like on buster [puppet] - 10https://gerrit.wikimedia.org/r/987458 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [18:27:02] !log running an essentially no-op phab2002 deploy [18:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:10] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:27:17] !log brennen@deploy2002 Started deploy [phabricator/deployment@369e797]: deploy to phab2002 for T334519 [18:27:20] T334519: upgrade phab (phorge) hosts to bullseye - https://phabricator.wikimedia.org/T334519 [18:27:25] (03PS2) 10Dzahn: contint: use php7.4 on bullseye just like on buster [puppet] - 10https://gerrit.wikimedia.org/r/987458 (https://phabricator.wikimedia.org/T334517) [18:27:44] !log brennen@deploy2002 Finished deploy [phabricator/deployment@369e797]: deploy to phab2002 for T334519 (duration: 00m 27s) [18:31:06] dduvall: You're welcome [18:32:10] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:38:11] 10SRE, 10ops-codfw: Degraded RAID on logstash2033 - https://phabricator.wikimedia.org/T354249 (10Papaul) 05Open→03Resolved a:03Papaul @colewhite disk replaced [18:42:22] (MDRAIDFailedDisk) resolved: MD RAID - Failed disk(s) on logstash2033:9100 - 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [18:44:58] 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10Dzahn) @MoritzMuehlenhoff When this is done, should I expect that there will be a `component/icu67` in distro `wikimedia-bullseye` just like there is now in distro `wikimedia-buster`? I am just wondering... [18:47:45] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Disk (sdh) failed in ms-be2068 - https://phabricator.wikimedia.org/T354180 (10Papaul) disk replaced [18:48:15] (03CR) 10Dzahn: contint: use php7.4 on bullseye just like on buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/987458 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [18:53:23] 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10MoritzMuehlenhoff) Bullseye has ICU 67 as the default ICU version, as such on Bullseye there will only be component/php74 and nothing else. [18:56:34] 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10Dzahn) Gotcha! thank you. I will amend my patch accordingly. [18:59:14] (03CR) 10Kamila Součková: [C: 03+2] Move mw api servers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/984645 (https://phabricator.wikimedia.org/T351074) (owner: 10Kamila Součková) [18:59:35] (03PS2) 10Kamila Součková: Move mw api servers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/984645 (https://phabricator.wikimedia.org/T351074) [19:00:05] dduvall and dancy: Your horoscope predicts another Train log triage with CPT deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240103T1900). [19:00:05] dduvall and dancy: OwO what's this, a deployment window?? 
MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240103T1900). nyaa~ [19:06:00] 10SRE, 10LDAP-Access-Requests: Grant Access to wmde, nda for Dima Koushha - https://phabricator.wikimedia.org/T354276 (10WMDE-leszek) [19:07:40] (03PS3) 10Dzahn: contint: use php7.4 on bullseye just like on buster [puppet] - 10https://gerrit.wikimedia.org/r/987458 (https://phabricator.wikimedia.org/T334517) [19:07:43] (03CR) 10Dzahn: contint: use php7.4 on bullseye just like on buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/987458 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [19:08:51] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2436.codfw.wmnet with OS bullseye [19:10:07] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2437.codfw.wmnet with OS bullseye [19:11:08] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1377.eqiad.wmnet with OS bullseye [19:11:38] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1378.eqiad.wmnet with OS bullseye [19:14:32] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/987402/1024/people1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/987402 (owner: 10Muehlenhoff) [19:16:59] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/987458 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [19:18:04] (03CR) 10ArielGlenn: add foundationwiki to the list of central auth login wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987138 (https://phabricator.wikimedia.org/T205347) (owner: 10ArielGlenn) [19:18:19] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2440.codfw.wmnet with OS bullseye [19:19:02] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2442.codfw.wmnet with OS bullseye [19:21:20] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage 
for host mw1379.eqiad.wmnet with OS bullseye [19:22:19] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1380.eqiad.wmnet with OS bullseye [19:25:03] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "looks good. what it actually does:" [puppet] - 10https://gerrit.wikimedia.org/r/987402 (owner: 10Muehlenhoff) [19:25:32] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1377.eqiad.wmnet with reason: host reimage [19:26:10] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1378.eqiad.wmnet with reason: host reimage [19:26:34] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2436.codfw.wmnet with reason: host reimage [19:28:07] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "tested rsync on people1004 pulling from people2002 - works fine" [puppet] - 10https://gerrit.wikimedia.org/r/987402 (owner: 10Muehlenhoff) [19:28:19] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1377.eqiad.wmnet with reason: host reimage [19:28:35] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2024 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:28:41] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2437.codfw.wmnet with reason: host reimage [19:29:19] PROBLEM - Check systemd state on kubernetes2024 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:43] (03CR) 10Dzahn: [C: 03+2] doc: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987406 (owner: 10Muehlenhoff) [19:31:04] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1378.eqiad.wmnet with reason: host reimage [19:32:22] !log kamila@cumin1002 START - Cookbook 
sre.hosts.reimage for host mw2443.codfw.wmnet with OS bullseye [19:33:22] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2450.codfw.wmnet with OS bullseye [19:33:42] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2437.codfw.wmnet with reason: host reimage [19:33:53] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw2451.codfw.wmnet with OS bullseye [19:34:35] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1381.eqiad.wmnet with OS bullseye [19:35:10] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1382.eqiad.wmnet with OS bullseye [19:35:40] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1379.eqiad.wmnet with reason: host reimage [19:35:40] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2440.codfw.wmnet with reason: host reimage [19:35:41] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1383.eqiad.wmnet with OS bullseye [19:36:25] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2436.codfw.wmnet with reason: host reimage [19:36:32] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2440.codfw.wmnet with reason: host reimage [19:36:48] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1380.eqiad.wmnet with reason: host reimage [19:37:05] (03CR) 10Dzahn: [C: 03+2] "fyi, this is what this did:" [puppet] - 10https://gerrit.wikimedia.org/r/987406 (owner: 10Muehlenhoff) [19:38:30] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2442.codfw.wmnet with reason: host reimage [19:39:40] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1379.eqiad.wmnet with reason: host reimage [19:39:44] !log root@doc2002: /usr/local/sbin/sync-doc-host-data-sync after gerrit:987406 [19:39:46] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:03] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1380.eqiad.wmnet with reason: host reimage [19:44:24] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2442.codfw.wmnet with reason: host reimage [19:46:03] (03PS4) 10Hashar: Merge tag 'v3.6.8' into wmf/stable-3.6 [software/gerrit] (wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/987438 (https://phabricator.wikimedia.org/T309870) [19:49:04] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1381.eqiad.wmnet with reason: host reimage [19:49:23] PROBLEM - Check systemd state on mw1465 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:49:31] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1382.eqiad.wmnet with reason: host reimage [19:49:55] PROBLEM - Host mw2440 is DOWN: PING CRITICAL - Packet loss = 100% [19:50:04] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1383.eqiad.wmnet with reason: host reimage [19:50:37] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2443.codfw.wmnet with reason: host reimage [19:50:57] (03PS1) 10Andrew Bogott: wmcs-novastats-dnsleaks: improve reporting for duplicate PTRs [puppet] - 10https://gerrit.wikimedia.org/r/987464 [19:51:24] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2450.codfw.wmnet with reason: host reimage [19:51:48] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2451.codfw.wmnet with reason: host reimage [19:51:50] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2451.codfw.wmnet with reason: host reimage [19:52:10] (03PS1) 10Andrew Bogott: wmcs admin scripts: run everything through Black [puppet] - 
10https://gerrit.wikimedia.org/r/987465 [19:52:18] (03CR) 10CI reject: [V: 04-1] Merge tag 'v3.6.8' into wmf/stable-3.6 [software/gerrit] (wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/987438 (https://phabricator.wikimedia.org/T309870) (owner: 10Hashar) [19:52:22] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1381.eqiad.wmnet with reason: host reimage [19:52:35] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-novastats-dnsleaks: improve reporting for duplicate PTRs [puppet] - 10https://gerrit.wikimedia.org/r/987464 (owner: 10Andrew Bogott) [19:52:54] RECOVERY - Host mw2440 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms [19:53:33] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2437.codfw.wmnet with OS bullseye [19:55:07] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2443.codfw.wmnet with reason: host reimage [19:55:38] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2440.codfw.wmnet with OS bullseye [19:57:07] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2450.codfw.wmnet with reason: host reimage [19:57:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1382.eqiad.wmnet with reason: host reimage [19:57:31] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2436.codfw.wmnet with OS bullseye [19:59:37] PROBLEM - Check whether ferm is active by checking the default input chain on mw1465 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:00:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1383.eqiad.wmnet with reason: host reimage [20:04:23] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host mw2442.codfw.wmnet with OS bullseye [20:06:51] PROBLEM - Host mw2451 is DOWN: PING CRITICAL - Packet loss = 100% [20:08:57] RECOVERY - Host mw2451 is UP: PING OK - Packet loss = 0%, RTA = 30.43 ms [20:09:12] (KubernetesCalicoDown) firing: mw2451.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2451.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:11:47] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2451.codfw.wmnet with OS bullseye [20:12:19] PROBLEM - Host mw2450 is DOWN: PING CRITICAL - Packet loss = 100% [20:13:03] RECOVERY - MD RAID on logstash2033 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [20:14:09] RECOVERY - Host mw2450 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms [20:14:12] (KubernetesCalicoDown) firing: (2) mw2450.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:15:45] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2443.codfw.wmnet with OS bullseye [20:17:18] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2450.codfw.wmnet with OS bullseye [20:19:12] (KubernetesCalicoDown) resolved: (2) mw2450.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:27:03] RECOVERY - OpenSearch health check for shards on 9200 on logstash2033 is OK: OK - elasticsearch status production-elk7-codfw: cluster_name: production-elk7-codfw, status: green, timed_out: False, number_of_nodes: 18, 
number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 677, active_shards: 1561, relocating_shards: 6, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar [20:27:03] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:27:10] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:28:56] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:29:41] (03CR) 10Dzahn: [C: 03+2] doc: Switch rsync service to use firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/987406 (owner: 10Muehlenhoff) [20:32:13] (03CR) 10Eevans: [C: 03+1] swift: add new storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/987448 (https://phabricator.wikimedia.org/T353149) (owner: 10MVernon) [20:32:33] (03PS3) 10Samtar: Add "patroller" user group to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986200 (https://phabricator.wikimedia.org/T354063) (owner: 10Novem Linguae) [20:34:03] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw1377.eqiad.wmnet with OS bullseye [20:37:07] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw1378.eqiad.wmnet with OS bullseye [20:43:55] 10SRE, 10ops-codfw: Degraded RAID on logstash2033 - https://phabricator.wikimedia.org/T354249 (10colewhite) I rebuilt the array and the host is now 
allocating shards. Thank you so much! [20:45:18] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw1379.eqiad.wmnet with OS bullseye [20:47:42] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw1380.eqiad.wmnet with OS bullseye [20:59:47] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw1381.eqiad.wmnet with OS bullseye [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240103T2100). [21:00:04] NovemLinguae: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:10] * TheresNoTime can deploy [21:00:33] here [21:00:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986200 (https://phabricator.wikimedia.org/T354063) (owner: 10Novem Linguae) [21:01:33] (03Merged) 10jenkins-bot: Add "patroller" user group to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986200 (https://phabricator.wikimedia.org/T354063) (owner: 10Novem Linguae) [21:02:02] !log samtar@deploy2002 Started scap: Backport for [[gerrit:986200|Add "patroller" user group to testwiki (T354063)]] [21:02:08] T354063: Add "patroller" user group to testwiki - https://phabricator.wikimedia.org/T354063 [21:04:10] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw1382.eqiad.wmnet with OS bullseye [21:04:55] NovemLinguae: are you all set up with WikimediaDebug installed? 
:) [21:05:02] yes, ready :) [21:05:15] James_F: regarding https://gerrit.wikimedia.org/r/c/mediawiki/core/+/987460: would you prefer to revert the causing centralauth patch on wmf.12 and let the proper fix go in with next week's train? [21:06:09] !log samtar@deploy2002 novemlinguae and samtar: Backport for [[gerrit:986200|Add "patroller" user group to testwiki (T354063)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:06:21] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw1383.eqiad.wmnet with OS bullseye [21:06:40] NovemLinguae: okay, you can select any of the mwdebug hosts and test this [21:08:09] tested, works. looks good to me [21:08:14] ack [21:08:17] !log samtar@deploy2002 novemlinguae and samtar: Continuing with sync [21:08:57] (03PS1) 10Kamila Součková: Set MW API servers to insetup to fix failed reimage [puppet] - 10https://gerrit.wikimedia.org/r/987487 (https://phabricator.wikimedia.org/T351074) [21:09:48] kamila_: I had 6 k8s nodes "failed to pull the multiversion image" during that step of scap just now, seems to be the ones that failed to reimage - https://phabricator.wikimedia.org/P54519 [21:10:56] (03CR) 10Kamila Součková: [C: 03+2] Set MW API servers to insetup to fix failed reimage [puppet] - 10https://gerrit.wikimedia.org/r/987487 (https://phabricator.wikimedia.org/T351074) (owner: 10Kamila Součková) [21:11:30] NovemLinguae: this sync will take a few minutes, and then once complete you can confirm the patch works in production :) [21:11:51] groovy :) [21:12:25] TheresNoTime: yep, that'd be me, sorry [21:12:47] np, just letting you know in case it's helpful/unexpected :) [21:12:57] but... did the 7th one work?
:D [21:13:21] I think it's both helpful and unexpected :D [21:13:26] k7s and k8s worked perfectly /joke [21:13:30] :D [21:13:55] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:14:07] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:14:16] (03CR) 10Dzahn: [C: 03+1] Set MW API servers to insetup to fix failed reimage [puppet] - 10https://gerrit.wikimedia.org/r/987487 (https://phabricator.wikimedia.org/T351074) (owner: 10Kamila Součková) [21:14:22] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:986200|Add "patroller" user group to testwiki (T354063)]] (duration: 12m 19s) [21:14:31] T354063: Add "patroller" user group to testwiki - https://phabricator.wikimedia.org/T354063 [21:14:38] NovemLinguae: live on prod [21:14:51] ok, turning off debug extension and re-testing [21:16:05] (03PS9) 10Zabe: Update mediawiki/mediawiki-codesniffer to 42.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986825 [21:16:22] (03CR) 10CI reject: [V: 04-1] Update mediawiki/mediawiki-codesniffer to 42.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986825 (owner: 10Zabe) [21:16:35] RECOVERY - MD RAID on aqs1013 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [21:16:53] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.262 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:17:05] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51305 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:17:33] (03PS10) 10Zabe: Update mediawiki/mediawiki-codesniffer to 42.0.0 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/986825 [21:17:36] looks good [21:17:46] awesome, all done then :) [21:18:25] thanks! crossing "participate in a mediawiki config deployment" off my bucket list :) [21:18:32] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1377.eqiad.wmnet with OS bullseye [21:18:51] painless! (until it isn't :p) [21:19:21] !log UTC late backport window done [21:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:16] jouncebot: nowandnext [21:22:16] For the next 0 hour(s) and 37 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240103T2100) [21:22:16] In 0 hour(s) and 37 minute(s): Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240103T2200) [21:22:33] (03CR) 10Zabe: [C: 03+2] Update mediawiki/mediawiki-codesniffer to 42.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986825 (owner: 10Zabe) [21:23:27] (03Merged) 10jenkins-bot: Update mediawiki/mediawiki-codesniffer to 42.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986825 (owner: 10Zabe) [21:23:59] !log zabe@deploy2002 Started scap: Backport for [[gerrit:986825|Update mediawiki/mediawiki-codesniffer to 42.0.0]] [21:27:38] !log zabe@deploy2002 zabe: Backport for [[gerrit:986825|Update mediawiki/mediawiki-codesniffer to 42.0.0]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:28:49] !log zabe@deploy2002 zabe: Continuing with sync [21:32:31] (03PS1) 10Dzahn: alerting: replace serviceops-collab with new team name [puppet] - 10https://gerrit.wikimedia.org/r/987488 [21:33:15] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1377.eqiad.wmnet with reason: host reimage [21:33:47] (03PS2) 10Dzahn: alerting: replace serviceops-collab with new team name [puppet] - 10https://gerrit.wikimedia.org/r/987488 [21:34:12] (KubernetesCalicoDown) firing: mw1378.eqiad.wmnet is not running 
calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=mw1378.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:34:34] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:986825|Update mediawiki/mediawiki-codesniffer to 42.0.0]] (duration: 10m 34s) [21:36:27] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1377.eqiad.wmnet with reason: host reimage [21:39:31] PROBLEM - Host mw1379 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:35] PROBLEM - Host mw1380 is DOWN: PING CRITICAL - Packet loss = 100% [21:43:46] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2087.codfw.wmnet with OS bullseye [21:44:13] (KubernetesCalicoDown) firing: (3) mw1378.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:44:59] (03PS5) 10Hashar: Merge tag 'v3.6.8' into wmf/stable-3.6 [software/gerrit] (wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/987438 (https://phabricator.wikimedia.org/T309870) [21:47:44] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: broken reimage [21:48:02] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: broken reimage [21:48:53] (03PS6) 10Hashar: Merge tag 'v3.6.8' into wmf/stable-3.6 [software/gerrit] (wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/987438 (https://phabricator.wikimedia.org/T309870) [21:52:28] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1377.eqiad.wmnet with OS bullseye [21:59:57] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2087.codfw.wmnet with reason: host reimage [22:00:04] Deploy window Wikifunction Services UTC Late 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240103T2200) [22:03:10] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2087.codfw.wmnet with reason: host reimage [22:09:03] (03PS1) 10Dzahn: puppet: add quota module to vendor_modules [puppet] - 10https://gerrit.wikimedia.org/r/987491 (https://phabricator.wikimedia.org/T343364) [22:09:45] (03CR) 10Dzahn: [C: 04-1] "arr, no,I did not want to add it as a submodule.." [puppet] - 10https://gerrit.wikimedia.org/r/987491 (https://phabricator.wikimedia.org/T343364) (owner: 10Dzahn) [22:17:03] (03PS2) 10Dzahn: puppet: add quota module to vendor_modules [puppet] - 10https://gerrit.wikimedia.org/r/987491 (https://phabricator.wikimedia.org/T343364) [22:20:26] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2087.codfw.wmnet with OS bullseye [22:21:26] (03PS1) 10Peter Fischer: Search update pipeline: update README [deployment-charts] - 10https://gerrit.wikimedia.org/r/987494 [22:24:05] 10SRE, 10Data-Platform-SRE, 10serviceops, 10Discovery-Search (Current work): SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (10pfischer) [22:26:00] 10SRE, 10Data-Platform-SRE, 10serviceops, 10Discovery-Search (Current work): SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (10pfischer) a:03pfischer [22:30:51] (03CR) 10Krinkle: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (036 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [22:32:10] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:36:23] !log 
bking@cumin2002 START - Cookbook sre.wdqs.restart [22:36:42] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1378.eqiad.wmnet with OS bullseye [22:37:00] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1379.eqiad.wmnet with OS bullseye [22:37:10] (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:37:53] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1380.eqiad.wmnet with OS bullseye [22:37:57] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1381.eqiad.wmnet with OS bullseye [22:38:01] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1382.eqiad.wmnet with OS bullseye [22:38:08] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1383.eqiad.wmnet with OS bullseye [22:40:01] !log bking@cumin2002 START - Cookbook sre.wdqs.restart [22:41:03] RECOVERY - Host mw1379 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [22:42:55] RECOVERY - Host mw1380 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [22:51:17] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1378.eqiad.wmnet with reason: host reimage [22:51:29] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1379.eqiad.wmnet with reason: host reimage [22:52:18] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1381.eqiad.wmnet with reason: host reimage [22:52:31] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1380.eqiad.wmnet with reason: host reimage [22:52:38] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1382.eqiad.wmnet with reason: host reimage [22:52:40] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
mw1383.eqiad.wmnet with reason: host reimage [22:54:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1378.eqiad.wmnet with reason: host reimage [22:54:11] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1381.eqiad.wmnet with reason: host reimage [22:57:13] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1382.eqiad.wmnet with reason: host reimage [22:59:19] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1379.eqiad.wmnet with reason: host reimage [22:59:25] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1380.eqiad.wmnet with reason: host reimage [23:01:22] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [23:02:09] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1383.eqiad.wmnet with reason: host reimage [23:07:20] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw1379.eqiad.wmnet with OS bullseye [23:10:55] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1381.eqiad.wmnet with OS bullseye [23:12:21] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1378.eqiad.wmnet with OS bullseye [23:14:08] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1382.eqiad.wmnet with OS bullseye [23:14:22] (03PS1) 10Hashar: Gerrit 3.6.8 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.6) - 10https://gerrit.wikimedia.org/r/987498 (https://phabricator.wikimedia.org/T309870) [23:15:25] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1380.eqiad.wmnet with OS bullseye [23:18:56] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1383.eqiad.wmnet with OS 
bullseye [23:24:27] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [23:24:35] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [23:33:34] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [23:39:12] (KubernetesCalicoDown) firing: mw1377.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=mw1377.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:39:19] (03PS1) 10BCornwall: pybal: Disable Pint promql/series checks [alerts] - 10https://gerrit.wikimedia.org/r/987499 (https://phabricator.wikimedia.org/T353760) [23:50:15] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on mw1379.eqiad.wmnet with reason: failed reimage, will fix tomorrow [23:50:19] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw1379.eqiad.wmnet with reason: failed reimage, will fix tomorrow [23:50:26] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on mw1379.eqiad.wmnet with reason: failed reimage, will fix tomorrow [23:50:30] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on mw1379.eqiad.wmnet with reason: failed reimage, will fix tomorrow