[00:00:03] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1241.eqiad.wmnet with reason: Maintenance [00:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250204T0000) [00:00:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1241 (T384592)', diff saved to https://phabricator.wikimedia.org/P73112 and previous config saved to /var/cache/conftool/dbconfig/20250204-000010-marostegui.json [00:13:48] FIRING: [3x] PuppetFailure: Puppet has failed on aux-k8s-etcd2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [00:27:55] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:32:04] (03PS1) 10BryanDavis: sysctl: Introduce base::sysctl::inotify helper [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) [00:34:54] (03PS2) 10BryanDavis: sysctl: Introduce base::sysctl::inotify helper [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) [00:38:14] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1116889 [00:38:14] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1116889 (owner: 10TrainBranchBot) [00:41:56] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1116889 (owner: 10TrainBranchBot) [00:47:23] (03CR) 10Andrew Bogott: [C:03+1] "Janis, any concerns about this 
tiny refactor?" [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: 10BryanDavis) [01:08:00] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [01:08:39] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1116890 [01:08:39] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1116890 (owner: 10TrainBranchBot) [01:11:54] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns entries for new frack nodes - pt1979@cumin2002" [01:11:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns entries for new frack nodes - pt1979@cumin2002" [01:12:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:17:45] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1116890 (owner: 10TrainBranchBot) [01:25:37] (03PS1) 10Scott French: mw-(api-ext|web): scale next to 15% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116891 (https://phabricator.wikimedia.org/T383845) [01:25:39] (03PS1) 10Scott French: Enroll 25% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116892 (https://phabricator.wikimedia.org/T383845) [01:25:41] (03PS1) 10Scott French: mw-api-int: serve 2% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116893 (https://phabricator.wikimedia.org/T383845) [01:34:43] RECOVERY - Host ripe-atlas-eqsin is UP: PING WARNING - Packet loss = 77%, RTA = 0.37 ms [01:41:07] PROBLEM - Host ripe-atlas-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [01:58:48] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [02:01:10] !log jhancock@cumin2002 END (PASS) - 
Cookbook sre.dns.netbox (exit_code=0) [02:02:19] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1053 [02:04:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1053 [02:04:39] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1054 [02:06:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1054 [02:08:01] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.44.0-wmf.15 [core] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1116896 (https://phabricator.wikimedia.org/T382366) [02:08:03] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.44.0-wmf.15 [core] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1116896 (https://phabricator.wikimedia.org/T382366) (owner: 10TrainBranchBot) [02:17:37] (03Merged) 10jenkins-bot: Branch commit for wmf/1.44.0-wmf.15 [core] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1116896 (https://phabricator.wikimedia.org/T382366) (owner: 10TrainBranchBot) [02:27:55] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:37:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:40:09] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [02:42:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [02:42:37] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1053 [02:43:36] !log 
jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1053 [02:48:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10519642 (10Jhancock.wm) [02:49:40] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [02:52:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [02:55:56] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1053 [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250204T0300) [03:02:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:04:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1053 [03:16:49] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1054 [03:19:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1054 [03:41:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10519708 (10Jhancock.wm) @VRiley-WMF can you verify the SNs for these two servers. they should end with either 391 or 392. @elukey these servers are not... 
[03:52:47] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:53:47] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250204T0400) [04:01:43] (03PS1) 10TrainBranchBot: testwikis to 1.44.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116903 (https://phabricator.wikimedia.org/T382366) [04:01:44] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.44.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116903 (https://phabricator.wikimedia.org/T382366) (owner: 10TrainBranchBot) [04:02:30] (03Merged) 10jenkins-bot: testwikis to 1.44.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116903 (https://phabricator.wikimedia.org/T382366) (owner: 10TrainBranchBot) [04:02:57] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.15 refs T382366 [04:03:00] T382366: 1.44.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T382366 [04:07:21] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:07:49] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:13:48] FIRING: [3x] PuppetFailure: Puppet has failed on aux-k8s-etcd2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:23:15] PROBLEM - Disk space on deploy2002 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/480a78a2efc69a16fc171eb742f2ba0ed36d45734eb53ff61e26c680f23c05af/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy2002&var-datasource=codfw+prometheus/ops [04:43:15] RECOVERY - Disk space on deploy2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy2002&var-datasource=codfw+prometheus/ops [04:59:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T384592)', diff saved to https://phabricator.wikimedia.org/P73114 and previous config saved to /var/cache/conftool/dbconfig/20250204-045922-marostegui.json [04:59:25] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250204T0500) [05:01:50] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.44.0-wmf.15 refs T382366 (duration: 58m 53s) [05:01:52] T382366: 1.44.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T382366 [05:04:51] !log mwpresync@deploy2002 Pruned MediaWiki: 1.44.0-wmf.12 (duration: 04m 49s) [05:14:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P73115 and previous config saved to /var/cache/conftool/dbconfig/20250204-051429-marostegui.json [05:23:33] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring 
[05:24:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10519765 (10Papaul) @Jhancock.wm i checked the serial number on 1053, it is the serial number ending with 392. Trying re-running the cookbook with the --en... [05:29:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P73116 and previous config saved to /var/cache/conftool/dbconfig/20250204-052936-marostegui.json [05:39:49] (03PS1) 10KartikMistry: Update cxserver to 2025-02-03-095815-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116912 (https://phabricator.wikimedia.org/T377966) [05:44:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T384592)', diff saved to https://phabricator.wikimedia.org/P73117 and previous config saved to /var/cache/conftool/dbconfig/20250204-054443-marostegui.json [05:44:47] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [05:44:59] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1242.eqiad.wmnet with reason: Maintenance [05:45:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1242 (T384592)', diff saved to https://phabricator.wikimedia.org/P73118 and previous config saved to /var/cache/conftool/dbconfig/20250204-054505-marostegui.json [05:45:24] (03PS1) 10KartikMistry: Update Recommendation API to 2025-01-31-105046-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116914 [06:00:23] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:00:49] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, 
down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:07:31] (03CR) 10KartikMistry: [C:03+2] Update Recommendation API to 2025-01-31-105046-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116914 (owner: 10KartikMistry) [06:08:51] (03Merged) 10jenkins-bot: Update Recommendation API to 2025-01-31-105046-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116914 (owner: 10KartikMistry) [06:26:31] !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [06:32:05] (03CR) 10Ladsgroup: "It definitely should happen in later patches, but since the file is much smaller now, we can merge it back to IS.php." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115518 (https://phabricator.wikimedia.org/T385239) (owner: 10Urbanecm) [06:34:32] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:35:43] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:44:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2227 to clone db2209', diff saved to https://phabricator.wikimedia.org/P73119 and previous config saved to /var/cache/conftool/dbconfig/20250204-064425-marostegui.json [06:45:08] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db2227.codfw.wmnet onto db2209.codfw.wmnet [06:46:55] (03CR) 10Marostegui: [C:03+1] dbbackups: Update grants for misc hosts other than m1 [puppet] - 10https://gerrit.wikimedia.org/r/1116845 (https://phabricator.wikimedia.org/T383902) (owner: 
10Jcrespo) [06:56:47] (03PS1) 10Kevin Bazira: changeprop: add liftwing article-country source stream to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117063 (https://phabricator.wikimedia.org/T382295) [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250204T0700) [07:00:05] marostegui, Amir1, and federico3: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250204T0700). [07:00:38] ?? [07:10:58] it's a bot, federico3, don't worry; it is just highlighting you during maintenance windows [07:12:12] Thanks, I'm aware of the bot, just puzzled by the phrasing 😅 [07:12:29] aha it took me a while also :D [07:23:31] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:32:08] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:35:43] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:48:09] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1013.eqiad.wmnet with reason: Rebuild tables [07:49:59] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet with reason: Rebuild tables [07:54:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1197', diff saved to https://phabricator.wikimedia.org/P73120 and previous config saved to 
/var/cache/conftool/dbconfig/20250204-075440-marostegui.json [07:54:54] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1197.eqiad.wmnet [07:55:23] (03PS1) 10Urbanecm: Babel: Merge back into IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117105 (https://phabricator.wikimedia.org/T385239) [07:55:56] (03CR) 10Urbanecm: [C:04-2] "Agreed. Uploaded I7e9a49c7b7ef050ff9ab4ba860aab0eb1946150b." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115518 (https://phabricator.wikimedia.org/T385239) (owner: 10Urbanecm) [07:56:08] (03CR) 10CI reject: [V:04-1] Babel: Merge back into IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117105 (https://phabricator.wikimedia.org/T385239) (owner: 10Urbanecm) [07:56:16] ...and CI hates it! [07:56:37] (03PS2) 10Urbanecm: Babel: Merge back into IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117105 (https://phabricator.wikimedia.org/T385239) [07:56:55] (03PS1) 10Marostegui: installserver: Remove db1251 [puppet] - 10https://gerrit.wikimedia.org/r/1117106 [08:00:05] Amir1, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250204T0800). [08:00:05] cyndy: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:48] am here :) [08:00:59] i can deploy today! 
[08:01:12] let's do this :) [08:01:24] (03CR) 10Urbanecm: [C:03+2] Add configurable MinimumTasksPerTopic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113984 (https://phabricator.wikimedia.org/T383714) (owner: 10Cyndywikime) [08:01:34] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1197.eqiad.wmnet [08:01:37] (03PS4) 10Cyndywikime: Add configurable MinimumTasksPerTopic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113984 (https://phabricator.wikimedia.org/T383714) [08:01:40] (03CR) 10Urbanecm: Add configurable MinimumTasksPerTopic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113984 (https://phabricator.wikimedia.org/T383714) (owner: 10Cyndywikime) [08:01:44] (03CR) 10Urbanecm: [C:03+2] Add configurable MinimumTasksPerTopic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113984 (https://phabricator.wikimedia.org/T383714) (owner: 10Cyndywikime) [08:01:48] (03PS3) 10Urbanecm: [Growth] Increase minimum tasks per topic to 2000 for eswiki, frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115791 (https://phabricator.wikimedia.org/T378527) [08:01:51] (03CR) 10Urbanecm: [C:03+2] [Growth] Increase minimum tasks per topic to 2000 for eswiki, frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115791 (https://phabricator.wikimedia.org/T378527) (owner: 10Urbanecm) [08:02:02] (03CR) 10Marostegui: [C:03+2] installserver: Remove db1251 [puppet] - 10https://gerrit.wikimedia.org/r/1117106 (owner: 10Marostegui) [08:02:05] Cyndywikime: i am going to deploy both patches at the same time [08:02:07] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1197.eqiad.wmnet with reason: Index rebuild [08:02:32] @urbanecm- ok :) [08:02:36] (03Merged) 10jenkins-bot: Add configurable MinimumTasksPerTopic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113984 (https://phabricator.wikimedia.org/T383714) (owner: 10Cyndywikime) [08:02:39] (03Merged) 10jenkins-bot: 
[Growth] Increase minimum tasks per topic to 2000 for eswiki, frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115791 (https://phabricator.wikimedia.org/T378527) (owner: 10Urbanecm) [08:04:07] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1113984|Add configurable MinimumTasksPerTopic (T383714)]], [[gerrit:1115791|[Growth] Increase minimum tasks per topic to 2000 for eswiki, frwiki (T378527)]] [08:04:11] T383714: Move minimumTasksPerTopic from CommunityConfiguration to PHP configuration - https://phabricator.wikimedia.org/T383714 [08:04:12] T378527: Surfacing structured tasks: Populate Add Link suggestions for more articles - https://phabricator.wikimedia.org/T378527 [08:10:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1236', diff saved to https://phabricator.wikimedia.org/P73121 and previous config saved to /var/cache/conftool/dbconfig/20250204-081056-marostegui.json [08:11:35] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1236.eqiad.wmnet with reason: Index rebuild [08:11:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2220', diff saved to https://phabricator.wikimedia.org/P73122 and previous config saved to /var/cache/conftool/dbconfig/20250204-081151-marostegui.json [08:12:12] !log root@cumin1002 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 12:00:00 on db2220.codfw.wmnet with reason: Index rebuild [08:12:22] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2220.codfw.wmnet [08:12:34] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1236.eqiad.wmnet [08:13:17] !log urbanecm@deploy2002 urbanecm, cyndywikime: Backport for [[gerrit:1113984|Add configurable MinimumTasksPerTopic (T383714)]], [[gerrit:1115791|[Growth] Increase minimum tasks per topic to 2000 for eswiki, frwiki (T378527)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:13:21] T383714: Move minimumTasksPerTopic from 
CommunityConfiguration to PHP configuration - https://phabricator.wikimedia.org/T383714 [08:13:21] T378527: Surfacing structured tasks: Populate Add Link suggestions for more articles - https://phabricator.wikimedia.org/T378527 [08:13:49] FIRING: [3x] PuppetFailure: Puppet has failed on aux-k8s-etcd2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:15:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2227.codfw.wmnet onto db2209.codfw.wmnet [08:15:45] Cyndywikime: i'll be running refreshLinkRecommendations for testwiki on debug now, to verify it is able to start [08:16:26] @urbanecm, okay [08:17:13] (03PS3) 10Kevin Bazira: EventStreamConfig: Add mediawiki.article_country_prediction_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112451 (https://phabricator.wikimedia.org/T382295) [08:17:53] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1236.eqiad.wmnet [08:18:11] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2220.codfw.wmnet [08:18:35] Cyndywikime: proceeds as expected [08:18:46] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2220.codfw.wmnet with reason: Index rebuild [08:18:47] Cyndywikime: anything else we should test before proceeding? [08:18:53] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1236.eqiad.wmnet with reason: Index rebuild [08:19:02] (03CR) 10Kevin Bazira: EventStreamConfig: Add mediawiki.article_country_prediction_change stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112451 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [08:19:16] @urbanecm, nothing comes to mind at the moment. 
We can proceed [08:19:22] !log urbanecm@deploy2002 urbanecm, cyndywikime: Continuing with sync [08:19:24] proceeding [08:20:24] a/11 [08:20:27] err :) [08:21:16] (03PS1) 10Marostegui: Revert "db2209: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1117107 [08:21:46] (03CR) 10Marostegui: [C:03+2] Revert "db2209: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1117107 (owner: 10Marostegui) [08:22:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73123 and previous config saved to /var/cache/conftool/dbconfig/20250204-082210-root.json [08:22:39] (03PS1) 10Urbanecm: Move link recommendation minimum tasks per topic to PHP configuration [extensions/GrowthExperiments] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1117109 (https://phabricator.wikimedia.org/T383714) [08:23:03] Cyndywikime: i just realised it's tuesday, so the patch would arrive to eswiki/frwiki by Thursday only. let's backport it as well, so that we have more of monitoring time. [08:23:09] (03CR) 10Urbanecm: [C:03+2] Move link recommendation minimum tasks per topic to PHP configuration [extensions/GrowthExperiments] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1117109 (https://phabricator.wikimedia.org/T383714) (owner: 10Urbanecm) [08:23:17] +2'ing to start CI [08:23:23] okay, sounds good to me :) [08:23:54] (for some reason, yesterday, i was convinced today would be Thursday.) [08:24:09] (03CR) 10Ayounsi: Modifications to CR BGP policy for eqiad cloud-private IPv6 aggregate (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1112268 (https://phabricator.wikimedia.org/T37947) (owner: 10Cathal Mooney) [08:25:24] ETA: 8 min? for a wmf/* patch? 
that's unusually quick [08:27:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73124 and previous config saved to /var/cache/conftool/dbconfig/20250204-082738-root.json [08:29:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2039 for kernel reboot', diff saved to https://phabricator.wikimedia.org/P73125 and previous config saved to /var/cache/conftool/dbconfig/20250204-082912-marostegui.json [08:29:27] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for es2039.codfw.wmnet [08:29:33] (03CR) 10CI reject: [V:04-1] Move link recommendation minimum tasks per topic to PHP configuration [extensions/GrowthExperiments] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1117109 (https://phabricator.wikimedia.org/T383714) (owner: 10Urbanecm) [08:29:44] (03CR) 10Urbanecm: [C:03+2] "..." [extensions/GrowthExperiments] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1117109 (https://phabricator.wikimedia.org/T383714) (owner: 10Urbanecm) [08:30:04] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113984|Add configurable MinimumTasksPerTopic (T383714)]], [[gerrit:1115791|[Growth] Increase minimum tasks per topic to 2000 for eswiki, frwiki (T378527)]] (duration: 25m 56s) [08:30:07] T383714: Move minimumTasksPerTopic from CommunityConfiguration to PHP configuration - https://phabricator.wikimedia.org/T383714 [08:30:08] T378527: Surfacing structured tasks: Populate Add Link suggestions for more articles - https://phabricator.wikimedia.org/T378527 [08:30:10] waiting on CI [08:31:36] @urbanecm , monitoring as well :) [08:31:43] ty! 
[08:34:55] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es2039.codfw.wmnet [08:36:00] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Rebuild tables [08:37:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73127 and previous config saved to /var/cache/conftool/dbconfig/20250204-083716-root.json [08:40:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73128 and previous config saved to /var/cache/conftool/dbconfig/20250204-084052-root.json [08:41:59] @urbanecm, i can see some unrelated CI errors => cannot delete ‘/cache/npm/_cacache/content-v2/sha512/5f/36’: No such file or directory`, [08:42:37] Cyndywikime: i don't see that error in the second run? [08:42:43] note i restarted CI in the middle, as a job failed [08:42:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73129 and previous config saved to /var/cache/conftool/dbconfig/20250204-084244-root.json [08:44:35] hhhmmm, okay [08:45:13] (03PS1) 10Bking: dse-k8s-eqiad: Create sidecar controller ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117112 (https://phabricator.wikimedia.org/T385551) [08:46:02] (03Merged) 10jenkins-bot: Move link recommendation minimum tasks per topic to PHP configuration [extensions/GrowthExperiments] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1117109 (https://phabricator.wikimedia.org/T383714) (owner: 10Urbanecm) [08:46:20] here we go! [08:46:42] Ah, seen it! 
All good [08:46:46] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1117109|Move link recommendation minimum tasks per topic to PHP configuration (T383714)]] [08:46:49] T383714: Move minimumTasksPerTopic from CommunityConfiguration to PHP configuration - https://phabricator.wikimedia.org/T383714 [08:52:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73130 and previous config saved to /var/cache/conftool/dbconfig/20250204-085221-root.json [08:52:26] !log push pfw policies T384885 [08:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:38] 06SRE, 10Wikidata, 10Wikidata Integration in Wikimedia projects, 10Wikimedia-Site-requests: Increase entityAccessLimit for WikibaseClient wikis - https://phabricator.wikimedia.org/T384455#10519970 (10thiemowmde) [08:53:14] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1117109|Move link recommendation minimum tasks per topic to PHP configuration (T383714)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:53:17] T383714: Move minimumTasksPerTopic from CommunityConfiguration to PHP configuration - https://phabricator.wikimedia.org/T383714 [08:53:58] trying at mwdebug [08:54:13] https://www.irccloud.com/pastebin/4Pl6IZaN/ [08:55:05] https://es.wikipedia.org/wiki/Especial:NewcomerTasksInfo [08:55:34] food-and-drink has 119 recommendations [08:55:37] 1881 new ones seems accurate [08:55:45] seems the config picked [08:55:50] yep [08:55:53] Cyndywikime: let's go? [08:55:55] nice [08:56:08] yes, let's go. 
Thank you :) [08:56:13] !log urbanecm@deploy2002 urbanecm: Continuing with sync [08:56:16] ty, proceeding [08:57:21] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1169.eqiad.wmnet with reason: Repooling after clone - T383760 [08:57:24] T383760: dbctl: expose diff via API in a more structured way - https://phabricator.wikimedia.org/T383760 [08:57:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73131 and previous config saved to /var/cache/conftool/dbconfig/20250204-085749-root.json [08:58:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73132 and previous config saved to /var/cache/conftool/dbconfig/20250204-085828-root.json [09:00:04] jnuche and jeena: #bothumor I � Unicode. All rise for MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250204T0900). 
[09:00:54] 👋 morning, looks like backports are still running, please let me know when I can go ahead with the train [09:01:35] jnuche: will do, i have one last sync finishing up [09:02:01] (03CR) 10Ottomata: [C:03+1] EventStreamConfig: Add mediawiki.article_country_prediction_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112451 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [09:02:27] (03PS1) 10KartikMistry: Make MT limit more strict by 10% in Bhojpuri Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117113 (https://phabricator.wikimedia.org/T383789) [09:02:50] (03CR) 10Ottomata: [C:03+1] EventStreamConfig: Add mediawiki.article_country_prediction_change stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112451 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [09:03:22] !log fceratto@cumin1002 START - Cookbook sre.hosts.remove-downtime for db1169.eqiad.wmnet [09:03:23] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1169.eqiad.wmnet [09:04:14] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1117109|Move link recommendation minimum tasks per topic to PHP configuration (T383714)]] (duration: 17m 28s) [09:04:17] T383714: Move minimumTasksPerTopic from CommunityConfiguration to PHP configuration - https://phabricator.wikimedia.org/T383714 [09:04:27] (03CR) 10Brouberol: [C:03+1] dse-k8s-eqiad: Create sidecar controller ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117112 (https://phabricator.wikimedia.org/T385551) (owner: 10Bking) [09:04:28] Cyndywikime: should be all done [09:04:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10520019 (10phaultfinder) [09:04:41] jnuche: over to you! thank you for the patience. [09:05:02] urbanecm: no worries, ty! 
[09:05:08] Thanks Martin for co-ordinating this :) [09:06:01] (03CR) 10Bking: [C:03+2] dse-k8s-eqiad: Create sidecar controller ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117112 (https://phabricator.wikimedia.org/T385551) (owner: 10Bking) [09:06:26] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117114 (https://phabricator.wikimedia.org/T382366) [09:06:27] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117114 (https://phabricator.wikimedia.org/T382366) (owner: 10TrainBranchBot) [09:07:11] (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117114 (https://phabricator.wikimedia.org/T382366) (owner: 10TrainBranchBot) [09:07:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73134 and previous config saved to /var/cache/conftool/dbconfig/20250204-090726-root.json [09:09:53] (03Merged) 10jenkins-bot: dse-k8s-eqiad: Create sidecar controller ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117112 (https://phabricator.wikimedia.org/T385551) (owner: 10Bking) [09:12:25] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:12:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73135 and previous config saved to /var/cache/conftool/dbconfig/20250204-091254-root.json [09:13:12] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[09:13:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73136 and previous config saved to /var/cache/conftool/dbconfig/20250204-091334-root.json [09:14:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10520040 (10phaultfinder) [09:15:31] (03PS1) 10Marostegui: Revert "dbtools: Drop depool and repool bashes" [software] - 10https://gerrit.wikimedia.org/r/1117116 [09:15:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112451 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [09:16:25] (03CR) 10Marostegui: [C:03+2] Revert "dbtools: Drop depool and repool bashes" [software] - 10https://gerrit.wikimedia.org/r/1117116 (owner: 10Marostegui) [09:16:57] (03Merged) 10jenkins-bot: Revert "dbtools: Drop depool and repool bashes" [software] - 10https://gerrit.wikimedia.org/r/1117116 (owner: 10Marostegui) [09:18:03] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.15 refs T382366 [09:18:06] T382366: 1.44.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T382366 [09:21:15] (03PS1) 10Jgiannelos: kartotherian: Bump image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117118 [09:22:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73137 and previous config saved to /var/cache/conftool/dbconfig/20250204-092232-root.json [09:24:37] (03CR) 10Jgiannelos: [C:03+2] kartotherian: Bump image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117118 (owner: 10Jgiannelos) [09:25:46] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - 
https://phabricator.wikimedia.org/T383383#10520076 (10phaultfinder) [09:25:55] (03Merged) 10jenkins-bot: kartotherian: Bump image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117118 (owner: 10Jgiannelos) [09:28:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73138 and previous config saved to /var/cache/conftool/dbconfig/20250204-092759-root.json [09:28:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73139 and previous config saved to /var/cache/conftool/dbconfig/20250204-092838-root.json [09:30:15] (03PS1) 10Bking: dse-k8s-eqiad: deploy sidecar job controller [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117120 (https://phabricator.wikimedia.org/T385551) [09:32:42] (03PS1) 10Slyngshede: idp-test fallback [dns] - 10https://gerrit.wikimedia.org/r/1117121 [09:33:37] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply [09:34:22] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply [09:38:08] !log mwmaint2002: Kill `mediawiki_job_growthexperiments-refreshLinkRecommendations-s6[6640]` to pick new config (T378527) [09:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:10] T378527: Surfacing structured tasks: Populate Add Link suggestions for more articles - https://phabricator.wikimedia.org/T378527 [09:39:02] !log slyngshede@dns1004 START - running authdns-update [09:39:06] (03CR) 10Slyngshede: [C:03+2] idp-test fallback [dns] - 10https://gerrit.wikimedia.org/r/1117121 (owner: 10Slyngshede) [09:39:13] !log slyngshede@dns1004 START - running authdns-update [09:39:21] !log slyngshede@dns1004 START - running authdns-update [09:39:38] (03PS1) 10Arturo Borrero Gonzalez: profile::puppetserver::wmcs: refresh apt pin for openjdk [puppet] - 
10https://gerrit.wikimedia.org/r/1117122 (https://phabricator.wikimedia.org/T385553) [09:41:14] !log slyngshede@dns1004 END - running authdns-update [09:43:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73140 and previous config saved to /var/cache/conftool/dbconfig/20250204-094344-root.json [09:44:32] FIRING: SystemdUnitFailed: mediawiki_job_growthexperiments-refreshLinkRecommendations-s7.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:47:12] (03CR) 10Brouberol: [C:03+1] dse-k8s-eqiad: deploy sidecar job controller [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117120 (https://phabricator.wikimedia.org/T385551) (owner: 10Bking) [09:49:43] (03PS1) 10Bking: dse-k8s-eqiad: Add query-service ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117126 (https://phabricator.wikimedia.org/T385551) [09:50:14] (03CR) 10Brouberol: [C:03+2] dse-k8s-eqiad: Add query-service ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117126 (https://phabricator.wikimedia.org/T385551) (owner: 10Bking) [09:50:19] (03CR) 10Brouberol: [C:03+1] dse-k8s-eqiad: Add query-service ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117126 (https://phabricator.wikimedia.org/T385551) (owner: 10Bking) [09:50:19] (03CR) 10Bking: [C:03+2] dse-k8s-eqiad: deploy sidecar job controller [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117120 (https://phabricator.wikimedia.org/T385551) (owner: 10Bking) [09:50:31] PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 33MiB (0% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [09:53:22] (03CR) 
10CI reject: [V:04-1] dse-k8s-eqiad: Add query-service ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117126 (https://phabricator.wikimedia.org/T385551) (owner: 10Bking) [09:54:22] (03Merged) 10jenkins-bot: dse-k8s-eqiad: deploy sidecar job controller [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117120 (https://phabricator.wikimedia.org/T385551) (owner: 10Bking) [10:02:09] (03CR) 10Revi: [C:03+1] kowikisource: Add Draft(_talk) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115377 (https://phabricator.wikimedia.org/T385162) (owner: 10Revi) [10:02:19] (03PS4) 10Elukey: admin_ng: set new Docker images for Knative [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116826 (https://phabricator.wikimedia.org/T369493) [10:02:21] damn wrong touch [10:04:43] (03PS2) 10Revi: kowikisource: Add Draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115377 (https://phabricator.wikimedia.org/T385162) [10:08:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1197 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73141 and previous config saved to /var/cache/conftool/dbconfig/20250204-100807-root.json [10:13:39] (03CR) 10Hnowlan: [C:03+1] changeprop: add liftwing article-country source stream to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117063 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [10:13:45] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply [10:13:48] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply [10:13:50] !log depool maps1006 from all services to run perf tests [10:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:53] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: apply [10:15:23] !log jgiannelos@deploy2002 helmfile [eqiad] DONE 
helmfile.d/services/kartotherian: apply [10:15:34] (03CR) 10Ladsgroup: [C:03+1] Babel: Merge back into IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117105 (https://phabricator.wikimedia.org/T385239) (owner: 10Urbanecm) [10:15:58] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: apply [10:16:12] (03PS1) 10Vgutierrez: site,hiera: Reimage lvs4008 as liberica LB [puppet] - 10https://gerrit.wikimedia.org/r/1117138 (https://phabricator.wikimedia.org/T384477) [10:16:26] (03CR) 10Ladsgroup: [C:03+1] Babel: Remove config that is now in community configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115518 (https://phabricator.wikimedia.org/T385239) (owner: 10Urbanecm) [10:18:17] (03CR) 10Klausman: [C:03+1] admin_ng: set new Docker images for Knative [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116826 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [10:20:32] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1117138 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [10:20:38] (03CR) 10Elukey: [C:03+2] admin_ng: set new Docker images for Knative [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116826 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [10:21:56] (03CR) 10Effie Mouzeli: [C:03+2] mw-on-k8s: aggregate remaining alerts by release name [alerts] - 10https://gerrit.wikimedia.org/r/1114018 (https://phabricator.wikimedia.org/T384532) (owner: 10Scott French) [10:22:52] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[10:23:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1197 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73143 and previous config saved to /var/cache/conftool/dbconfig/20250204-102313-root.json [10:23:32] (03Merged) 10jenkins-bot: mw-on-k8s: aggregate remaining alerts by release name [alerts] - 10https://gerrit.wikimedia.org/r/1114018 (https://phabricator.wikimedia.org/T384532) (owner: 10Scott French) [10:24:22] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:25:07] (03CR) 10Jcrespo: "There is a mistake here, but unrelated to the patch, in which striker_toolsbeta database has been renamed to strikertoolsbeta. Will be cor" [puppet] - 10https://gerrit.wikimedia.org/r/1116845 (https://phabricator.wikimedia.org/T383902) (owner: 10Jcrespo) [10:25:31] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [10:27:10] checking [10:29:32] RESOLVED: SystemdUnitFailed: mediawiki_job_growthexperiments-refreshLinkRecommendations-s7.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:29:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10520299 (10phaultfinder) [10:32:19] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [10:32:56] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[10:34:05] (03CR) 10Klausman: [C:03+1] changeprop: add liftwing article-country source stream to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117063 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [10:34:18] the doc backups are running which are huge, there is a bit of clogging but everything is normal [10:38:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1197 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73144 and previous config saved to /var/cache/conftool/dbconfig/20250204-103818-root.json [10:38:47] (03PS1) 10Effie Mouzeli: mw-jobrunner: serve 1% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117139 (https://phabricator.wikimedia.org/T383845) [10:38:49] (03PS1) 10Effie Mouzeli: mw-parsoid: serve 1% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117140 (https://phabricator.wikimedia.org/T383845) [10:40:45] (03CR) 10Vgutierrez: "looking good, `modules/profile/spec/classes/profile_cache_varnish_frontend_spec.rb` needs to be updated to set `single_backend` to `true`" [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [10:44:22] 06SRE, 06Infrastructure-Foundations, 10netops: Homer trying to delete BGP peerings for VMs on new Eqiad ganeti nodes - https://phabricator.wikimedia.org/T381175#10520327 (10ayounsi) For (1) we can have the `sre.ganeti.addnode` cookbook call the PuppetDBImport script towards the end. What do you and @MoritzMu... 
[10:44:22] !log foreachwiki sql.php /srv/mediawiki/php-1.44.0-wmf.14/sql/mysql/patch-collation.sql (T384592) [10:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:24] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [10:53:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1236 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73145 and previous config saved to /var/cache/conftool/dbconfig/20250204-105302-root.json [10:53:15] (03PS2) 10Effie Mouzeli: shellbox: 1 replica on 8.1 for each DC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116837 (https://phabricator.wikimedia.org/T377038) [10:53:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1197 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73146 and previous config saved to /var/cache/conftool/dbconfig/20250204-105323-root.json [10:54:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2040 for kernel reboot', diff saved to https://phabricator.wikimedia.org/P73147 and previous config saved to /var/cache/conftool/dbconfig/20250204-105411-marostegui.json [10:54:28] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for es2040.codfw.wmnet [10:54:48] (03CR) 10Effie Mouzeli: [C:03+1] mw-api-int: serve 2% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116893 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [10:55:06] (03CR) 10Effie Mouzeli: [C:03+1] Enroll 25% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116892 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [10:55:35] (03CR) 10Effie Mouzeli: [C:03+1] mw-(api-ext|web): scale next to 15% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116891 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [10:55:48] 
!log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1229 for index rebuild', diff saved to https://phabricator.wikimedia.org/P73148 and previous config saved to /var/cache/conftool/dbconfig/20250204-105546-marostegui.json [10:56:12] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1229.eqiad.wmnet [10:56:41] (03PS2) 10Vgutierrez: site,hiera: Reimage lvs4008 as liberica LB [puppet] - 10https://gerrit.wikimedia.org/r/1117138 (https://phabricator.wikimedia.org/T384477) [10:56:41] (03PS1) 10Vgutierrez: hiera,liberica: Add missing role options [puppet] - 10https://gerrit.wikimedia.org/r/1117153 (https://phabricator.wikimedia.org/T384477) [10:57:44] (03CR) 10Clément Goubert: [C:03+1] mw-jobrunner: serve 1% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117139 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [10:57:50] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1117153 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [10:57:52] (03CR) 10Clément Goubert: [C:03+1] mw-parsoid: serve 1% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117140 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [10:58:12] (03CR) 10Clément Goubert: [C:03+1] shellbox: 1 replica on 8.1 for each DC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116837 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [10:59:50] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es2040.codfw.wmnet [10:59:54] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1117138 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [11:00:05] effie and swfrench-wmf: Your horoscope predicts another MediaWiki infrastructure (UTC mid-day) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250204T1100). 
[11:01:16] 10ops-eqsin, 06SRE, 06Infrastructure-Foundations, 10netops: WMF RIPE Atlas probe in Eqsin offline - https://phabricator.wikimedia.org/T382519#10520381 (10ayounsi) Let's decom it and focus our efforts on spinning up VMs instead (T385560). It needs to be removed from the list on https://github.com/wikimedia/... [11:01:52] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1229.eqiad.wmnet [11:01:55] 06SRE, 06Infrastructure-Foundations, 10netops: WMF RIPE Atlas probe in Eqiad offline - https://phabricator.wikimedia.org/T382518#10520385 (10ayounsi) Let's decom it and focus our efforts on spinning up VMs instead (T385560). It needs to be removed from the list on https://github.com/wikimedia/operations-pupp... [11:03:29] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1229.eqiad.wmnet with reason: Index rebuild [11:07:54] (03PS1) 10Ayounsi: Remove eqiad and eqsin ripe atlas from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1117154 (https://phabricator.wikimedia.org/T382518) [11:08:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1236 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73149 and previous config saved to /var/cache/conftool/dbconfig/20250204-110808-root.json [11:08:29] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1117154 (https://phabricator.wikimedia.org/T382518) (owner: 10Ayounsi) [11:08:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1197 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73150 and previous config saved to /var/cache/conftool/dbconfig/20250204-110830-root.json [11:11:45] (03CR) 10Hnowlan: [C:03+2] jobqueue: bump ThumbnailRender concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115899 (https://phabricator.wikimedia.org/T385273) (owner: 10Hnowlan) [11:13:31] (03Merged) 
10jenkins-bot: jobqueue: bump ThumbnailRender concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115899 (https://phabricator.wikimedia.org/T385273) (owner: 10Hnowlan) [11:13:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73151 and previous config saved to /var/cache/conftool/dbconfig/20250204-111337-root.json [11:14:15] jouncebot: nowandnext [11:14:15] For the next 0 hour(s) and 45 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250204T1100) [11:14:15] In 1 hour(s) and 45 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250204T1300) [11:15:58] (03CR) 10Effie Mouzeli: [C:03+2] shellbox: 1 replica on 8.1 for each DC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116837 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [11:17:01] (03Merged) 10jenkins-bot: shellbox: 1 replica on 8.1 for each DC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116837 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [11:17:25] (03CR) 10Effie Mouzeli: [C:03+2] mw-jobrunner: serve 1% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117139 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [11:17:30] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [11:17:40] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [11:17:54] (03PS1) 10Vgutierrez: liberica: Reload config using SIGHUP [puppet] - 10https://gerrit.wikimedia.org/r/1117157 (https://phabricator.wikimedia.org/T384477) [11:18:35] (03Merged) 10jenkins-bot: mw-jobrunner: serve 1% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117139 (https://phabricator.wikimedia.org/T383845) (owner: 
10Effie Mouzeli) [11:18:46] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1117157 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [11:18:48] (03PS1) 10Elukey: Revert "amd-pytorch25: use ROCm 6.2 in torch 2.5.1 image" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1117158 [11:18:58] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [11:19:35] (03CR) 10Ilias Sarantopoulos: [C:03+1] Revert "amd-pytorch25: use ROCm 6.2 in torch 2.5.1 image" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1117158 (owner: 10Elukey) [11:19:49] (03CR) 10Elukey: "root@build2001:/srv/images/production-images# docker image ls | grep pytorch" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1117158 (owner: 10Elukey) [11:20:02] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [11:20:15] (03CR) 10Elukey: [V:03+2 C:03+2] Revert "amd-pytorch25: use ROCm 6.2 in torch 2.5.1 image" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1117158 (owner: 10Elukey) [11:20:27] (03CR) 10Elukey: Revert "amd-pytorch25: use ROCm 6.2 in torch 2.5.1 image" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1117158 (owner: 10Elukey) [11:20:59] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [11:21:39] (03Abandoned) 10Jcrespo: mariadb: removes underscore on striker database name [puppet] - 10https://gerrit.wikimedia.org/r/1020709 (https://phabricator.wikimedia.org/T360149) (owner: 10Arnaudb) [11:21:57] (03PS2) 10Elukey: Revert "amd-pytorch25: use ROCm 6.2 in torch 2.5.1 image" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1117158 [11:22:05] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [11:22:14] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply 
[11:23:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1236 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73152 and previous config saved to /var/cache/conftool/dbconfig/20250204-112313-root.json [11:23:20] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [11:24:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10520474 (10phaultfinder) [11:24:42] FIRING: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:25:55] (03CR) 10Elukey: [V:03+2 C:03+2] Revert "amd-pytorch25: use ROCm 6.2 in torch 2.5.1 image" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1117158 (owner: 10Elukey) [11:25:56] (03CR) 10Effie Mouzeli: [C:03+2] mw-parsoid: serve 1% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117140 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [11:26:16] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [11:27:10] (03Merged) 10jenkins-bot: mw-parsoid: serve 1% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117140 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [11:28:15] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [11:28:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2040 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73153 and previous config saved to /var/cache/conftool/dbconfig/20250204-112844-root.json [11:31:00] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [11:32:09] (03PS1) 10Lucas Werkmeister (WMDE): Avoid PHP 
Notice on missing entityschema-meta-tags [extensions/EntitySchema] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1117162 (https://phabricator.wikimedia.org/T385272) [11:32:22] (03PS1) 10Lucas Werkmeister (WMDE): Avoid PHP Notice on missing entityschema-meta-tags [extensions/EntitySchema] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1117163 (https://phabricator.wikimedia.org/T385272) [11:32:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/EntitySchema] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1117163 (https://phabricator.wikimedia.org/T385272) (owner: 10Lucas Werkmeister (WMDE)) [11:33:01] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:33:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/EntitySchema] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1117162 (https://phabricator.wikimedia.org/T385272) (owner: 10Lucas Werkmeister (WMDE)) [11:33:22] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [11:33:27] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] profile::puppetserver::wmcs: refresh apt pin for openjdk [puppet] - 10https://gerrit.wikimedia.org/r/1117122 (https://phabricator.wikimedia.org/T385553) (owner: 10Arturo Borrero Gonzalez) [11:33:32] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [11:34:58] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [11:37:03] (03CR) 10Vgutierrez: [C:03+2] site,swift: Use profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/1114015 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez) [11:38:19] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'db1236 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73154 and previous config saved to /var/cache/conftool/dbconfig/20250204-113818-root.json [11:39:32] (03CR) 10Lucas Werkmeister (WMDE): kowikisource: Add Draft namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115377 (https://phabricator.wikimedia.org/T385162) (owner: 10Revi) [11:39:47] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [11:41:54] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [11:43:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2040 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73155 and previous config saved to /var/cache/conftool/dbconfig/20250204-114350-root.json [11:44:32] FIRING: ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:45:37] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Use FIDO2 ssh keys for production access - https://phabricator.wikimedia.org/T385229#10520528 (10taavi) FWIW, this is possible as of today, my account for example is exclusively using them for Bullseye+ hosts. There's a specific `buster_ssh_keys` admin... 
[11:47:08] RESOLVED: ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:47:38] (03PS1) 10Elukey: admin_ng: allow tuning securityContext on ml-staging's knative [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117164 (https://phabricator.wikimedia.org/T369493) [11:48:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1227 for index rebuild', diff saved to https://phabricator.wikimedia.org/P73156 and previous config saved to /var/cache/conftool/dbconfig/20250204-114808-marostegui.json [11:48:11] !log deploying new backup grants for matomo and analytics_meta T383902 [11:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:13] T383902: Upgrade backup source or mediabackup database host os to Debian bookworm or decommission them - https://phabricator.wikimedia.org/T383902 [11:48:29] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1227.eqiad.wmnet [11:49:23] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114998 (https://phabricator.wikimedia.org/T382356) (owner: 10Arturo Borrero Gonzalez) [11:49:26] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114997 (https://phabricator.wikimedia.org/T382356) (owner: 10Arturo Borrero Gonzalez) [11:52:07] (03PS1) 10Reedy: Hooks: Check for null option in onSpecialMuteModifyFormFields [extensions/Echo] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1117169 (https://phabricator.wikimedia.org/T385169) [11:52:20] (03PS1) 10Reedy: Hooks: Check for null option in onSpecialMuteModifyFormFields [extensions/Echo] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1117170 
(https://phabricator.wikimedia.org/T385169) [11:53:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1236 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73157 and previous config saved to /var/cache/conftool/dbconfig/20250204-115323-root.json [11:53:26] (03PS1) 10Máté Szabó: Remove flag $wgSecurePollSingleTransferableVoteEnabled [extensions/SecurePoll] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1117171 (https://phabricator.wikimedia.org/T376930) [11:54:02] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1227.eqiad.wmnet [11:54:03] (03CR) 10Máté Szabó: [C:03+1] Remove flag wgSecurePollSingleTransferableVoteEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115917 (https://phabricator.wikimedia.org/T376930) (owner: 10Mimurawil) [11:54:28] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1227.eqiad.wmnet with reason: Index rebuild [11:54:33] (03PS1) 10Máté Szabó: Remove flag $wgSecurePollSingleTransferableVoteEnabled [extensions/SecurePoll] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1117172 (https://phabricator.wikimedia.org/T376930) [11:54:44] (03CR) 10Kevin Bazira: [C:03+2] changeprop: add liftwing article-country source stream to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117063 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [11:54:51] !log uploaded pybal 1.15.15 to apt.wm.o (bullseye-wikimedia) T373027 [11:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:54] T373027: Add support for mh-port scheduler flag on pybal - https://phabricator.wikimedia.org/T373027 [11:55:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/SecurePoll] (wmf/1.44.0-wmf.14) - 
10https://gerrit.wikimedia.org/r/1117171 (https://phabricator.wikimedia.org/T376930) (owner: 10Máté Szabó) [11:55:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/SecurePoll] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1117172 (https://phabricator.wikimedia.org/T376930) (owner: 10Máté Szabó) [11:55:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115917 (https://phabricator.wikimedia.org/T376930) (owner: 10Mimurawil) [11:55:55] (03Merged) 10jenkins-bot: changeprop: add liftwing article-country source stream to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117063 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [11:56:58] (03CR) 10Elukey: [C:03+2] admin_ng: allow tuning securityContext on ml-staging's knative [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117164 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [11:58:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73158 and previous config saved to /var/cache/conftool/dbconfig/20250204-115855-root.json [11:59:04] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [11:59:15] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[12:00:40] jouncebot: now [12:00:41] No deployments scheduled for the next 0 hour(s) and 59 minute(s) [12:03:02] !log upgrading pybal on eqsin - T373027 [12:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:05] T373027: Add support for mh-port scheduler flag on pybal - https://phabricator.wikimedia.org/T373027 [12:03:20] (03PS1) 10Arturo Borrero Gonzalez: openstack: networktests: fix DNS recursor check [puppet] - 10https://gerrit.wikimedia.org/r/1117173 [12:04:11] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:04:25] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: networktests: fix DNS recursor check [puppet] - 10https://gerrit.wikimedia.org/r/1117173 (owner: 10Arturo Borrero Gonzalez) [12:04:44] ^^ that's me [12:04:51] !log manually dropped 2.5.1rocm6.2-1-20250202 on build2001 - T385531 [12:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:54] T385531: Image publishing via docker-pkg on build2001 repeatedly failing - https://phabricator.wikimedia.org/T385531 [12:07:58] !log manually executed docker-system-prune-dangling.service on build2001 [12:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:36] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs500[4-5]*} and A:lvs (T373027) [12:10:38] T373027: Add support for mh-port scheduler flag on pybal - https://phabricator.wikimedia.org/T373027 [12:10:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73160 and previous config saved to /var/cache/conftool/dbconfig/20250204-121056-root.json [12:11:38] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on 
P{lvs500[4-5]*} and A:lvs (T373027) [12:13:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T384592)', diff saved to https://phabricator.wikimedia.org/P73161 and previous config saved to /var/cache/conftool/dbconfig/20250204-121331-marostegui.json [12:13:35] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [12:13:48] FIRING: [3x] PuppetFailure: Puppet has failed on aux-k8s-etcd2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:14:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73162 and previous config saved to /var/cache/conftool/dbconfig/20250204-121400-root.json [12:14:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2222 for index rebuild', diff saved to https://phabricator.wikimedia.org/P73163 and previous config saved to /var/cache/conftool/dbconfig/20250204-121450-marostegui.json [12:14:59] !log upgrading pybal on secondary load balancers - T373027 [12:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:05] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2222.codfw.wmnet [12:15:41] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs2014.codfw.wmnet,lvs6003.drmrs.wmnet,lvs1020.eqiad.wmnet,lvs3010.esams.wmnet,lvs7003.magru.wmnet} and A:lvs (T373027) [12:15:43] T373027: Add support for mh-port scheduler flag on pybal - https://phabricator.wikimedia.org/T373027 [12:15:43] (03CR) 10Klausman: [C:03+1] Revert "amd-pytorch25: use ROCm 6.2 in torch 2.5.1 image" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1117158 (owner: 10Elukey) [12:16:07] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All 
pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:17:53] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs2014.codfw.wmnet,lvs6003.drmrs.wmnet,lvs1020.eqiad.wmnet,lvs3010.esams.wmnet,lvs7003.magru.wmnet} and A:lvs (T373027) [12:18:55] !log upgrading pybal on low-traffic load balancers - T373027 [12:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:01] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:19:41] that's been the case for the last 5 days :) [12:19:45] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:20:00] triggered again by a pybal restart [12:20:33] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic (T373027) [12:20:38] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2222.codfw.wmnet [12:21:35] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic (T373027) [12:21:37] T373027: Add support for mh-port scheduler flag on pybal - https://phabricator.wikimedia.org/T373027 [12:22:45] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:23:07] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2222.codfw.wmnet with reason: Index rebuild [12:23:33] !log upgrading pybal on high-traffic2 load balancers - T373027 [12:23:35] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:16] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs2012.codfw.wmnet,lvs6002.drmrs.wmnet,lvs1018.eqiad.wmnet,lvs3009.esams.wmnet,lvs7002.magru.wmnet} and A:lvs (T373027) [12:24:28] (03CR) 10Revi: kowikisource: Add Draft namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115377 (https://phabricator.wikimedia.org/T385162) (owner: 10Revi) [12:24:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10520734 (10phaultfinder) [12:25:35] (03PS2) 10Bking: dse-k8s-eqiad: Add query-service ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117126 (https://phabricator.wikimedia.org/T385551) [12:25:55] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs2012.codfw.wmnet,lvs6002.drmrs.wmnet,lvs1018.eqiad.wmnet,lvs3009.esams.wmnet,lvs7002.magru.wmnet} and A:lvs (T373027) [12:26:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73164 and previous config saved to /var/cache/conftool/dbconfig/20250204-122602-root.json [12:26:34] !log upgrading pybal on high-traffic1 load balancers - T373027 [12:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:37] T373027: Add support for mh-port scheduler flag on pybal - https://phabricator.wikimedia.org/T373027 [12:26:47] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:27:11] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs2011.codfw.wmnet,lvs6001.drmrs.wmnet,lvs1017.eqiad.wmnet,lvs3008.esams.wmnet,lvs7001.magru.wmnet} and A:lvs (T373027) 
[12:28:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P73165 and previous config saved to /var/cache/conftool/dbconfig/20250204-122838-marostegui.json [12:28:50] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs2011.codfw.wmnet,lvs6001.drmrs.wmnet,lvs1017.eqiad.wmnet,lvs3008.esams.wmnet,lvs7001.magru.wmnet} and A:lvs (T373027) [12:28:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:31:00] (03CR) 10Brouberol: [C:03+1] dse-k8s-eqiad: Add query-service ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117126 (https://phabricator.wikimedia.org/T385551) (owner: 10Bking) [12:31:16] (03CR) 10Bking: [C:03+2] dse-k8s-eqiad: Add query-service ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117126 (https://phabricator.wikimedia.org/T385551) (owner: 10Bking) [12:32:57] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply [12:33:09] (03PS6) 10Arturo Borrero Gonzalez: cloudgw1004: take over cloudgw1002 [puppet] - 10https://gerrit.wikimedia.org/r/1114998 (https://phabricator.wikimedia.org/T382356) [12:33:43] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [12:33:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:34:59] (03CR) 10Arturo Borrero Gonzalez: "check experimental" 
[puppet] - 10https://gerrit.wikimedia.org/r/1114998 (https://phabricator.wikimedia.org/T382356) (owner: 10Arturo Borrero Gonzalez) [12:35:18] (03Merged) 10jenkins-bot: dse-k8s-eqiad: Add query-service ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117126 (https://phabricator.wikimedia.org/T385551) (owner: 10Bking) [12:38:18] !log deploying new backup grants for ES hosts T383902 [12:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:21] T383902: Upgrade backup source or mediabackup database host os to Debian bookworm or decommission them - https://phabricator.wikimedia.org/T383902 [12:39:12] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:40:42] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:41:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73166 and previous config saved to /var/cache/conftool/dbconfig/20250204-124107-root.json [12:43:36] (03PS5) 10Arturo Borrero Gonzalez: cloudgw1004: take over cloudgw1001 [puppet] - 10https://gerrit.wikimedia.org/r/1114997 (https://phabricator.wikimedia.org/T382356) [12:43:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P73167 and previous config saved to /var/cache/conftool/dbconfig/20250204-124345-marostegui.json [12:43:48] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114997 (https://phabricator.wikimedia.org/T382356) (owner: 10Arturo Borrero Gonzalez) [12:44:25] (03PS1) 10Bking: dse-k8s-eqiad: fix query-service namespace resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117180 (https://phabricator.wikimedia.org/T385551) [12:44:40] 06SRE, 06Traffic, 07Wikimedia-production-error: 503 error when edit large size pages - 
https://phabricator.wikimedia.org/T385395#10520811 (10Ahonc) more errors: Request served via cp6014 cp6014, Varnish XID 739548500 Error: 503, Backend fetch failed at Tue, 04 Feb 2025 12:38:16 GMT Request served via cp3069... [12:45:21] (03CR) 10Brouberol: [C:03+1] dse-k8s-eqiad: fix query-service namespace resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117180 (https://phabricator.wikimedia.org/T385551) (owner: 10Bking) [12:46:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1229 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73168 and previous config saved to /var/cache/conftool/dbconfig/20250204-124625-root.json [12:47:54] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1222 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1117181 (https://phabricator.wikimedia.org/T385576) [12:48:50] (03CR) 10Bking: [C:03+2] dse-k8s-eqiad: fix query-service namespace resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117180 (https://phabricator.wikimedia.org/T385551) (owner: 10Bking) [12:49:46] (03PS2) 10Jcrespo: dbbackups: Update grants for misc hosts other than m1 [puppet] - 10https://gerrit.wikimedia.org/r/1116845 (https://phabricator.wikimedia.org/T383902) [12:49:46] (03PS2) 10Jcrespo: dbbackups: Remove last references to dbprov[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/1116846 (https://phabricator.wikimedia.org/T383902) [12:49:46] (03PS1) 10Jcrespo: dbbackups: Fix m5 backup grant issues [puppet] - 10https://gerrit.wikimedia.org/r/1117182 (https://phabricator.wikimedia.org/T383902) [12:52:52] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudgw1004: take over cloudgw1002 [puppet] - 10https://gerrit.wikimedia.org/r/1114998 (https://phabricator.wikimedia.org/T382356) (owner: 10Arturo Borrero Gonzalez) [12:53:01] (03Merged) 10jenkins-bot: dse-k8s-eqiad: fix query-service namespace resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117180 
(https://phabricator.wikimedia.org/T385551) (owner: 10Bking) [12:53:17] (03CR) 10Jcrespo: [C:03+2] dbbackups: Update grants for misc hosts other than m1 [puppet] - 10https://gerrit.wikimedia.org/r/1116845 (https://phabricator.wikimedia.org/T383902) (owner: 10Jcrespo) [12:53:28] (03CR) 10Jcrespo: [C:03+2] "This is now deployed" [puppet] - 10https://gerrit.wikimedia.org/r/1116845 (https://phabricator.wikimedia.org/T383902) (owner: 10Jcrespo) [12:54:40] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:54:48] (03CR) 10Andrew Bogott: [C:03+2] cloudgw1004: take over cloudgw1002 [puppet] - 10https://gerrit.wikimedia.org/r/1114998 (https://phabricator.wikimedia.org/T382356) (owner: 10Arturo Borrero Gonzalez) [12:55:02] (03PS2) 10Jcrespo: dbbackups: Fix m5 backup grant issues [puppet] - 10https://gerrit.wikimedia.org/r/1117182 (https://phabricator.wikimedia.org/T383902) [12:55:28] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:56:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73169 and previous config saved to /var/cache/conftool/dbconfig/20250204-125612-root.json [12:57:09] !log aborrero@cumin1002 START - Cookbook sre.hosts.reimage for host cloudgw1004.eqiad.wmnet with OS bookworm [12:57:35] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudgw1002.eqiad.wmnet with OS bookworm [12:58:09] mmhh jouncebot's not here? 
[12:58:15] !log aborrero@cumin1002 START - Cookbook sre.dns.netbox [12:58:42] (03CR) 10Nikerabbit: Make MT limit more strict by 10% in Bhojpuri Wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117113 (https://phabricator.wikimedia.org/T383789) (owner: 10KartikMistry) [12:58:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T384592)', diff saved to https://phabricator.wikimedia.org/P73170 and previous config saved to /var/cache/conftool/dbconfig/20250204-125852-marostegui.json [12:58:53] (03CR) 10Jcrespo: [C:03+2] "Deployed." [puppet] - 10https://gerrit.wikimedia.org/r/1117182 (https://phabricator.wikimedia.org/T383902) (owner: 10Jcrespo) [12:59:08] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1243.eqiad.wmnet with reason: Maintenance [12:59:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1243 (T384592)', diff saved to https://phabricator.wikimedia.org/P73171 and previous config saved to /var/cache/conftool/dbconfig/20250204-125913-marostegui.json [12:59:42] RESOLVED: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:59:55] jouncebot: nowandnext [12:59:56] No deployments scheduled for the next 0 hour(s) and 0 minute(s) [12:59:56] In 0 hour(s) and 0 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250204T1300) [12:59:59] godog: fixeded [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250204T1300) [13:00:13] thank you Reedy <3 [13:00:30] (03PS3) 10Jcrespo: dbbackups: Remove last references to dbprov[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/1116846 
(https://phabricator.wikimedia.org/T383902) [13:00:57] godog: of course, stashbot is also AWOL now [13:01:07] due to works going on in wmcs [13:01:09] (03CR) 10Jcrespo: "Is this ok to go? I am the service owner, but this is one of "your" databases (m1 production)." [puppet] - 10https://gerrit.wikimedia.org/r/1116846 (https://phabricator.wikimedia.org/T383902) (owner: 10Jcrespo) [13:01:23] Reedy: of course :( [13:01:26] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:01:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1229 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73172 and previous config saved to /var/cache/conftool/dbconfig/20250204-130130-root.json [13:01:50] !log aborrero@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudgw updates - aborrero@cumin1002" [13:01:53] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:01:54] !log aborrero@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudgw updates - aborrero@cumin1002" [13:01:54] !log aborrero@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:02:12] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:02:40] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:02:50] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:03:23] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:04:18] !log aborrero@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudgw1004.eqiad.wmnet with OS bookworm [13:07:42] !log aborrero@cumin1002 START - Cookbook sre.hosts.reimage for host cloudgw1004.eqiad.wmnet with OS bullseye [13:09:39] !log upgrade poolcounter-prometheus-exporter to 0.1.2 - T333947 [13:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:42] T333947: poolcounter-exporter upgrade - https://phabricator.wikimedia.org/T333947 [13:11:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73173 and previous config saved to /var/cache/conftool/dbconfig/20250204-131118-root.json [13:12:09] (03PS4) 10Jcrespo: dbbackups: Remove last references to dbprov[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/1116846 (https://phabricator.wikimedia.org/T383902) [13:12:09] (03PS1) 10Jcrespo: dbbackups: Add mpic_production and mpic_staging to scheduled backups [puppet] - 10https://gerrit.wikimedia.org/r/1117187 (https://phabricator.wikimedia.org/T385565) [13:12:09] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10520902 (10phaultfinder) [13:12:09] (03CR) 10Jcrespo: "See change also affecting oozie db." 
[puppet] - 10https://gerrit.wikimedia.org/r/1117187 (https://phabricator.wikimedia.org/T385565) (owner: 10Jcrespo) [13:12:40] (03PS1) 10Andrew Bogott: cloudgw100[34]: specify puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1117189 (https://phabricator.wikimedia.org/T382356) [13:13:25] (03PS2) 10Jcrespo: dbbackups: Add mpic_production and mpic_staging to scheduled backups [puppet] - 10https://gerrit.wikimedia.org/r/1117187 (https://phabricator.wikimedia.org/T385565) [13:13:28] (03PS2) 10Andrew Bogott: cloudgw100[34]: specify puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1117189 (https://phabricator.wikimedia.org/T382356) [13:14:17] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1002.eqiad.wmnet with reason: host reimage [13:16:32] (03PS1) 10Filippo Giunchedi: team-sre: add PoolcounterDown alert [alerts] - 10https://gerrit.wikimedia.org/r/1117190 [13:16:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1229 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73174 and previous config saved to /var/cache/conftool/dbconfig/20250204-131636-root.json [13:16:46] (03PS3) 10Andrew Bogott: cloudgw100[34]: specify puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1117189 (https://phabricator.wikimedia.org/T382356) [13:17:12] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1117189 (https://phabricator.wikimedia.org/T382356) (owner: 10Andrew Bogott) [13:17:17] (03CR) 10Filippo Giunchedi: "Replaces icinga check in I71b18994340" [alerts] - 10https://gerrit.wikimedia.org/r/1117190 (owner: 10Filippo Giunchedi) [13:17:52] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1002.eqiad.wmnet with reason: host reimage [13:17:52] (03CR) 10Filippo Giunchedi: "Alerts based on poolcounter_up is at Id8efdf411f now" [puppet] - 10https://gerrit.wikimedia.org/r/1114651 
(https://phabricator.wikimedia.org/T321808) (owner: 10Filippo Giunchedi) [13:19:43] (03CR) 10Máté Szabó: [C:03+1] profile: remove obsolete poolcounter icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/1114651 (https://phabricator.wikimedia.org/T321808) (owner: 10Filippo Giunchedi) [13:21:14] (03Abandoned) 10Andrew Bogott: cloudgw100[34]: specify puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1117189 (https://phabricator.wikimedia.org/T382356) (owner: 10Andrew Bogott) [13:23:04] (03CR) 10CDanis: [C:03+1] wmnet: add codfw aux-k8s-etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/1116867 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [13:23:38] !log aborrero@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1004.eqiad.wmnet with reason: host reimage [13:27:16] 14SRE-Sprint-Week-Sustainability-March2023, 06MediaWiki-Platform-Team, 10PoolCounter, 06serviceops, 10Sustainability (Incident Followup): Add monitoring of poolcounter service - https://phabricator.wikimedia.org/T83729#10520962 (10fgiunchedi) [13:27:34] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1004.eqiad.wmnet with reason: host reimage [13:29:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10520975 (10phaultfinder) [13:29:52] (03CR) 10Máté Szabó: [C:03+1] team-sre: add PoolcounterDown alert [alerts] - 10https://gerrit.wikimedia.org/r/1117190 (owner: 10Filippo Giunchedi) [13:31:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1229 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73175 and previous config saved to /var/cache/conftool/dbconfig/20250204-133141-root.json [13:35:46] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw1002.eqiad.wmnet with OS bookworm [13:37:08] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes 
(http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:39:32] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:40:59] 10ops-codfw, 06SRE, 06DC-Ops, 10procurement: codfw:expansion: Network devices/patch panel wiring - https://phabricator.wikimedia.org/T382219#10520998 (10ayounsi) Just coming back, I'm also curious about the upcoming FR-tech changes, is that discussed somewhere ? Other than that, +1 on everything that has... [13:44:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10521010 (10phaultfinder) [13:44:51] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw1004.eqiad.wmnet with OS bullseye [13:44:57] (03CR) 10Lucas Werkmeister (WMDE): kowikisource: Add Draft namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115377 (https://phabricator.wikimedia.org/T385162) (owner: 10Revi) [13:46:35] (03PS3) 10Revi: kowikisource: Add Draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115377 (https://phabricator.wikimedia.org/T385162) [13:46:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1229 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73176 and previous config saved to /var/cache/conftool/dbconfig/20250204-134646-root.json [13:49:10] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115377 (https://phabricator.wikimedia.org/T385162) (owner: 10Revi) [13:50:25] (03CR) 10Filippo Giunchedi: [C:03+2] team-sre: add PoolcounterDown alert [alerts] - 10https://gerrit.wikimedia.org/r/1117190 (owner: 10Filippo Giunchedi) [13:50:32] PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 33MiB (0% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [13:50:48] (03CR) 10Filippo Giunchedi: [C:03+2] profile: remove obsolete poolcounter icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/1114651 (https://phabricator.wikimedia.org/T321808) (owner: 10Filippo Giunchedi) [13:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:55:06] (03PS1) 10Arturo Borrero Gonzalez: prometheus: kernel-messages: ignore ACPI region handler message [puppet] - 10https://gerrit.wikimedia.org/r/1117195 [13:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:00:58] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] prometheus: kernel-messages: ignore ACPI region handler message [puppet] - 10https://gerrit.wikimedia.org/r/1117195 (owner: 10Arturo Borrero Gonzalez) [14:01:27] jouncebot: nowandnext [14:01:32] ded [14:01:36] F 
[14:01:53] ANOTHER ONE [14:03:09] jouncebot: nowandnext [14:03:09] For the next 0 hour(s) and 56 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250204T1400) [14:03:09] In 1 hour(s) and 56 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250204T1600) [14:04:06] seems like the clock ticked while it was `ded` so no usual pingz I guess? [14:04:25] yeah [14:04:30] almost certainly [14:04:31] let's see.. [14:04:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10521051 (10phaultfinder) [14:05:02] Lucas_WMDE: urbanecm: TheresNoTime: ^ lovely backport time [14:05:06] oh, already! [14:05:07] o/ [14:05:19] o/ [14:05:28] * revi waiting for patch so he can go to bed [14:05:39] I can deploy! [14:05:46] * urbanecm has a meeting [14:05:52] let’s start with kowikisource drafts then [14:06:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115377 (https://phabricator.wikimedia.org/T385162) (owner: 10Revi) [14:06:47] (03Merged) 10jenkins-bot: kowikisource: Add Draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115377 (https://phabricator.wikimedia.org/T385162) (owner: 10Revi) [14:06:56] 2002 I guess... 
[14:07:21] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1115377|kowikisource: Add Draft namespace (T385162)]] [14:09:02] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1115377|kowikisource: Add Draft namespace (T385162)]] # re-log from 14:07 UTC [14:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:05] T385162: Add Draft: Namespace for kowikisource - https://phabricator.wikimedia.org/T385162 [14:10:22] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: sync [14:10:54] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync [14:12:57] * Lucas_WMDE has not seen kevinbazira yet btw [14:14:08] why is build-and-push-container-images being so slow o_O [14:14:11] started 14:07:50… [14:14:15] * Lucas_WMDE peeks at the log [14:14:56] not much info there, apparently it’s been running docker-pusher for four minutes [14:15:22] I guess this step is generally going to be somewhat slower until the PHP 8.1 migration is finished, because it currently has to build twice as many images? (PHP 7.4 and 8.1) [14:17:40] Lucas_WMDE: maybe someone updated the images, and they're rebuilt from scratch (rather than from cache)? 
[14:18:17] I hope not :/ [14:18:31] if I read the output correctly, the “normal” image finished pushing, the -81 image is still being pushed [14:18:52] push harder dear server /me cheers [14:19:18] (03PS3) 10Andrew Bogott: sysctl: Introduce base::sysctl::inotify helper [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: 10BryanDavis) [14:19:21] (03CR) 10Andrew Bogott: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: 10BryanDavis) [14:19:29] ok *finally* it’s making progress [14:19:35] and… building something new again apparently :S [14:20:12] Finished build-and-push-container-images (duration: 12m 07s) [14:20:39] I think it might have been a hiccup in the registry tbh [14:21:25] the log shows no reason for the nine-minute delay (exactly) between running docker-pusher and the registry responding (IIUC) [14:21:41] Related to the cloud stuff upsetting bots? [14:21:47] hm [14:21:49] maybe? 
[14:21:50] no idea [14:21:51] I don't really know what lives where in that stack [14:22:39] (03PS4) 10Andrew Bogott: sysctl: Introduce base::sysctl::inotify helper [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: 10BryanDavis) [14:22:41] (03CR) 10Andrew Bogott: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: 10BryanDavis) [14:24:46] revi: you should be able to start testing on k8s-mwdebug now [14:25:35] (03CR) 10Andrew Bogott: sysctl: Introduce base::sysctl::inotify helper [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: 10BryanDavis) [14:25:45] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, revi: Backport for [[gerrit:1115377|kowikisource: Add Draft namespace (T385162)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:25:47] `editnotice-118`, LGTM https://sharex.revicdn.net/2025/02/firefox_FHpKRt6KgH.png [14:25:49] T385162: Add Draft: Namespace for kowikisource - https://phabricator.wikimedia.org/T385162 [14:26:13] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, revi: Continuing with sync [14:26:16] ok, thanks! [14:30:02] RIP stashbot [14:30:32] nooooh [14:32:05] back from the grave [14:32:23] and propagating somewhat slowly, it seems :P [14:32:41] hm? [14:33:09] propagating in what sense? [14:33:19] turned off the k8s-mwdebug, saw nstab-main [14:33:28] ah, I see [14:33:39] yeah the deployment is still running [14:33:41] then refresh gave me nstab-draft [14:33:43] 10ops-codfw, 06SRE, 06DC-Ops, 10procurement: codfw:expansion: Network devices/patch panel wiring - https://phabricator.wikimedia.org/T382219#10521197 (10Papaul) @ayounsi welcome back. the only information we have right now is frack having 1 more rack in the new cage and it will have all the fundraising ana... 
[14:33:54] so yup [14:33:59] I thought that was still referring to stashbot ^^ [14:34:04] aha [14:36:27] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1115377|kowikisource: Add Draft namespace (T385162)]] (duration: 29m 05s) [14:36:29] T385162: Add Draft: Namespace for kowikisource - https://phabricator.wikimedia.org/T385162 [14:36:32] yay [14:36:38] (duration: 29m 05s) [14:36:38] wow [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:45] kevinbazira: your patch is next :) [14:36:49] and thanks! [14:36:50] let’s hope the deployment goes quicker ;_; [14:36:59] * revi signs off to bed [14:37:01] revi: np, good night :D [14:37:16] Lucas_WMDE: o/ [14:37:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112451 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [14:37:51] mszabo: are you still around for a while after the end of the window? 
[14:37:56] I doubt we’ll finish all the deployments in time [14:38:02] (03Merged) 10jenkins-bot: EventStreamConfig: Add mediawiki.article_country_prediction_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112451 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [14:38:04] but there’s nothing immediately afterwards so we could just keep going [14:38:10] Lucas_WMDE: yea, I'm fine with that [14:38:13] ok :) [14:38:14] thanks :) [14:38:31] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1112451|EventStreamConfig: Add mediawiki.article_country_prediction_change stream (T382295)]] [14:38:33] I’ll +2 my backports already [14:38:35] T382295: Create event stream for article-country model-server hosted on LiftWing - https://phabricator.wikimedia.org/T382295 [14:38:48] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/EntitySchema] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1117163 (https://phabricator.wikimedia.org/T385272) (owner: 10Lucas Werkmeister (WMDE)) [14:38:51] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/EntitySchema] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1117162 (https://phabricator.wikimedia.org/T385272) (owner: 10Lucas Werkmeister (WMDE)) [14:42:22] (03CR) 10Elukey: "I'd personally use a different resource name, so we avoid to have two sysctl config files doing the same thing. 
In theory we should $ensur" [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: 10BryanDavis) [14:43:34] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, kevinbazira: Backport for [[gerrit:1112451|EventStreamConfig: Add mediawiki.article_country_prediction_change stream (T382295)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:43:37] T382295: Create event stream for article-country model-server hosted on LiftWing - https://phabricator.wikimedia.org/T382295 [14:45:15] kevinbazira: can you test the change on mwdebug? [14:45:33] Lucas_WMDE: checking ... [14:46:14] Lucas_WMDE: o/ that patch is difficult to test, it is needed by eventgate to be able to configure some streams [14:46:19] I think it is safe to proceed [14:46:26] ok, thanks [14:46:32] jouncebot died ;_; [14:46:37] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, kevinbazira: Continuing with sync [14:46:56] Lucas_WMDE: I can not see the `mediawiki.article_country_prediction_change.v1` stream listed in beta `https://meta.wikimedia.beta.wmflabs.org/w/api.php?action=streamconfigs` [14:46:57] could be a cache issue on my end [14:47:33] hm, I don’t see a difference there either [14:47:50] wait, that’s a beta link [14:48:01] kevinbazira: on https://meta.wikimedia.org/w/api.php?action=streamconfigs it shows up [14:48:54] (03Merged) 10jenkins-bot: Avoid PHP Notice on missing entityschema-meta-tags [extensions/EntitySchema] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1117163 (https://phabricator.wikimedia.org/T385272) (owner: 10Lucas Werkmeister (WMDE)) [14:50:06] (03Merged) 10jenkins-bot: Avoid PHP Notice on missing entityschema-meta-tags [extensions/EntitySchema] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1117162 (https://phabricator.wikimedia.org/T385272) (owner: 10Lucas Werkmeister (WMDE)) [14:50:22] 06SRE, 06Traffic, 07Wikimedia-production-error: 503 error when edit large size pages - 
https://phabricator.wikimedia.org/T385395#10521269 (10Vgutierrez) thanks for your report @Ahonc, I'm seeing your requests hitting a timeout after 125 seconds, this matches with our `idle_send_timeout` configuration (https:... [14:51:31] jouncebot: nowandnext [14:51:31] For the next 0 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250204T1400) [14:51:31] In 1 hour(s) and 8 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250204T1600) [14:51:37] jouncebot: thx :) [14:53:14] (03CR) 10Fabfur: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1117157 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [14:54:54] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1112451|EventStreamConfig: Add mediawiki.article_country_prediction_change stream (T382295)]] (duration: 16m 23s) [14:54:58] T382295: Create event stream for article-country model-server hosted on LiftWing - https://phabricator.wikimedia.org/T382295 [14:55:15] deploying my two backports next [14:55:38] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1117163|Avoid PHP Notice on missing entityschema-meta-tags (T385272)]], [[gerrit:1117162|Avoid PHP Notice on missing entityschema-meta-tags (T385272)]] [14:55:41] T385272: PHP Notice: Trying to access array offset on value of type null - https://phabricator.wikimedia.org/T385272 [14:57:19] Lucas_WMDE: thanks. 
now I can see the `mediawiki.article_country_prediction_change.v1` stream listed in beta `https://meta.wikimedia.beta.wmflabs.org/w/api.php?action=streamconfigs` [14:58:37] 06SRE, 06Traffic, 07Wikimedia-production-error: 503 error when edit large size pages - https://phabricator.wikimedia.org/T385395#10521325 (10Ahonc) > tracert uk.wikipedia.org Tracing route to dyna.wikimedia.org [185.15.59.224] over a maximum of 30 hops: 1 4 ms 2 ms 2 ms router.lan [192.168.8... [14:59:02] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1117163|Avoid PHP Notice on missing entityschema-meta-tags (T385272)]], [[gerrit:1117162|Avoid PHP Notice on missing entityschema-meta-tags (T385272)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:59:07] testing… [14:59:26] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [14:59:29] lgtm \o/ [14:59:40] (tested with the two links from https://phabricator.wikimedia.org/T385272#10516387) [15:00:12] (03CR) 10Ssingh: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1117157 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [15:00:51] (03CR) 10Ssingh: [C:03+1] site,hiera: Reimage lvs4008 as liberica LB [puppet] - 10https://gerrit.wikimedia.org/r/1117138 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [15:02:59] I can see the `mediawiki.article_country_prediction_change.v1` stream listed in prod too `https://meta.wikimedia.org/w/api.php?action=streamconfigs&all_settings`. thanks again Lucas! [15:03:27] np :) [15:04:26] (03PS2) 10Herron: wmnet: add codfw aux-k8s-etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/1116867 (https://phabricator.wikimedia.org/T381417) [15:05:09] (03PS1) 10Jon Harald Søby: Add sourceswiki to $wgImportSources for all Wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117204 (https://phabricator.wikimedia.org/T385591) [15:05:23] Lucas_WMDE: All done? 
[15:05:36] Reedy: I still have two pending after Lucas' patches [15:05:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 05 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117204 (https://phabricator.wikimedia.org/T385591) (owner: 10Jon Harald Søby) [15:05:49] Reedy: yeah, I’m still deploying and then it would be mszabo’s turn ideally [15:06:02] 06SRE, 06Traffic, 07Wikimedia-production-error: 503 error when edit large size pages - https://phabricator.wikimedia.org/T385395#10521377 (10Ahonc) from other network: > tracert uk.wikipedia.org Tracing route to dyna.wikimedia.org [185.15.59.224] over a maximum of 30 hops: 1 3 ms 2 ms 1 ms... [15:06:15] (but mine should be done soon, k8s is at 94%) [15:06:30] mszabo: pffft [15:06:35] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1117163|Avoid PHP Notice on missing entityschema-meta-tags (T385272)]], [[gerrit:1117162|Avoid PHP Notice on missing entityschema-meta-tags (T385272)]] (duration: 10m 56s) [15:06:38] T385272: PHP Notice: Trying to access array offset on value of type null - https://phabricator.wikimedia.org/T385272 [15:06:38] (* now finished at 94% due to T375514) [15:06:39] T375514: Scap k8s deployments sometimes report final progress < 100% - https://phabricator.wikimedia.org/T375514 [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:03] I’m done deploying, Reedy is your deployment urgent? [15:08:14] Nah [15:08:24] I was just going to squash the Echo PHP 8.1 logspam :) [15:08:36] Go ahead with mszabo's [15:08:45] thanks [15:08:55] mszabo: want to self-service? 
^^ [15:09:04] sure thing [15:09:14] Reedy: I suspected as much ;P [15:09:25] (thanks for clearing out the PHP 8.1 deprecations <3) [15:09:42] danke for the CR too! [15:11:18] (03CR) 10Herron: [C:03+2] wmnet: add codfw aux-k8s-etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/1116867 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [15:11:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy2002 using scap backport" [extensions/SecurePoll] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1117172 (https://phabricator.wikimedia.org/T376930) (owner: 10Máté Szabó) [15:11:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy2002 using scap backport" [extensions/SecurePoll] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1117171 (https://phabricator.wikimedia.org/T376930) (owner: 10Máté Szabó) [15:12:24] !log herron@dns1004 START - running authdns-update [15:14:31] !log herron@dns1004 END - running authdns-update [15:14:45] (03Merged) 10jenkins-bot: Remove flag $wgSecurePollSingleTransferableVoteEnabled [extensions/SecurePoll] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1117172 (https://phabricator.wikimedia.org/T376930) (owner: 10Máté Szabó) [15:15:26] (03Merged) 10jenkins-bot: Remove flag $wgSecurePollSingleTransferableVoteEnabled [extensions/SecurePoll] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1117171 (https://phabricator.wikimedia.org/T376930) (owner: 10Máté Szabó) [15:15:59] !log mszabo@deploy2002 Started scap sync-world: Backport for [[gerrit:1117172|Remove flag $wgSecurePollSingleTransferableVoteEnabled (T376930)]], [[gerrit:1117171|Remove flag $wgSecurePollSingleTransferableVoteEnabled (T376930)]] [15:16:02] T376930: remove $wgSecurePollSingleTransferableVoteEnabled? - https://phabricator.wikimedia.org/T376930 [15:20:47] (03CR) 10Ssingh: [C:03+1] "Looks good, compared against lvs::balancer." 
[puppet] - 10https://gerrit.wikimedia.org/r/1117153 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [15:20:52] !log mszabo@deploy2002 mszabo: Backport for [[gerrit:1117172|Remove flag $wgSecurePollSingleTransferableVoteEnabled (T376930)]], [[gerrit:1117171|Remove flag $wgSecurePollSingleTransferableVoteEnabled (T376930)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:21:33] (03CR) 10Vgutierrez: [C:03+2] hiera,liberica: Add missing role options [puppet] - 10https://gerrit.wikimedia.org/r/1117153 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [15:21:54] (03PS1) 10Elukey: knative: fix patch command and backport for patches for PSS migration [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1117207 (https://phabricator.wikimedia.org/T369493) [15:23:06] !log mszabo@deploy2002 mszabo: Continuing with sync [15:24:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10521449 (10phaultfinder) [15:29:45] !log mszabo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1117172|Remove flag $wgSecurePollSingleTransferableVoteEnabled (T376930)]], [[gerrit:1117171|Remove flag $wgSecurePollSingleTransferableVoteEnabled (T376930)]] (duration: 13m 46s) [15:29:48] T376930: remove $wgSecurePollSingleTransferableVoteEnabled? 
- https://phabricator.wikimedia.org/T376930 [15:32:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115917 (https://phabricator.wikimedia.org/T376930) (owner: 10Mimurawil) [15:32:39] !log reimaging lvs4008 as a liberica LB - T384477 [15:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:42] T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477 [15:32:46] (03Merged) 10jenkins-bot: Remove flag wgSecurePollSingleTransferableVoteEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115917 (https://phabricator.wikimedia.org/T376930) (owner: 10Mimurawil) [15:33:05] (03CR) 10Vgutierrez: [C:03+2] site,hiera: Reimage lvs4008 as liberica LB [puppet] - 10https://gerrit.wikimedia.org/r/1117138 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [15:33:15] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:33:15] !log mszabo@deploy2002 Started scap sync-world: Backport for [[gerrit:1115917|Remove flag wgSecurePollSingleTransferableVoteEnabled (T376930)]] [15:33:15] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:34:05] ^ expected [15:34:30] (03CR) 10Reedy: [C:03+2] Hooks: Check for null option in onSpecialMuteModifyFormFields [extensions/Echo] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1117169 (https://phabricator.wikimedia.org/T385169) (owner: 10Reedy) [15:34:32] (03CR) 10Reedy: [C:03+2] Hooks: Check for null option in onSpecialMuteModifyFormFields [extensions/Echo] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1117170 (https://phabricator.wikimedia.org/T385169) (owner: 10Reedy) [15:34:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit 
- https://phabricator.wikimedia.org/T383383#10521463 (10phaultfinder) [15:34:57] PROBLEM - pybal on lvs4008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:35:17] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs4008.ulsfo.wmnet with OS bookworm [15:36:54] !log mszabo@deploy2002 mimurawil, mszabo: Backport for [[gerrit:1115917|Remove flag wgSecurePollSingleTransferableVoteEnabled (T376930)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:36:57] T376930: remove $wgSecurePollSingleTransferableVoteEnabled? - https://phabricator.wikimedia.org/T376930 [15:37:25] !log mszabo@deploy2002 mimurawil, mszabo: Continuing with sync [15:37:35] (03PS1) 10Herron: wmnet: aux-k8s-etcd codfw fix typo [dns] - 10https://gerrit.wikimedia.org/r/1117209 [15:39:41] FIRING: JobUnavailable: Reduced availability for job pybal in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:39:55] (03PS1) 10Reedy: Poem: Null coalescence $in [extensions/Poem] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1117210 (https://phabricator.wikimedia.org/T385588) [15:39:59] (03CR) 10Reedy: [C:03+2] Poem: Null coalescence $in [extensions/Poem] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1117210 (https://phabricator.wikimedia.org/T385588) (owner: 10Reedy) [15:40:06] (03PS1) 10Reedy: Poem: Null coalescence $in [extensions/Poem] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1117211 (https://phabricator.wikimedia.org/T385588) [15:40:11] (03CR) 10Reedy: [C:03+2] Poem: Null coalescence $in [extensions/Poem] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1117211 (https://phabricator.wikimedia.org/T385588) (owner: 10Reedy) [15:42:15] (03CR) 10Herron: [C:03+2] wmnet: aux-k8s-etcd codfw fix typo 
[dns] - 10https://gerrit.wikimedia.org/r/1117209 (owner: 10Herron) [15:42:20] (03PS1) 10Arturo Borrero Gonzalez: prometheus: kernel-messages-ignore-regex: add more ACPI messages [puppet] - 10https://gerrit.wikimedia.org/r/1117215 [15:42:39] !log herron@dns1004 START - running authdns-update [15:43:40] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] prometheus: kernel-messages-ignore-regex: add more ACPI messages [puppet] - 10https://gerrit.wikimedia.org/r/1117215 (owner: 10Arturo Borrero Gonzalez) [15:44:36] !log mszabo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1115917|Remove flag wgSecurePollSingleTransferableVoteEnabled (T376930)]] (duration: 11m 21s) [15:44:38] !log herron@dns1004 END - running authdns-update [15:44:39] T376930: remove $wgSecurePollSingleTransferableVoteEnabled? - https://phabricator.wikimedia.org/T376930 [15:44:51] Reedy: should be done [15:44:58] cheers [15:45:23] (03Merged) 10jenkins-bot: Hooks: Check for null option in onSpecialMuteModifyFormFields [extensions/Echo] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1117169 (https://phabricator.wikimedia.org/T385169) (owner: 10Reedy) [15:45:24] (03Merged) 10jenkins-bot: Hooks: Check for null option in onSpecialMuteModifyFormFields [extensions/Echo] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1117170 (https://phabricator.wikimedia.org/T385169) (owner: 10Reedy) [15:47:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73177 and previous config saved to /var/cache/conftool/dbconfig/20250204-154740-root.json [15:48:01] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [15:50:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:51:25] 
!log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:52:29] (03Merged) 10jenkins-bot: Poem: Null coalescence $in [extensions/Poem] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1117210 (https://phabricator.wikimedia.org/T385588) (owner: 10Reedy) [15:52:31] (03Merged) 10jenkins-bot: Poem: Null coalescence $in [extensions/Poem] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1117211 (https://phabricator.wikimedia.org/T385588) (owner: 10Reedy) [15:53:09] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage [15:54:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10521568 (10Jhancock.wm) that's a new flag for me. ty. it did work and it at least started this time. but it did crash. at a similar spot to two SM servers... [15:54:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10521569 (10phaultfinder) [15:55:58] (03CR) 10Vgutierrez: [C:03+2] liberica: Reload config using SIGHUP [puppet] - 10https://gerrit.wikimedia.org/r/1117157 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [15:56:51] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage [15:59:42] RESOLVED: JobUnavailable: Reduced availability for job pybal in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:00:05] jelto, arnoldokoth, and mutante: Time to do the SRE Collaboration Services office hours deploy. Don't look at me like that. You signed up for it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250204T1600). [16:00:53] !log herron@cumin1002 START - Cookbook sre.dns.wipe-cache _etcd-server-ssl._tcp.aux-k8s-etcd.codfw.wmnet on all recursors [16:00:57] !log herron@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) _etcd-server-ssl._tcp.aux-k8s-etcd.codfw.wmnet on all recursors [16:02:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73178 and previous config saved to /var/cache/conftool/dbconfig/20250204-160246-root.json [16:03:06] !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1117210|Poem: Null coalescence $in (T385588)]], [[gerrit:1117211|Poem: Null coalescence $in (T385588)]], [[gerrit:1117169|Hooks: Check for null option in onSpecialMuteModifyFormFields (T385169)]], [[gerrit:1117170|Hooks: Check for null option in onSpecialMuteModifyFormFields (T385169)]] [16:03:10] T385588: PHP Deprecated: preg_replace_callback(): Passing null to parameter #3 ($subject) of type array|string is deprecated - https://phabricator.wikimedia.org/T385588 [16:03:10] T385169: PHP Deprecated: preg_split(): Passing null to parameter #2 ($subject) of type string is deprecated - https://phabricator.wikimedia.org/T385169 [16:04:55] Lucas_WMDE: Think you were just unlucky [16:05:04] 16:04:30 Finished build-and-push-container-images (duration: 00m 57s) [16:05:09] with the deployment duration? [16:05:14] yeah, ok [16:05:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10521626 (10Papaul) @Jhancock.wm thank you for working on this. Yes for all ganeti nodes we enable virtualization because these nodes will be running VM's... 
[16:06:21] !log reedy@deploy2002 reedy: Backport for [[gerrit:1117210|Poem: Null coalescence $in (T385588)]], [[gerrit:1117211|Poem: Null coalescence $in (T385588)]], [[gerrit:1117169|Hooks: Check for null option in onSpecialMuteModifyFormFields (T385169)]], [[gerrit:1117170|Hooks: Check for null option in onSpecialMuteModifyFormFields (T385169)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:06:24] !log reedy@deploy2002 reedy: Continuing with sync [16:09:15] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:09:17] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:10:01] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1117219 [16:12:57] !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1117210|Poem: Null coalescence $in (T385588)]], [[gerrit:1117211|Poem: Null coalescence $in (T385588)]], [[gerrit:1117169|Hooks: Check for null option in onSpecialMuteModifyFormFields (T385169)]], [[gerrit:1117170|Hooks: Check for null option in onSpecialMuteModifyFormFields (T385169)]] (duration: 09m 50s) [16:13:01] T385588: PHP Deprecated: preg_replace_callback(): Passing null to parameter #3 ($subject) of type array|string is deprecated - https://phabricator.wikimedia.org/T385588 [16:13:01] T385169: PHP Deprecated: preg_split(): Passing null to parameter #2 ($subject) of type string is deprecated - https://phabricator.wikimedia.org/T385169 [16:13:48] FIRING: [3x] PuppetFailure: Puppet has failed on aux-k8s-etcd2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:13:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 10%: Repooling after rebuild 
index', diff saved to https://phabricator.wikimedia.org/P73179 and previous config saved to /var/cache/conftool/dbconfig/20250204-161355-root.json [16:14:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10521652 (10phaultfinder) [16:16:56] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: replace faulty optic et-0/0/0 [16:17:03] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288#10521665 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a50b2671-d855-40a0-8790-c502280b9115) set by cmooney@cumin100... [16:17:09] !log disable et-0/0/0 on cr3-ulsfo to prep for optic replacement [16:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:14] !log disable et-0/0/0 on cr3-ulsfo to prep for optic replacement T384288 [16:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:17] T384288: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288 [16:17:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73180 and previous config saved to /var/cache/conftool/dbconfig/20250204-161751-root.json [16:24:34] Can someone grant me write access to the extension-MathSearch user group in gerrit https://gerrit.wikimedia.org/r/admin/groups/82667f024697797a0f63477b3cee35ac03040535 [16:26:23] I can currently only edit groups math and extension-math. 
However, this does not allow to grant access to MathSearch (which is not WMF deployed) alone [16:27:45] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:28:42] physikerwelt: I suggest you want #wikimedia-releng [16:29:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73181 and previous config saved to /var/cache/conftool/dbconfig/20250204-162900-root.json [16:29:13] physikerwelt: you are already in there I think? https://gerrit.wikimedia.org/r/admin/groups/82667f024697797a0f63477b3cee35ac03040535,members [16:29:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10521724 (10phaultfinder) [16:30:56] bd808: yes, however, in contrast to the group extension-math who is its own owner, the owner of mathsearch is administrators. That is why I can only add to extension-math but not to extension-mathsearch [16:32:01] (see https://gerrit.wikimedia.org/r/admin/groups/84251d6a3201abef86657bfb35019bc0199c1b5e) [16:32:44] ah, got it. If I had admin on gerrit I would fix. 
As RhinosF1 mentioned you may find more help in #wikimedia-releng or with a phab task [16:32:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:32:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73182 and previous config saved to /var/cache/conftool/dbconfig/20250204-163256-root.json [16:36:40] (03CR) 10Ilias Sarantopoulos: [C:03+1] knative: fix patch command and backport for patches for PSS migration [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1117207 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [16:36:45] RhinosF1: bd808: thank you [16:37:11] (03CR) 10Elukey: [V:03+2 C:03+2] knative: fix patch command and backport for patches for PSS migration [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1117207 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [16:44:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73183 and previous config saved to /var/cache/conftool/dbconfig/20250204-164405-root.json [16:47:36] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288#10521778 (10cmooney) >>! In T384288#10505646, @RobH wrote: > Remote hands 01020815 scheduled for 2025-02-04 @ 0800 Pacific (1600 GMT). Ha... 
[16:48:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73184 and previous config saved to /var/cache/conftool/dbconfig/20250204-164802-root.json [16:48:40] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:49:59] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:50:51] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:51:09] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:52:20] (03PS3) 10Giuseppe Lavagetto: mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 [16:52:20] (03PS1) 10Giuseppe Lavagetto: Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225 [16:53:26] (03CR) 10CI reject: [V:04-1] Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225 (owner: 10Giuseppe Lavagetto) [16:56:18] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - AS7195/IPv4: Connect - EdgeUno https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:56:28] !log vgutierrez@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs4008.ulsfo.wmnet with OS bookworm [16:57:52] (03PS4) 10Herron: aux-k8s-etcd: bootstrap codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/1117219 (https://phabricator.wikimedia.org/T381417) [16:58:04] (03CR) 10Herron: [V:03+1 C:03+2] aux-k8s-etcd: bootstrap codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/1117219 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) 
[16:58:49] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs4008.ulsfo.wmnet with OS bookworm [16:59:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73185 and previous config saved to /var/cache/conftool/dbconfig/20250204-165909-root.json [17:00:07] jhathaway and rzl: Time to snap out of that daydream and deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250204T1700). [17:00:08] No Gerrit patches in the queue for this window AFAICS. [17:01:12] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:03:49] FIRING: [3x] PuppetFailure: Puppet has failed on aux-k8s-etcd2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:11:32] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:12:51] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:13:49] FIRING: [3x] PuppetFailure: Puppet has failed on aux-k8s-etcd2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:14:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73186 and previous config saved to /var/cache/conftool/dbconfig/20250204-171415-root.json [17:14:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10521945 (10phaultfinder) [17:15:16] PROBLEM - BGP status on 
cr2-magru is CRITICAL: BGP CRITICAL - AS4265007001/IPv4: Connect - asw1-b3-magru https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:16:05] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage [17:17:14] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:17:16] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:17:59] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:18:53] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage [17:20:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10521978 (10elukey) Very interesting - running the provision cookbook with --uefi worked fine, then I retried to restore legacy/bios (removing --uefi from... 
[17:23:49] RESOLVED: [2x] PuppetFailure: Puppet has failed on aux-k8s-etcd2004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:27:32] (03PS1) 10Jforrester: [wikifunctionswiki] Set flags for repo mode (on) and client (off) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117228 [17:28:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117228 (owner: 10Jforrester) [17:29:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10522011 (10phaultfinder) [17:33:14] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:33:16] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:34:33] (03PS1) 10Elukey: admin_ng: bump knative serving's default image tags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117230 (https://phabricator.wikimedia.org/T369493) [17:36:02] (03PS1) 10Elukey: sre.hosts.reimage: add logging and confirmation when forcing puppet 5 [cookbooks] - 10https://gerrit.wikimedia.org/r/1117231 [17:37:31] ^^ BGP errors expected, lvs4008 enjoying its second reimage today [17:40:33] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4008.ulsfo.wmnet with OS bookworm [17:41:14] (03PS1) 10Vgutierrez: hiera: Restore lvs4008 priority [puppet] - 10https://gerrit.wikimedia.org/r/1117232 (https://phabricator.wikimedia.org/T384477) [17:41:18] (03CR) 10Elukey: [C:03+2] admin_ng: bump knative serving's default image tags [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1117230 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [17:42:18] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1117232 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [17:42:30] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:44:45] (03CR) 10Fabfur: [C:03+1] hiera: Restore lvs4008 priority [puppet] - 10https://gerrit.wikimedia.org/r/1117232 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [17:45:49] (03PS2) 10Elukey: sre.hosts.reimage: add logging and confirmation when forcing puppet 5 [cookbooks] - 10https://gerrit.wikimedia.org/r/1117231 [17:46:15] (03CR) 10Ssingh: [C:03+1] hiera: Restore lvs4008 priority [puppet] - 10https://gerrit.wikimedia.org/r/1117232 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [17:46:55] (03PS3) 10Elukey: sre.hosts.reimage: add logging and confirmation when forcing puppet 5 [cookbooks] - 10https://gerrit.wikimedia.org/r/1117231 [17:47:21] (03CR) 10Volans: sre.hosts.reimage: add logging and confirmation when forcing puppet 5 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1117231 (owner: 10Elukey) [17:47:40] jouncebot: nowandnext [17:47:40] For the next 0 hour(s) and 12 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250204T1700) [17:47:40] In 0 hour(s) and 12 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250204T1800) [17:47:42] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:47:47] !log codesearch.wmflabs.org - hard reboot instance for needed mass reboots in cloud VPS [17:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:28] (03CR) 
10Volans: [C:03+1] "LGTM as an interim solution while we check if we can remove most of the compatibility code" [cookbooks] - 10https://gerrit.wikimedia.org/r/1117231 (owner: 10Elukey) [17:48:31] (03CR) 10Vgutierrez: [C:03+2] hiera: Restore lvs4008 priority [puppet] - 10https://gerrit.wikimedia.org/r/1117232 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [17:48:44] FYI, since things appear to be quiet, I'm going to move ahead with some of the pre-work for the upcoming infra window [17:49:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10522096 (10elukey) This is very weird - I went to BIOS and selected BIOS Mode UEFI, then reselected Legacy. Saved and reset. re-ran the cookbook and: ` e... [17:49:45] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): scale next to 15% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116891 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:50:54] (03Merged) 10jenkins-bot: mw-(api-ext|web): scale next to 15% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116891 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:51:34] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [17:51:41] !log repooling lvs4008 - T384477 [17:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:43] T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477 [17:53:04] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [17:56:48] (03CR) 10Elukey: [C:03+2] sre.hosts.reimage: add logging and confirmation when forcing puppet 5 [cookbooks] - 10https://gerrit.wikimedia.org/r/1117231 (owner: 10Elukey) [18:00:05] swfrench-wmf: #bothumor When your hammer is PHP, everything starts looking like a thumb. 
Rise for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250204T1800). [18:00:11] o/ [18:00:16] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:00:38] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [18:01:03] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [18:01:21] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [18:02:09] !log scaled mw-web next to 15% of main release - T383845 [18:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:13] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [18:03:45] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [18:04:01] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [18:04:40] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [18:04:53] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [18:05:02] !log scaled mw-api-ext next to 15% of main release - T383845 [18:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116892 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:08:32] (03Merged) 10jenkins-bot: Enroll 25% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116892 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:08:59] !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1116892|Enroll 25% of client sessions in PHP 8.1 (T383845)]] [18:09:02] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - 
https://phabricator.wikimedia.org/T383845 [18:10:23] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10522166 (10VRiley-WMF) pay-lb1001 cable ID A: 230304500172 cable ID B: 230304500232 Port 34 Pay-lb1002 cableID A: 230304500228 cableID B: 2303045001... [18:11:20] 06SRE, 06Traffic, 07Wikimedia-production-error: 503 error when edit large size pages - https://phabricator.wikimedia.org/T385395#10522170 (10Ahonc) I send same request in Postman, and got such timeline: It waits something 125 sec {F58356964} [18:12:08] !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1116892|Enroll 25% of client sessions in PHP 8.1 (T383845)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:13:54] !log swfrench@deploy2002 swfrench: Continuing with sync [18:15:49] (03CR) 10Cwhite: [C:03+2] statsd-exporter: set ttl to 30d [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105972 (https://phabricator.wikimedia.org/T359497) (owner: 10Cwhite) [18:18:03] (03Merged) 10jenkins-bot: statsd-exporter: set ttl to 30d [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105972 (https://phabricator.wikimedia.org/T359497) (owner: 10Cwhite) [18:19:30] (03PS1) 10Herron: aux-k8s-etcd: set bootstrap false [puppet] - 10https://gerrit.wikimedia.org/r/1117235 [18:20:24] !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1116892|Enroll 25% of client sessions in PHP 8.1 (T383845)]] (duration: 11m 25s) [18:20:27] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [18:23:12] holding for a bit while traffic ramps up before moving ahead with work on mw-api-int [18:23:20] (03PS2) 10Herron: aux-k8s-etcd: set bootstrap false [puppet] - 10https://gerrit.wikimedia.org/r/1117235 (https://phabricator.wikimedia.org/T381417) [18:25:03] (03CR) 10Herron: [C:03+2] aux-k8s-etcd: set bootstrap false [puppet] 
- 10https://gerrit.wikimedia.org/r/1117235 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [18:30:02] PROBLEM - Host ms-be2075 is DOWN: PING CRITICAL - Packet loss = 100% [18:30:29] !log cwhite@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply [18:30:32] !log cwhite@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply [18:31:33] !log cwhite@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [18:31:36] !log cwhite@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [18:32:35] (03CR) 10Scott French: [C:03+2] mw-api-int: serve 2% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116893 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:33:45] 06SRE, 06Infrastructure-Foundations, 07Kubernetes, 10SRE Observability (FY2024/2025-Q3): aux-k8s-codfw cluster setup - https://phabricator.wikimedia.org/T381417#10522281 (10herron) [18:33:47] (03Merged) 10jenkins-bot: mw-api-int: serve 2% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116893 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:34:51] 06SRE, 06Traffic, 07Wikimedia-production-error: 503 error when edit large size pages - https://phabricator.wikimedia.org/T385395#10522286 (10ssingh) Hi @Ahonc: Can you also share a traceroute to two other domains for comparison? Please share the output for traceroute to `google.com` and `text-lb.drmrs.wikime... 
[18:35:55] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [18:36:15] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [18:36:24] RECOVERY - Check unit status of etcd-backup on aux-k8s-etcd2004 is OK: OK: Status of the systemd unit etcd-backup https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:36:49] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [18:37:04] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [18:39:24] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [18:39:35] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [18:39:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10522293 (10phaultfinder) [18:39:49] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [18:40:01] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [18:40:24] RECOVERY - Check unit status of etcd-backup on aux-k8s-etcd2005 is OK: OK: Status of the systemd unit etcd-backup https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:40:48] 06SRE, 06Traffic, 07Wikimedia-production-error: 503 error when edit large size pages - https://phabricator.wikimedia.org/T385395#10522294 (10Ahonc) > tracert text-lb.drmrs.wikimedia.org. Tracing route to text-lb.drmrs.wikimedia.org [185.15.58.224] over a maximum of 30 hops: 1 2 ms 1 ms 3 ms... 
[18:42:12] !log mw-api-int to ~ 2% of traffic on PHP 8.1 - T383845 [18:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:15] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [18:42:22] RECOVERY - Check unit status of etcd-backup on aux-k8s-etcd2003 is OK: OK: Status of the systemd unit etcd-backup https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:51:24] PROBLEM - Check unit status of etcd-backup on aux-k8s-etcd2005 is CRITICAL: CRITICAL: Status of the systemd unit etcd-backup https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:51:43] 06SRE, 06Traffic, 07Wikimedia-production-error: 503 error when edit large size pages - https://phabricator.wikimedia.org/T385395#10522330 (10ssingh) Thanks for sharing. It is interesting that your connection to drmrs and google is fairly what you would expect but to dyna (text-lb.esams), the second hop laten... [18:53:33] (03CR) 10Urbanecm: Babel: Remove config that is now in community configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115518 (https://phabricator.wikimedia.org/T385239) (owner: 10Urbanecm) [18:54:55] FIRING: SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:00:05] jnuche and jeena: Time to snap out of that daydream and deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250204T1900). 
[19:03:02] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:03:14] PROBLEM - BFD status on cr2-magru is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:03:16] PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:03:16] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:03:22] PROBLEM - Check unit status of etcd-backup on aux-k8s-etcd2003 is CRITICAL: CRITICAL: Status of the systemd unit etcd-backup https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:04:55] FIRING: [2x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:07:24] PROBLEM - Check unit status of etcd-backup on aux-k8s-etcd2004 is CRITICAL: CRITICAL: Status of the systemd unit etcd-backup https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:09:55] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:13:10] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10522426 (10Neobeta61) @elukey seems like we got to the response you needed on the other tickets. 
[19:25:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10522474 (10phaultfinder) [19:25:40] RECOVERY - Host ms-be2075 is UP: PING OK - Packet loss = 0%, RTA = 30.26 ms [19:26:03] (03CR) 10Scott French: "I was chatting with @cwhite@wikimedia.org about why this produces no diffs for the `prometheus` standalone statsd exporter releases in the" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105972 (https://phabricator.wikimedia.org/T359497) (owner: 10Cwhite) [19:33:38] 06SRE, 06Traffic, 07Wikimedia-production-error: 503 error when edit large size pages - https://phabricator.wikimedia.org/T385395#10522506 (10Ahonc) These pages I cannot edit: https://w.wiki/CvKT , https://w.wiki/BNFa , but this https://w.wiki/CwnG I can edit (I tried 3 times and all three it was saved) and t... [19:37:04] (03Abandoned) 10Jcrespo: Revert "dbbackups: Pause s3/db1240 snapshots until load completes" [puppet] - 10https://gerrit.wikimedia.org/r/1047059 (owner: 10Jcrespo) [19:40:02] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:40:18] RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:42:14] RECOVERY - BFD status on cr2-magru is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:42:16] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:49:39] (03PS1) 10Jforrester: wikifunctions: Upgrade function-orchestrator RAM request, given heap issues [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117243 (https://phabricator.wikimedia.org/T384883) [19:52:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T384592)', diff saved to 
https://phabricator.wikimedia.org/P73187 and previous config saved to /var/cache/conftool/dbconfig/20250204-195211-marostegui.json [19:52:15] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [19:53:12] (03PS1) 10Fabfur: external_cloud_vendors: Added OpenAI IP lists [puppet] - 10https://gerrit.wikimedia.org/r/1117245 (https://phabricator.wikimedia.org/T385616) [19:59:05] !log disabled puppet on A:cp-text before merging https://gerrit.wikimedia.org/r/1084247 [19:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:29] (03CR) 10Scott French: "Thank you both for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/1084247 (owner: 10Scott French) [19:59:32] (03CR) 10Scott French: [C:03+2] gateway-check: fix invalid config handling [puppet] - 10https://gerrit.wikimedia.org/r/1084247 (owner: 10Scott French) [19:59:39] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10522667 (10Dwisehaupt) @VRiley-WMF frnetmon1001 doesn't require 10G support, so if it only has 1G connections that is fine. We'd just like to make su... 
[20:07:18] !log verified behavior of https://gerrit.wikimedia.org/r/1084247 on cp4040 [20:07:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P73188 and previous config saved to /var/cache/conftool/dbconfig/20250204-200718-marostegui.json [20:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:51] !log running puppet on A:cp-text after merging https://gerrit.wikimedia.org/r/1084247 [20:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:50] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3572 MB (3% inode=98%): /tmp 3572 MB (3% inode=98%): /var/tmp 3572 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [20:22:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P73189 and previous config saved to /var/cache/conftool/dbconfig/20250204-202225-marostegui.json [20:36:32] !log finished running puppet on A:cp-text after merging https://gerrit.wikimedia.org/r/1084247 [20:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T384592)', diff saved to https://phabricator.wikimedia.org/P73190 and previous config saved to /var/cache/conftool/dbconfig/20250204-203732-marostegui.json [20:37:35] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [20:37:48] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1244.eqiad.wmnet with reason: Maintenance [20:37:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1244 (T384592)', diff 
saved to https://phabricator.wikimedia.org/P73191 and previous config saved to /var/cache/conftool/dbconfig/20250204-203754-marostegui.json [20:48:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10522873 (10Papaul) @Dwisehaupt both servers are now on 10G on the switch side ` xe-0/0/34 up down pay-lb1001 {#230304500172} xe-0/0/33... [20:51:55] (03PS8) 10BCornwall: conftool: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 [20:53:14] (03CR) 10CI reject: [V:04-1] conftool: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [20:55:23] (03PS9) 10BCornwall: conftool: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 [20:56:41] (03CR) 10CI reject: [V:04-1] conftool: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [20:56:44] Oh, if it's just me for deployment, I'll handle it myself, no need for someone else to waste their time. [20:57:34] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10522889 (10Papaul) a:03Dwisehaupt [20:57:41] (03PS1) 10C. Scott Ananian: Parsoid fragment support: fix handling of 'nowiki' and 'general' strip markers [core] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1117251 [20:58:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1117251 (owner: 10C. 
Scott Ananian) [20:58:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117228 (owner: 10Jforrester) [20:59:17] (03Merged) 10jenkins-bot: [wikifunctionswiki] Set flags for repo mode (on) and client (off) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117228 (owner: 10Jforrester) [20:59:44] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1117228|[wikifunctionswiki] Set flags for repo mode (on) and client (off)]] [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250204T2100). [21:00:05] James_F and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:14] James_F: any chance you can deploy my patch as well? [21:00:14] Aha, cscott got there. [21:00:19] Fiiiine. :-) [21:00:36] (03CR) 10Jforrester: [C:03+2] Parsoid fragment support: fix handling of 'nowiki' and 'general' strip markers [core] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1117251 (owner: 10C. Scott Ananian) [21:00:37] it's fragment support for wikifunctions, so it's in your self-interest ;-p [21:00:44] Yes, I know. :-) [21:01:13] (03CR) 10CDanis: [C:03+2] varnish: x-analytics: Authorization header summary [puppet] - 10https://gerrit.wikimedia.org/r/1111695 (owner: 10CDanis) [21:01:57] cscott: Whilst I'm waiting, should I merge https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1115139 ? 
[21:02:53] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1117228|[wikifunctionswiki] Set flags for repo mode (on) and client (off)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:03:08] no, we're going to wait until the wmf.15 train rolls to do some final testing before merging 1115139 (i just updated gerrit with this info) [21:03:11] !log jforrester@deploy2002 jforrester: Continuing with sync [21:03:35] apparently subbu and arlo landed some additional fixes for that template expansion mode this week, and we want to make sure they are effective before fully committing to it. [21:04:42] Ack. [21:04:58] * James_F twiddles thumbs some more, then. [21:08:06] (03PS10) 10BCornwall: conftool: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 [21:09:40] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1117228|[wikifunctionswiki] Set flags for repo mode (on) and client (off)]] (duration: 09m 56s) [21:09:43] Finally. [21:10:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1117251 (owner: 10C. Scott Ananian) [21:10:31] (03PS2) 10Jforrester: ExtensionDistributor: reach graphite via the service mesh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075017 (https://phabricator.wikimedia.org/T374887) (owner: 10Giuseppe Lavagetto) [21:11:08] (03PS3) 10Jforrester: ExtensionDistributor: reach graphite via the service mesh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075017 (https://phabricator.wikimedia.org/T374887) (owner: 10Giuseppe Lavagetto) [21:11:58] it sure would be nice if CI were a little bit faster [21:12:12] (03Merged) 10jenkins-bot: Parsoid fragment support: fix handling of 'nowiki' and 'general' strip markers [core] (wmf/1.44.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1117251 (owner: 10C. Scott Ananian) [21:12:16] It is.
[21:12:33] 10 minutes is reasonable for core. [21:12:40] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1117251|Parsoid fragment support: fix handling of 'nowiki' and 'general' strip markers]] [21:12:42] The 10 minutes for a trivial deploy, less so. [21:13:06] the 10 becomes 20 if selenium flakes out, which happened to me last deploy. [21:13:27] Not any more. I killed selenium from wmf branches. [21:13:34] <3 [21:13:48] As of last Wednesday: https://gerrit.wikimedia.org/r/c/integration/config/+/1109435 [21:14:06] (03CR) 10Jforrester: [C:03+1] ExtensionDistributor: reach graphite via the service mesh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075017 (https://phabricator.wikimedia.org/T374887) (owner: 10Giuseppe Lavagetto) [21:14:28] it was 18 minutes each run, the time in question, and then another 18 minutes because selenium sucked, so almost 40 minutes of the hour long deploy window for a single patch. [21:14:38] 10 minutes is much better, agreed. [21:14:50] I mean, I'd prefer it to be 5 minutes long. [21:14:52] But… [21:15:56] yep, could always be faster. we were chatting last week about prestaging commits so the CI would all run *before* the deploy window, and the actual deployers wouldn't have to wait for it. [21:17:29] (03CR) 10Jforrester: "Presumably this is still needed?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026982 (https://phabricator.wikimedia.org/T348526) (owner: 10Hoo man) [21:17:40] !log jforrester@deploy2002 cscott, jforrester: Backport for [[gerrit:1117251|Parsoid fragment support: fix handling of 'nowiki' and 'general' strip markers]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:18:25] ok, let me poke at group0 and see if i can break anything [21:18:30] Ta. 
[21:19:15] (03Abandoned) 10Jforrester: Variant configuration: Replace symfony/yaml with spyc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554967 (owner: 10Jforrester) [21:19:19] (03Abandoned) 10Jforrester: Variant configuration: Read and write variant config from conf-dir, not /tmp [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554977 (owner: 10Jforrester) [21:19:25] (03Abandoned) 10Jforrester: Make it possible to load site config from InitialiseSettings.json as well as .php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553209 (owner: 10Jforrester) [21:19:37] (03Abandoned) 10Jforrester: Variant configuration: Move some all-wiki configuration from CS to all.yaml [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539436 (owner: 10Jforrester) [21:19:42] (03Abandoned) 10Jforrester: CommonSettings: Switch from getMWConfigForCacheing to getCachableMWConfig to avoid wgConf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538342 (owner: 10Jforrester) [21:19:47] (03Abandoned) 10Jforrester: Drop getMWConfigForCacheing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538343 (owner: 10Jforrester) [21:20:56] (03PS2) 10Jforrester: Drop bare-metal servers from Wikimedia Debug tool config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051343 (https://phabricator.wikimedia.org/T367949) [21:21:19] (03CR) 10Jforrester: "PS2: Just a rebase. Still waiting on resolution of T324003." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051343 (https://phabricator.wikimedia.org/T367949) (owner: 10Jforrester) [21:22:46] James_F: looks good [21:22:50] !log jforrester@deploy2002 cscott, jforrester: Continuing with sync [21:22:52] Cool. 
[21:23:07] (03PS2) 10Jforrester: Drop old wikifunctions.ui event stream, replaced by ….wikifunctions_ui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070247 (https://phabricator.wikimedia.org/T369949) [21:26:18] (03PS1) 10Ladsgroup: Set file migration to write both everywhere except commons and enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117260 (https://phabricator.wikimedia.org/T384481) [21:26:20] (03PS1) 10Ladsgroup: Set categorylinks to write both in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117261 (https://phabricator.wikimedia.org/T385164) [21:26:38] Amir1: Did you want any of ^ deployed now? [21:27:02] Actually all of them (and portals too) but I can wait [21:27:05] (03CR) 10CI reject: [V:04-1] Set categorylinks to write both in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117261 (https://phabricator.wikimedia.org/T385164) (owner: 10Ladsgroup) [21:27:06] Ha. [21:27:07] (03PS2) 10BCornwall: varnish: Enable single_backend by default [puppet] - 10https://gerrit.wikimedia.org/r/1115086 [21:27:07] (03PS11) 10BCornwall: conftool: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 [21:27:13] https://integration.wikimedia.org/ci/view/All/job/wikimedia-portals-build/494/ [21:27:19] OK, but first I really should get out a patch I forgot about for six months. [21:27:30] Ack. [21:28:50] (03PS1) 10Ladsgroup: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117262 [21:29:01] sure, I have no rush, seriously [21:29:20] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1117251|Parsoid fragment support: fix handling of 'nowiki' and 'general' strip markers]] (duration: 16m 39s) [21:29:24] Sync-apaches is so fast now we have almost none in prod. [21:29:28] thank you so much James_F ! [21:29:45] cscott: A pleasure as always. Happy hacking. 
[21:29:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070247 (https://phabricator.wikimedia.org/T369949) (owner: 10Jforrester) [21:30:27] (03Merged) 10jenkins-bot: Drop old wikifunctions.ui event stream, replaced by ….wikifunctions_ui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070247 (https://phabricator.wikimedia.org/T369949) (owner: 10Jforrester) [21:30:41] (03CR) 10BCornwall: "I set it to `false` since it would otherwise fail due to not being (codfw|drmrs) and the default was set to `true` in the related commit c" [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [21:30:57] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1070247|Drop old wikifunctions.ui event stream, replaced by ….wikifunctions_ui (T369949)]] [21:31:00] T369949: WikiLambda metrics: Remove the wikifunctions.ui stream - https://phabricator.wikimedia.org/T369949 [21:32:07] (03PS12) 10BCornwall: conftool: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 [21:36:06] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1070247|Drop old wikifunctions.ui event stream, replaced by ….wikifunctions_ui (T369949)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:36:08] T369949: WikiLambda metrics: Remove the wikifunctions.ui stream - https://phabricator.wikimedia.org/T369949 [21:42:19] !log jforrester@deploy2002 jforrester: Continuing with sync [21:44:59] Amir1: Over to you, once this scap finishes. [21:45:30] awesome. Thank you!
[21:46:11] (03PS15) 10BCornwall: conftool: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 [21:48:41] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1070247|Drop old wikifunctions.ui event stream, replaced by ….wikifunctions_ui (T369949)]] (duration: 17m 43s) [21:48:45] T369949: WikiLambda metrics: Remove the wikifunctions.ui stream - https://phabricator.wikimedia.org/T369949 [21:49:37] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4927/console" [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [21:49:42] (03CR) 10BCornwall: "Oh, I had my logic wrong. Sorry, you're right 😊. We don't need to set anything at all because it's set to `true` by default anyway." [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [21:49:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117260 (https://phabricator.wikimedia.org/T384481) (owner: 10Ladsgroup) [21:50:28] (03Merged) 10jenkins-bot: Set file migration to write both everywhere except commons and enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117260 (https://phabricator.wikimedia.org/T384481) (owner: 10Ladsgroup) [21:50:56] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1117260|Set file migration to write both everywhere except commons and enwiki (T384481)]] [21:50:59] T384481: Set new file tables to write both in production - https://phabricator.wikimedia.org/T384481 [21:53:50] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1117260|Set file migration to write both everywhere except commons and enwiki (T384481)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:55:35] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [22:00:05] Deploy 
window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250204T2200) [22:01:58] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1117260|Set file migration to write both everywhere except commons and enwiki (T384481)]] (duration: 11m 01s) [22:02:01] T384481: Set new file tables to write both in production - https://phabricator.wikimedia.org/T384481 [22:03:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117261 (https://phabricator.wikimedia.org/T385164) (owner: 10Ladsgroup) [22:03:54] (03CR) 10CI reject: [V:04-1] Set categorylinks to write both in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117261 (https://phabricator.wikimedia.org/T385164) (owner: 10Ladsgroup) [22:04:09] (03PS3) 10Ladsgroup: Set categorylinks to write both in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117261 (https://phabricator.wikimedia.org/T385164) [22:04:21] (03CR) 10Ladsgroup: [C:03+2] Set categorylinks to write both in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117261 (https://phabricator.wikimedia.org/T385164) (owner: 10Ladsgroup) [22:04:38] (03CR) 10TrainBranchBot: "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117261 (https://phabricator.wikimedia.org/T385164) (owner: 10Ladsgroup) [22:05:08] (03Merged) 10jenkins-bot: Set categorylinks to write both in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117261 (https://phabricator.wikimedia.org/T385164) (owner: 10Ladsgroup) [22:05:35] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1117261|Set categorylinks to write both in group0 (T385164)]] [22:05:37] T385164: Set categorylinks to write both - https://phabricator.wikimedia.org/T385164 [22:09:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki 
mw-jobrunner/migration at codfw: 6.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=migration - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:10:15] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1117261|Set categorylinks to write both in group0 (T385164)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:12:42] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [22:14:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/migration at codfw: 6.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=migration - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:18:55] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1117261|Set categorylinks to write both in group0 (T385164)]] (duration: 13m 20s) [22:18:58] T385164: Set categorylinks to write both - https://phabricator.wikimedia.org/T385164 [22:20:48] (03CR) 10Ladsgroup: [C:03+2] Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117262 (owner: 10Ladsgroup) [22:22:29] (03Merged) 10jenkins-bot: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117262 (owner: 10Ladsgroup) [22:26:26] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:32:37] !log ladsgroup@deploy2002 Synchronized portals/wikipedia.org/assets: Bump portals to HEAD (T368221 T373204) (duration: 09m 30s) [22:32:41] T368221: Dark mode for Wikimedia portals (e.g. 
www.wikipedia.org) - https://phabricator.wikimedia.org/T368221 [22:32:41] T373204: Wikimedia.org page redesign - https://phabricator.wikimedia.org/T373204 [22:34:50] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3446 MB (3% inode=98%): /tmp 3446 MB (3% inode=98%): /var/tmp 3446 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [22:35:20] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 568MiB (3% inode=35%): /tmp 568MiB (3% inode=35%): /var/tmp 568MiB (3% inode=35%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [22:35:50] !log ladsgroup@deploy2002 Synchronized portals: Bump portals to HEAD (duration: 03m 12s) [22:37:38] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1222.eqiad.wmnet with reason: Maintenance [22:37:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1222 (T371742)', diff saved to https://phabricator.wikimedia.org/P73192 and previous config saved to /var/cache/conftool/dbconfig/20250204-223744-ladsgroup.json [22:37:47] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [22:49:14] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4929/console" [puppet] - 10https://gerrit.wikimedia.org/r/1115086 (owner: 10BCornwall) [22:55:20] RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [23:09:55] FIRING: [3x] 
SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:27:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T371742)', diff saved to https://phabricator.wikimedia.org/P73193 and previous config saved to /var/cache/conftool/dbconfig/20250204-232748-ladsgroup.json [23:27:53] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [23:34:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1257:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1257 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:42:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P73194 and previous config saved to /var/cache/conftool/dbconfig/20250204-234255-ladsgroup.json [23:49:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1257:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1257 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:58:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P73195 and previous config saved to /var/cache/conftool/dbconfig/20250204-235802-ladsgroup.json