[00:08:22] just got one in #wikipedia-en-help a few minutes ago, appears to be a DNS issue
[00:26:40] (CR) Ssingh: haproxykafka: profile and hiera files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: Fabfur)
[00:31:02] (PS5) Scott French: httpd: introduce -bookworm track and cascade [docker-images/production-images] - https://gerrit.wikimedia.org/r/1081989 (https://phabricator.wikimedia.org/T378128)
[00:31:49] (CR) Ssingh: [C:+1] varnish: Move wm_recv_purge subroutine to inline [puppet] - https://gerrit.wikimedia.org/r/1083914 (https://phabricator.wikimedia.org/T370200) (owner: BCornwall)
[00:38:20] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1083941
[00:38:21] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1083941 (owner: TrainBranchBot)
[00:53:10] (CR) Scott French: "Thanks, Luca!"
[docker-images/production-images] - https://gerrit.wikimedia.org/r/1081989 (https://phabricator.wikimedia.org/T378128) (owner: Scott French)
[01:05:24] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1083941 (owner: TrainBranchBot)
[01:08:27] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1083946
[01:08:27] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1083946 (owner: TrainBranchBot)
[01:16:31] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:37:40] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1083946 (owner: TrainBranchBot)
[01:57:04] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/be981033c038d5376b5e26d82569701eda16e6238f828ae73805db427c9255f3/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241029T0200)
[02:01:32] (PS5) Pppery: Redirect to wikis using subpages rather than namespaces too [mediawiki-config] - https://gerrit.wikimedia.org/r/1082853 (https://phabricator.wikimedia.org/T376923)
[02:08:15] (PS1) TrainBranchBot: Branch commit for wmf/1.44.0-wmf.1 [core] (wmf/1.44.0-wmf.1) - https://gerrit.wikimedia.org/r/1083951
(https://phabricator.wikimedia.org/T375660)
[02:08:17] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.44.0-wmf.1 [core] (wmf/1.44.0-wmf.1) - https://gerrit.wikimedia.org/r/1083951 (https://phabricator.wikimedia.org/T375660) (owner: TrainBranchBot)
[02:15:56] (PS1) Hamish: annwiki: Add logo [mediawiki-config] - https://gerrit.wikimedia.org/r/1083952 (https://phabricator.wikimedia.org/T377535)
[02:17:04] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:37:16] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:38:54] (Merged) jenkins-bot: Branch commit for wmf/1.44.0-wmf.1 [core] (wmf/1.44.0-wmf.1) - https://gerrit.wikimedia.org/r/1083951 (https://phabricator.wikimedia.org/T375660) (owner: TrainBranchBot)
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241029T0300)
[03:02:16] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:02:18] (PS1) TrainBranchBot: testwikis to 1.44.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1083956 (https://phabricator.wikimedia.org/T375660)
[03:02:20] (CR) TrainBranchBot: [C:+2] testwikis to 1.44.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1083956
(https://phabricator.wikimedia.org/T375660) (owner: TrainBranchBot)
[03:03:05] (Merged) jenkins-bot: testwikis to 1.44.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1083956 (https://phabricator.wikimedia.org/T375660) (owner: TrainBranchBot)
[03:03:30] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.1 refs T375660
[03:04:04] T375660: 1.44.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T375660
[03:10:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[03:15:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[03:46:26] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:53:21] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.44.0-wmf.1 refs T375660 (duration: 49m 51s)
[03:53:26] T375660: 1.44.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T375660
[04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241029T0400)
[04:01:05] !log mwpresync@deploy2002 Pruned MediaWiki: 1.43.0-wmf.26 (duration: 01m 04s)
[04:02:02] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Idle - HE, AS6939/IPv4: Idle - HE
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:03:52] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:13:52] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:01:52] (PS1) Reedy: SpecialGadgets: Replace deprecated SkinFactory::getSkinNames() call [extensions/Gadgets] (wmf/1.44.0-wmf.1) - https://gerrit.wikimedia.org/r/1083962 (https://phabricator.wikimedia.org/T377521)
[05:02:15] (PS1) Reedy: ChangeSkinPref: Replace deprecated SkinFactory::getSkinNames() call [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.1) - https://gerrit.wikimedia.org/r/1083963 (https://phabricator.wikimedia.org/T377521)
[05:16:31] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241029T0600)
[06:00:05] marostegui, Amir1, and arnaudb: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241029T0600).
[06:26:14] PROBLEM - Host an-worker1165 is DOWN: PING CRITICAL - Packet loss = 100%
[07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:07:00] (PS1) Kosta Harlan: StatsLib: Set label for wiki ID [extensions/WikimediaEvents] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1083967 (https://phabricator.wikimedia.org/T375496)
[07:07:11] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 29 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikimediaEvents] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1083967 (https://phabricator.wikimedia.org/T375496) (owner: Kosta Harlan)
[07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:15:51] (CR) CI reject: [V:-1] StatsLib: Set label for wiki ID [extensions/WikimediaEvents] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1083967 (https://phabricator.wikimedia.org/T375496) (owner: Kosta Harlan)
[07:31:18] (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1083835 (https://phabricator.wikimedia.org/T378344) (owner: Slyngshede)
[07:31:34] (CR) Slyngshede: [C:+2] Show currently signed in username.
[software/bitu] - https://gerrit.wikimedia.org/r/1083835 (https://phabricator.wikimedia.org/T378344) (owner: Slyngshede)
[07:34:00] (PS1) Muehlenhoff: Remove ircstream-ssh CNAME [dns] - https://gerrit.wikimedia.org/r/1083971 (https://phabricator.wikimedia.org/T376014)
[07:34:09] (Merged) jenkins-bot: Show currently signed in username. [software/bitu] - https://gerrit.wikimedia.org/r/1083835 (https://phabricator.wikimedia.org/T378344) (owner: Slyngshede)
[07:34:09] (PS1) Clare Ming: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1083972 (https://phabricator.wikimedia.org/T373695)
[07:36:27] (PS1) Clare Ming: Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - https://gerrit.wikimedia.org/r/1083973 (https://phabricator.wikimedia.org/T373695)
[07:37:30] (CR) Slyngshede: "LGTM" [dns] - https://gerrit.wikimedia.org/r/1083971 (https://phabricator.wikimedia.org/T376014) (owner: Muehlenhoff)
[07:37:36] (CR) Slyngshede: [C:+1] Remove ircstream-ssh CNAME [dns] - https://gerrit.wikimedia.org/r/1083971 (https://phabricator.wikimedia.org/T376014) (owner: Muehlenhoff)
[07:47:02] (CR) Muehlenhoff: [C:+2] Remove ircstream-ssh CNAME [dns] - https://gerrit.wikimedia.org/r/1083971 (https://phabricator.wikimedia.org/T376014) (owner: Muehlenhoff)
[07:53:10] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts irc1004.wikimedia.org
[07:56:32] (CR) Ayounsi: [C:+1] librenms: Remove rsa-2048 certs from Apache config [puppet] - https://gerrit.wikimedia.org/r/1075616 (https://phabricator.wikimedia.org/T375569) (owner: BCornwall)
[07:58:01] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[07:58:21] (PS12) Fabfur: haproxykafka: haproxykafka module [puppet] - https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128)
[07:58:21] (PS12) Fabfur: haproxykafka: profile and
hiera files [puppet] - https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128)
[07:59:43] (CR) Stevemunene: [C:+2] Add new presto hosts to presto cluster [puppet] - https://gerrit.wikimedia.org/r/1083756 (https://phabricator.wikimedia.org/T374924) (owner: Stevemunene)
[08:00:04] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241029T0800).
[08:00:05] kostajh and anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:10] hello
[08:00:59] (CR) Kosta Harlan: "check phan" [extensions/WikimediaEvents] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1083967 (https://phabricator.wikimedia.org/T375496) (owner: Kosta Harlan)
[08:01:06] I'm not sure what's up with the phan failure
[08:01:18] (CR) Ayounsi: [C:+1] "As this is supposed to only be for additional testing for a short period of time, and the provision cookbook is rarely ran more than once " [cookbooks] - https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) (owner: Ayounsi)
[08:01:26] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: Fabfur)
[08:01:52] I can deploy
[08:03:34] !log installing qemu security updates
[08:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:09] (CR) Madgregory: "recheck" [dns] - https://gerrit.wikimedia.org/r/994314 (https://phabricator.wikimedia.org/T355776) (owner: Dzahn)
[08:04:13] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: irc1004.wikimedia.org decommissioned, removing all IPs
except the asset tag one - jmm@cumin2002"
[08:04:39] (PS13) Fabfur: haproxykafka: profile and hiera files [puppet] - https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128)
[08:06:40] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: Fabfur)
[08:06:42] !log arnaudb@cumin1002 START - Cookbook sre.mysql.pool db1169 gradually with 4 steps - index rebuilt
[08:06:47] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db1169 gradually with 4 steps - index rebuilt
[08:06:59] hashar: do you have any ideas about the phan failure for https://gerrit.wikimedia.org/r/1083967 ?
[08:07:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: irc1004.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:07:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:07:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts irc1004.wikimedia.org
[08:07:28] SRE, SRE-Unowned, Infrastructure-Foundations, Patch-For-Review: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014#10271081 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin20...
[08:08:07] !log arnaudb@cumin1002 START - Cookbook sre.mysql.pool db1169 gradually with 4 steps - index rebuilt
[08:08:12] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db1169 gradually with 4 steps - index rebuilt
[08:08:22] SRE, Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10271097 (MoritzMuehlenhoff)
[08:08:50] !log arnaudb@cumin1002 START - Cookbook sre.mysql.pool db1169 quickly with 2 steps - index rebuilt
[08:08:55] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db1169 quickly with 2 steps - index rebuilt
[08:08:59] SRE, Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10271100 (MoritzMuehlenhoff)
[08:09:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: T378320', diff saved to https://phabricator.wikimedia.org/P70599 and previous config saved to /var/cache/conftool/dbconfig/20241029-080951-arnaudb.json
[08:09:56] T378320: db1169 replication broken - enwiki.pagelinks corruption - https://phabricator.wikimedia.org/T378320
[08:09:58] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts irc2004.wikimedia.org
[08:12:52] (CR) Kosta Harlan: "recheck" [extensions/WikimediaEvents] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1083967 (https://phabricator.wikimedia.org/T375496) (owner: Kosta Harlan)
[08:13:40] (PS14) Fabfur: haproxykafka: profile and hiera files [puppet] - https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128)
[08:14:06] anzx: I ran the purgeList.php for the URLs you requested
[08:14:18] kostajh: thank you
[08:15:09] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:15:19] (PS1) Muehlenhoff: Remove support for using ircstream with eventstream [puppet] - https://gerrit.wikimedia.org/r/1084026 (https://phabricator.wikimedia.org/T376014)
[08:15:40] (CR)
Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1084026 (https://phabricator.wikimedia.org/T376014) (owner: Muehlenhoff)
[08:17:53] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: Fabfur)
[08:18:00] hi :)
[08:18:18] kostajh: I don't know much about Phan / nowadays PHP magic
[08:19:05] I found the problem
[08:19:13] it is probably a legit failure
[08:19:15] ah good
[08:19:38] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1082471 didn't make it to wmf.28
[08:19:58] (PS15) Fabfur: haproxykafka: profile and hiera files [puppet] - https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128)
[08:21:14] so my plan is to add a phan suppression for the line
[08:21:24] we really need to gate everything together
[08:21:25] hashar: can I use `git review` to push to a cherry-picked patch in gerrit?
[08:21:32] or at least have a shared phan job for everything that is deployed
[08:21:43] git checkout wmf/x.y.z
[08:21:48] git-review -x 123456
[08:21:51] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: irc2004.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:21:59] well you don't even need the checkout
[08:22:22] but yeah `git-review -x` fetches the latest patchset of the given change number and cherry pick it
[08:22:23] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: Fabfur)
[08:22:31] (when `git-review -d` does a checkout)
[08:22:34] I did `git review -d 1083967` and added a phan suppression locally
[08:22:38] now I need to push the change back up
[08:23:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered
by cookbooks.sre.dns.netbox: irc2004.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:23:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:23:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts irc2004.wikimedia.org
[08:23:17] SRE, SRE-Unowned, Infrastructure-Foundations, Patch-For-Review: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014#10271169 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin20...
[08:23:28] don't you need to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1082471 ?
[08:23:30] `git review` fails because I guess it thinks I'm pushing to the Change-Id (which is closed)
[08:24:02] hashar: I could do that, I guess that is easiest
[08:24:14] (CR) Arnaudb: [C:+2] mariadb: add db2223 [puppet] - https://gerrit.wikimedia.org/r/1083813 (https://phabricator.wikimedia.org/T374951) (owner: Arnaudb)
[08:24:21] (PS1) Kosta Harlan: Unblock CI [extensions/WikimediaEvents] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1084027 (https://phabricator.wikimedia.org/T377947)
[08:24:31] cause if you ignore phan, that is like hiding the problem under the carpet :-D
[08:24:41] heh
[08:24:45] alright
[08:24:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: T378320', diff saved to https://phabricator.wikimedia.org/P70600 and previous config saved to /var/cache/conftool/dbconfig/20241029-082456-arnaudb.json
[08:25:02] T378320: db1169 replication broken - enwiki.pagelinks corruption - https://phabricator.wikimedia.org/T378320
[08:25:02] I'll backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1084027
[08:25:12] Hey there
[08:25:28] can I sneak in one quick beta config-change?
[08:25:34] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1083836
[08:25:51] MichaelG_WMF: yes just do it now
[08:25:52] :]
[08:26:01] MichaelG_WMF: sure
[08:26:08] MichaelG_WMF: can you add it to the calendar?
[08:26:12] I can't, I'm not a deployer 🥺
[08:26:19] add to the calendar - yes
[08:26:26] ah we need to train you up!
[08:26:31] I'll merge it for you
[08:26:49] thanks
[08:27:02] need to add it manually to the calendar because the window already started ...
[08:27:28] (PS2) Hashar: StatsLib: Set label for wiki ID [extensions/WikimediaEvents] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1083967 (https://phabricator.wikimedia.org/T375496) (owner: Kosta Harlan)
[08:27:41] kostajh: I have added a Depends-On, we will see how Phan behave
[08:28:12] hashar: won't depends-on not work, since they are both in the same repo?
[08:28:23] ah
[08:28:24] I planned to sync the "Unblock CI" one and then the StatsLib one
[08:28:31] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2211.codfw.wmnet with reason: provisionning db2223.codfw.wmnet - T373579
[08:28:36] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579
[08:28:41] yeah so I guess they could have been chained instead
[08:28:44] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 28306
[08:28:45] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2211.codfw.wmnet with reason: provisionning db2223.codfw.wmnet - T373579
[08:28:48] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2223.codfw.wmnet with reason: provisionning db2223.codfw.wmnet - T373579
[08:28:53] but the depends-on should work
[08:28:58] or maybe I can do both
[08:29:02] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2223.codfw.wmnet with reason: provisionning db2223.codfw.wmnet - T373579
[08:29:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'depool preshot db2211', diff saved to https://phabricator.wikimedia.org/P70601 and previous config saved to /var/cache/conftool/dbconfig/20241029-082903-arnaudb.json
[08:29:05] !log arnaudb@cumin1002 START - Cookbook sre.mysql.depool db2211 - depooling db2211 to clone on db2223
[08:29:10] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2211 - depooling db2211 to clone on db2223
[08:29:13] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 28306
[08:29:14] https://integration.wikimedia.org/ci/job/mwext-php74-phan/50614/console SUCCESS
[08:29:22] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1084027 (https://phabricator.wikimedia.org/T377947) (owner: Kosta Harlan)
[08:29:23] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1083967 (https://phabricator.wikimedia.org/T375496) (owner: Kosta Harlan)
[08:29:28] let's see :)
[08:29:46] Added it to the calendar ✅
[08:29:47] hashar: I don't know how to chain cherry-picks in gerrit :/
[08:30:03] so the `Depends-On` do it for you
[08:30:13] for chaining cherry-picks, you do it locally
[08:30:18] git checkout wmf/x.y.z
[08:30:24] then fetch the changes you want to chain:
[08:30:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2211 in db2223 for T373579', diff saved to https://phabricator.wikimedia.org/P70602 and previous config saved to /var/cache/conftool/dbconfig/20241029-083035-arnaudb.json
[08:30:37] git-review -x 1234
[08:30:37] git-review -x 2199
[08:30:42] it's the pushing back to gerrit part I don't know about
[08:30:55] (PS16) Fabfur: haproxykafka: profile and hiera files [puppet] -
https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128)
[08:31:03] then push that series to refs/for/wmf/x.y.Z
[08:31:21] or well just use `git-review` which would push your local branch to the magic refs/for/* ref
[08:31:46] upon receiving the branch update and the 2..n commits, Gerrit will create changes or add patchsets to existing changes
[08:31:51] hashar: depends-on seemed to work https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1083967?checksRunsSelected=test-wmf&tab=checks
[08:31:51] and you end up with a chain of changes
[08:32:12] hashar: `git review` doesn't work with cherry-pick for me, it seems to think I am pushing to the closed patch (what is referenced in Change-Id)
[08:32:29] git-review -v -n
[08:32:37] -v for verbose, -n for dry-run
[08:32:54] that would give you all the underlying magic commands. Most probably your local branch points to upstream origin/master
[08:33:02] in which the change already got merged and thus Gerrit refuses it
[08:33:33] (PS1) Volans: sre.mysql.pool: fix check for spurious changes [cookbooks] - https://gerrit.wikimedia.org/r/1084029
[08:33:36] if your local wmf branch pointed to origin/wmf/x.y.z , then git-review would try to push to `refs/for/wmf/x.y.z`
[08:33:39] something like that
[08:34:11] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: Fabfur)
[08:34:35] (CR) Arnaudb: [C:+1] "thanks!"
[cookbooks] - https://gerrit.wikimedia.org/r/1084029 (owner: Volans)
[08:38:12] (Merged) jenkins-bot: Unblock CI [extensions/WikimediaEvents] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1084027 (https://phabricator.wikimedia.org/T377947) (owner: Kosta Harlan)
[08:38:15] alright, thx
[08:39:05] (PS17) Fabfur: haproxykafka: profile and hiera files [puppet] - https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128)
[08:39:12] (Merged) jenkins-bot: StatsLib: Set label for wiki ID [extensions/WikimediaEvents] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1083967 (https://phabricator.wikimedia.org/T375496) (owner: Kosta Harlan)
[08:39:48] (PS42) Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T377129)
[08:39:59] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1084027|Unblock CI (T377947)]], [[gerrit:1083967|StatsLib: Set label for wiki ID (T375496)]]
[08:40:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: T378320', diff saved to https://phabricator.wikimedia.org/P70603 and previous config saved to /var/cache/conftool/dbconfig/20241029-084002-arnaudb.json
[08:40:05] T377947: WikimediaEvents CI blocked with phan error - https://phabricator.wikimedia.org/T377947
[08:40:06] T375496: Temp accounts Grafana Dashboard: Edit rate for anonymous IP editors, named accounts, and temp accounts - https://phabricator.wikimedia.org/T375496
[08:40:11] T378320: db1169 replication broken - enwiki.pagelinks corruption - https://phabricator.wikimedia.org/T378320
[08:41:01] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 16347
[08:41:19] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2211.codfw.wmnet onto db2223.codfw.wmnet
[08:41:20] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2211.codfw.wmnet
onto db2223.codfw.wmnet
[08:41:33] (CR) Volans: [C:+2] sre.mysql.pool: fix check for spurious changes [cookbooks] - https://gerrit.wikimedia.org/r/1084029 (owner: Volans)
[08:41:37] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 16347
[08:41:41] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 9038
[08:42:01] (PS1) Brouberol: Release new mesh.service module version [deployment-charts] - https://gerrit.wikimedia.org/r/1084030 (https://phabricator.wikimedia.org/T378377)
[08:42:03] (PS1) Brouberol: mesh.service: introduce a way to further specify the service label selectors [deployment-charts] - https://gerrit.wikimedia.org/r/1084031 (https://phabricator.wikimedia.org/T378377)
[08:42:04] (PS1) Brouberol: airflow: add the 'component: webserver' to the tls Service label selector [deployment-charts] - https://gerrit.wikimedia.org/r/1084032 (https://phabricator.wikimedia.org/T378377)
[08:42:45] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9038
[08:42:53] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 8966
[08:43:44] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8966
[08:43:51] (CR) Novem Linguae: Enable electionadmin user group on enwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1083870 (https://phabricator.wikimedia.org/T378287) (owner: Dreamrimmer)
[08:44:05] (PS43) Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T377129)
[08:44:06] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 56258
[08:44:58] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 56258
[08:45:14] !log
kharlan@deploy2002 kharlan: Backport for [[gerrit:1084027|Unblock CI (T377947)]], [[gerrit:1083967|StatsLib: Set label for wiki ID (T375496)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:45:19] T377947: WikimediaEvents CI blocked with phan error - https://phabricator.wikimedia.org/T377947
[08:45:19] T375496: Temp accounts Grafana Dashboard: Edit rate for anonymous IP editors, named accounts, and temp accounts - https://phabricator.wikimedia.org/T375496
[08:45:56] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 200478
[08:46:11] !log kharlan@deploy2002 kharlan: Continuing with sync
[08:46:36] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 200478
[08:46:41] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 16591
[08:47:11] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 16591
[08:47:17] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 264567
[08:47:20] (Merged) jenkins-bot: sre.mysql.pool: fix check for spurious changes [cookbooks] - https://gerrit.wikimedia.org/r/1084029 (owner: Volans)
[08:47:23] (CR) Nikerabbit: [C:-1] tables-catalog: Add translate_message_group_subscriptions table (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1082549 (https://phabricator.wikimedia.org/T372287) (owner: Abijeet Patro)
[08:47:43] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 264567
[08:48:00] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 56258
[08:49:03] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 56258
[08:49:04] (CR) Nikerabbit: tables-catalog: Add translate_cache table (1 comment) [puppet] -
10https://gerrit.wikimedia.org/r/1082546 (https://phabricator.wikimedia.org/T370265) (owner: 10Abijeet Patro) [08:51:08] !log uploaded ircstream 1.0+wmf12u1 to apt.wikimedia.org T376014 [08:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:13] T376014: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014 [08:52:06] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2211.codfw.wmnet onto db2223.codfw.wmnet [08:52:11] arnaudb@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [08:52:15] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2211.codfw.wmnet onto db2223.codfw.wmnet [08:52:29] to wiki? [08:53:06] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1084027|Unblock CI (T377947)]], [[gerrit:1083967|StatsLib: Set label for wiki ID (T375496)]] (duration: 13m 06s) [08:53:11] T377947: WikimediaEvents CI blocked with phan error - https://phabricator.wikimedia.org/T377947 [08:53:11] T375496: Temp accounts Grafana Dashboard: Edit rate for anonymous IP editors, named accounts, and temp accounts - https://phabricator.wikimedia.org/T375496 [08:54:14] kostajh: are you done with the deployments? 
[08:54:24] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [08:54:25] hashar: almost [08:54:29] ok :) [08:54:43] I can do MichaelG_WMF change if you want ( https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1083836 ) [08:54:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083890 (https://phabricator.wikimedia.org/T378334) (owner: 10Kosta Harlan) [08:54:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083836 (https://phabricator.wikimedia.org/T376677) (owner: 10Michael Große) [08:54:59] Thank you :) [08:55:00] hashar: I'm syncing it now [08:55:05] wonderful [08:55:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: T378320', diff saved to https://phabricator.wikimedia.org/P70604 and previous config saved to /var/cache/conftool/dbconfig/20241029-085507-arnaudb.json [08:55:09] !log upgrade irc.wikimedia.org to ircstream 1.0+wmf12u1 T376014 [08:55:17] T378320: db1169 replication broken - enwiki.pagelinks corruption - https://phabricator.wikimedia.org/T378320 [08:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:22] MichaelG_WMF: you don't need to verify anything with this change, right? 
[08:55:37] Nope, it only changes something for one beta wiki [08:55:38] (03Merged) 10jenkins-bot: temp accounts: Enable temp account autocreation on five pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083890 (https://phabricator.wikimedia.org/T378334) (owner: 10Kosta Harlan) [08:55:40] (03Merged) 10jenkins-bot: beta: enable "Surfacing structured tasks" for an early beta-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083836 (https://phabricator.wikimedia.org/T376677) (owner: 10Michael Große) [08:56:08] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1083890|temp accounts: Enable temp account autocreation on five pilot wikis (T378334)]], [[gerrit:1083836|beta: enable "Surfacing structured tasks" for an early beta-wiki (T376677)]] [08:56:14] T378334: Temporary Accounts: Minor pilot - Oct 29 deploy - https://phabricator.wikimedia.org/T378334 [08:56:14] T376677: Surfacing Structured Tasks: set up a disabled-by-default feature flag - https://phabricator.wikimedia.org/T376677 [08:57:09] (03CR) 10Fabfur: haproxykafka: profile and hiera files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [08:57:32] (03PS18) 10Fabfur: haproxykafka: profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) [08:58:26] !log kharlan@deploy2002 migr, kharlan: Backport for [[gerrit:1083890|temp accounts: Enable temp account autocreation on five pilot wikis (T378334)]], [[gerrit:1083836|beta: enable "Surfacing structured tasks" for an early beta-wiki (T376677)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:58:33] hashar: I'll ping you when finished. 
I need some minutes to verify the temp accounts change [08:58:54] take your time, I was only offering to sync the mediawiki-config change above [08:58:58] I have nothing to deploy [09:00:05] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [09:01:45] 06SRE, 07SRE-Unowned, 06Infrastructure-Foundations, 13Patch-For-Review: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014#10271304 (10MoritzMuehlenhoff) Given https://github.com/paravoid/ircstream/commit/9b08f3... [09:04:11] (03PS2) 10Tiziano Fogli: add kartik to deploy-ml-services group [puppet] - 10https://gerrit.wikimedia.org/r/1083773 (https://phabricator.wikimedia.org/T376585) [09:06:43] (03Abandoned) 10Urbanecm: [DNM] Test CI [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082516 (owner: 10Urbanecm) [09:06:47] (03CR) 10Tiziano Fogli: add kartik to deploy-ml-services group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1083773 (https://phabricator.wikimedia.org/T376585) (owner: 10Tiziano Fogli) [09:07:14] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2211.codfw.wmnet onto db2223.codfw.wmnet [09:07:22] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2211.codfw.wmnet onto db2223.codfw.wmnet [09:07:50] (03CR) 10Tiziano Fogli: [C:03+2] add kartik to deploy-ml-services group [puppet] - 10https://gerrit.wikimedia.org/r/1083773 (https://phabricator.wikimedia.org/T376585) (owner: 10Tiziano Fogli) [09:08:12] (03PS19) 10Fabfur: haproxykafka: profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) [09:08:16] (03PS2) 10Tiziano Fogli: add gmodena to dumps-root [puppet] - 10https://gerrit.wikimedia.org/r/1083766 (https://phabricator.wikimedia.org/T377773) [09:09:51] (03CR) 
10Tiziano Fogli: [C:03+2] add gmodena to dumps-root [puppet] - 10https://gerrit.wikimedia.org/r/1083766 (https://phabricator.wikimedia.org/T377773) (owner: 10Tiziano Fogli) [09:10:47] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [09:13:03] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on pc1013.eqiad.wmnet with reason: maintenance, host is not pooled [09:13:16] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on pc1013.eqiad.wmnet with reason: maintenance, host is not pooled [09:14:08] still need a few more minutes [09:15:51] (03PS1) 10Sergio Gimeno: Growth [test2wiki]: enable community updates module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084036 (https://phabricator.wikimedia.org/T376952) [09:15:54] kostajh: I'm around if I can help with anything. [09:16:02] (03PS44) 10Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T377129) [09:16:10] Not seeing temp accounts yet on igwiki and itwiktionary [09:16:15] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1083768 (https://phabricator.wikimedia.org/T377728) (owner: 10Slyngshede) [09:16:15] !log kharlan@deploy2002 migr, kharlan: Continuing with sync [09:16:27] Niharika: enabling now [09:16:31] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:16:42] Can confirm that it is enabled on the expected beta-wiki for me. Thank you! [09:18:22] MichaelG_WMF: cool! 
[09:19:41] (03PS45) 10Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T377129) [09:20:06] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2211.codfw.wmnet onto db2223.codfw.wmnet [09:20:13] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2211.codfw.wmnet onto db2223.codfw.wmnet [09:20:50] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1083890|temp accounts: Enable temp account autocreation on five pilot wikis (T378334)]], [[gerrit:1083836|beta: enable "Surfacing structured tasks" for an early beta-wiki (T376677)]] (duration: 24m 42s) [09:20:56] T378334: Temporary Accounts: Minor pilot - Oct 29 deploy - https://phabricator.wikimedia.org/T378334 [09:20:57] T376677: Surfacing Structured Tasks: set up a disabled-by-default feature flag - https://phabricator.wikimedia.org/T376677 [09:21:09] (03PS1) 10Slyngshede: P:idp rewrite tgt lookup logic for idp-logout script [puppet] - 10https://gerrit.wikimedia.org/r/1084037 (https://phabricator.wikimedia.org/T377728) [09:21:12] (03CR) 10Elukey: [C:03+1] Remove support for using ircstream with eventstream (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1084026 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [09:21:15] Niharika: done [09:21:54] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2211.codfw.wmnet onto db2223.codfw.wmnet [09:22:02] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2211.codfw.wmnet onto db2223.codfw.wmnet [09:23:08] (03CR) 10CI reject: [V:04-1] P:idp rewrite tgt lookup logic for idp-logout script [puppet] - 10https://gerrit.wikimedia.org/r/1084037 (https://phabricator.wikimedia.org/T377728) (owner: 10Slyngshede) [09:23:32] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2211.codfw.wmnet onto db2223.codfw.wmnet [09:23:40] !log arnaudb@cumin1002 END (FAIL) - Cookbook 
sre.mysql.clone (exit_code=99) of db2211.codfw.wmnet onto db2223.codfw.wmnet [09:24:39] kostajh: Yep, I see them. Thanks. :) [09:26:05] (03CR) 10Slyngshede: [C:03+2] P:idp add Redis database and password configuration. [puppet] - 10https://gerrit.wikimedia.org/r/1083768 (https://phabricator.wikimedia.org/T377728) (owner: 10Slyngshede) [09:26:18] (03PS2) 10Muehlenhoff: Remove support for using ircstream with eventstream [puppet] - 10https://gerrit.wikimedia.org/r/1084026 (https://phabricator.wikimedia.org/T376014) [09:30:26] (03CR) 10Elukey: [C:03+1] Release new mesh.service module version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084030 (https://phabricator.wikimedia.org/T378377) (owner: 10Brouberol) [09:30:28] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Give Dumps 1.0 access to gmodena - https://phabricator.wikimedia.org/T377773#10271396 (10tappof) 05Stalled→03Resolved Patch merged. @gmodena is now a member of dumps-root group. [09:30:38] (03CR) 10Muehlenhoff: [C:03+2] Remove support for using ircstream with eventstream [puppet] - 10https://gerrit.wikimedia.org/r/1084026 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [09:32:54] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team, 10LPL Essential (LPL Essential 2024 Jul-Oct): Access to deploy recommendation API ML service for kartik - https://phabricator.wikimedia.org/T376585#10271400 (10tappof) 05Stalled→03Resolved a:03tappof patch merged. @KartikMistry is now a membe... [09:35:27] 06SRE, 10SRE-Access-Requests: Requesting access to 'deployment' for 'Joely Rooke WMDE' - https://phabricator.wikimedia.org/T378082#10271418 (10tappof) thank you @KFrancis. I'll move forward with the request. 
[09:37:39] (03PS20) 10Fabfur: haproxykafka: profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) [09:37:46] (03CR) 10Santiago Faci: [C:03+2] Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1083973 (https://phabricator.wikimedia.org/T373695) (owner: 10Clare Ming) [09:37:53] (03CR) 10Santiago Faci: [C:03+2] Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1083972 (https://phabricator.wikimedia.org/T373695) (owner: 10Clare Ming) [09:38:11] (03CR) 10Fabfur: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1081372 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [09:38:22] (03CR) 10Fabfur: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1080708 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [09:38:32] (03CR) 10Vgutierrez: [C:03+2] liberica: provide a liberica module [puppet] - 10https://gerrit.wikimedia.org/r/1080708 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [09:38:44] (03CR) 10Fabfur: [C:03+1] role,site: Provide a liberica role and use it on lvs1013 [puppet] - 10https://gerrit.wikimedia.org/r/1083778 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [09:38:57] (03Merged) 10jenkins-bot: Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1083973 (https://phabricator.wikimedia.org/T373695) (owner: 10Clare Ming) [09:39:16] (03Merged) 10jenkins-bot: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1083972 (https://phabricator.wikimedia.org/T373695) (owner: 10Clare Ming) [09:39:54] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) 
[09:40:27] (03CR) 10Elukey: "Left some nits but otherwise looks good to me! Let's have the buy-in from ServiceOps and we should be good to go." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084031 (https://phabricator.wikimedia.org/T378377) (owner: 10Brouberol) [09:41:31] (03CR) 10Elukey: [C:03+1] airflow: add the 'component: webserver' to the tls Service label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084032 (https://phabricator.wikimedia.org/T378377) (owner: 10Brouberol) [09:41:41] !log UTC morning deploys done [09:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:44] (03PS2) 10Slyngshede: P:idp rewrite tgt lookup logic for idp-logout script [puppet] - 10https://gerrit.wikimedia.org/r/1084037 (https://phabricator.wikimedia.org/T377728) [09:44:36] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10271454 (10MoritzMuehlenhoff) [09:51:03] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Support PyBal routes announced with lower priority than "backup" - https://phabricator.wikimedia.org/T354839#10271470 (10Vgutierrez) Given the limitations to run pybal and liberica on the same hosts, we want to run liberica on separate hosts with a h... [09:52:13] 06SRE, 10SRE-Access-Requests: Requesting access to 'deployment' for 'Joely Rooke WMDE' - https://phabricator.wikimedia.org/T378082#10271471 (10hnowlan) Just to note Joely has verified the SSH key in this ticket via slack [09:52:21] (03PS21) 10Fabfur: haproxykafka: profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) [09:52:23] (03PS1) 10Slyngshede: R:idp-test: Enable Redis on all test hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/1084045 (https://phabricator.wikimedia.org/T377728) [09:53:56] (03PS22) 10Fabfur: haproxykafka: profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) [09:53:58] jouncebot: nowandnext [09:53:58] No deployments scheduled for the next 0 hour(s) and 6 minute(s) [09:53:59] In 0 hour(s) and 6 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241029T1000) [09:54:01] I'm going to stop puppet and muck around with envoy on mwdebug1001 [09:56:08] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [09:56:34] !log installing wireshark security updates [09:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:31] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4424/console" [puppet] - 10https://gerrit.wikimedia.org/r/1084045 (https://phabricator.wikimedia.org/T377728) (owner: 10Slyngshede) [09:58:35] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4425/co" [puppet] - 10https://gerrit.wikimedia.org/r/1084045 (https://phabricator.wikimedia.org/T377728) (owner: 10Slyngshede) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241029T1000) [10:04:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2015.codfw.wmnet [10:05:15] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10271489 (10ops-monitoring-bot) Draining ganeti2015.codfw.wmnet 
of running VMs [10:07:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2015.codfw.wmnet [10:07:25] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-presto1016.eqiad.wmnet with OS bullseye [10:07:43] (03PS13) 10Fabfur: haproxykafka: haproxykafka module [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) [10:07:43] (03PS23) 10Fabfur: haproxykafka: profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) [10:08:51] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2211.codfw.wmnet onto db2223.codfw.wmnet [10:11:29] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [10:13:59] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: an-worker1165: Broken RAM - https://phabricator.wikimedia.org/T378454 (10MoritzMuehlenhoff) 03NEW [10:14:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2015.codfw.wmnet [10:15:06] (03PS24) 10Fabfur: haproxykafka: profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) [10:15:11] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10271544 (10ops-monitoring-bot) Draining ganeti2015.codfw.wmnet of running VMs [10:16:18] (03PS1) 10Muehlenhoff: Remove ganeti role from ganeti2015 [puppet] - 10https://gerrit.wikimedia.org/r/1084046 (https://phabricator.wikimedia.org/T376594) [10:18:50] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [10:18:55] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1083204 
(https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [10:23:38] (03PS14) 10Fabfur: haproxykafka: haproxykafka module [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) [10:24:01] (03PS25) 10Fabfur: haproxykafka: profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) [10:24:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082878 (https://phabricator.wikimedia.org/T378067) (owner: 10Dreamrimmer) [10:26:40] (03PS2) 10Brouberol: mesh.service: introduce a way to further specify the service label selectors [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084031 (https://phabricator.wikimedia.org/T378377) [10:26:40] (03PS2) 10Brouberol: airflow: add the 'component: webserver' to the tls Service label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084032 (https://phabricator.wikimedia.org/T378377) [10:26:46] (03CR) 10Brouberol: mesh.service: introduce a way to further specify the service label selectors (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084031 (https://phabricator.wikimedia.org/T378377) (owner: 10Brouberol) [10:30:07] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:30:33] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:30:47] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:31:09] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1083204 
(https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [10:32:12] (03PS11) 10Hnowlan: services_proxy: add tcp_keepalive parameter, enable for shellbox-video [puppet] - 10https://gerrit.wikimedia.org/r/1083207 (https://phabricator.wikimedia.org/T373517) [10:33:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082444 (https://phabricator.wikimedia.org/T377930) (owner: 10Superzerocool) [10:37:25] (03CR) 10Hnowlan: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1083207 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [10:42:05] (03PS15) 10Fabfur: haproxykafka: haproxykafka module [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) [10:42:34] (03PS26) 10Fabfur: haproxykafka: profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) [10:42:52] (03CR) 10CI reject: [V:04-1] haproxykafka: haproxykafka module [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [10:43:31] (03PS1) 10Elukey: profile::prometheus::ops: remove event-related config for ircstream [puppet] - 10https://gerrit.wikimedia.org/r/1084048 (https://phabricator.wikimedia.org/T376014) [10:44:05] (03CR) 10CI reject: [V:04-1] profile::prometheus::ops: remove event-related config for ircstream [puppet] - 10https://gerrit.wikimedia.org/r/1084048 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [10:45:03] (03PS2) 10Elukey: profile::prometheus::ops: remove event-related config for ircstream [puppet] - 10https://gerrit.wikimedia.org/r/1084048 (https://phabricator.wikimedia.org/T376014) [10:45:53] (03CR) 10Elukey: [C:03+1] "If you get the sign-off from ServiceOps you are good to go imho!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1084031 (https://phabricator.wikimedia.org/T378377) (owner: 10Brouberol) [10:46:21] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [10:48:47] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1084048 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [10:50:01] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2042.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:50:29] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 10 Dec 2024 11:59:32 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:51:25] PROBLEM - Host ganeti2042 is DOWN: PING CRITICAL - Packet loss = 100% [10:51:43] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 5.181 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:52:09] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52776 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:52:41] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2211.codfw.wmnet onto db2223.codfw.wmnet [10:53:41] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2042.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:54:41] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10271651 (10MoritzMuehlenhoff) [10:54:47] RECOVERY - Host ganeti2042 is UP: PING OK - Packet loss = 0%, RTA = 30.43 ms [10:56:00] (03Abandoned) 10Arturo Borrero Gonzalez: openstack: designate: nova_fixed_multi: base: refactor record creation routine [puppet] - 10https://gerrit.wikimedia.org/r/1083820 
(https://phabricator.wikimedia.org/T378192) (owner: 10Arturo Borrero Gonzalez) [10:56:12] (03CR) 10Elukey: [C:03+2] profile::prometheus::ops: remove event-related config for ircstream [puppet] - 10https://gerrit.wikimedia.org/r/1084048 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [10:58:40] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:59:05] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:59:50] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2044.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [11:01:48] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [11:01:58] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [11:02:58] 06SRE, 10SRE-Access-Requests: Requesting access to 'deployment' for 'Joely Rooke WMDE' - https://phabricator.wikimedia.org/T378082#10271673 (10tappof) [11:05:24] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ml-serve1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:05:37] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:07:00] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for Deepesha Burse WMDE - https://phabricator.wikimedia.org/T378182#10271693 (10tappof) 05Open→03Stalled [11:07:19] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ml-serve1010.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:07:30] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
ml-serve1010.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:08:47] 06SRE, 10SRE-Access-Requests: Requesting access to 'deployment' for 'Joely Rooke WMDE' - https://phabricator.wikimedia.org/T378082#10271688 (10tappof) 05Open→03Stalled @thcipriani your approval is needed for the deployment group. Thanks. [11:09:11] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ml-serve1011.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:09:23] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1011.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:09:59] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:10:08] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:10:35] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ml-lab1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:10:59] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-lab1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:11:43] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:11:53] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:14:22] 06SRE, 10Wikimedia-Mailing-lists: Remove disabled users from internal mailing lists - https://phabricator.wikimedia.org/T161004#10271733 (10Ladsgroup) Random note: If the email is disabled. Mailman after a while automatically removes them. It was broken for a while but it has been fixed for years now. 
[11:15:15] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti1039.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:15:27] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1039.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:16:18] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti1040.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:16:29] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1040.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:17:28] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10271737 (10jcrespo) This is not great for me. I need these hosts by 2024-09-08 as documented at T368926, and I am running out of space. If I start res... [11:17:37] jouncebot: nowandnext [11:17:37] No deployments scheduled for the next 0 hour(s) and 42 minute(s) [11:17:38] In 0 hour(s) and 42 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241029T1200) [11:18:18] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:18:29] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:19:50] (03PS1) 10Muehlenhoff: Remove obsolete stub cert for config-master [labs/private] - 10https://gerrit.wikimedia.org/r/1084058 (https://phabricator.wikimedia.org/T357750) [11:20:50] (03PS7) 10Clément Goubert: php*-fpm-multiversion: Add helper scripts for mwcron, mwscript [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082478 (https://phabricator.wikimedia.org/T377958) [11:21:40] !log elukey@cumin2002 START - Cookbook sre.hosts.provision 
for host ganeti1042.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:21:58] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1042.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:22:01] (03CR) 10Clément Goubert: [V:03+2 C:03+2] "Last PS is just a rebase with conflicts due to the weekly rebuild, considering previous +1 as still valid." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082478 (https://phabricator.wikimedia.org/T377958) (owner: 10Clément Goubert) [11:22:53] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti1043.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:23:03] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1043.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:23:32] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:23:49] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:24:25] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti1045.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:24:37] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1045.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:25:05] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti1046.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:25:25] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1046.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:26:45] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti1047.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:26:57] !log elukey@cumin2002 END 
(PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1047.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:27:37] !log Rebuilding php{7.4,8.1}-fpm-multiversion-base - T377958 [11:27:39] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti1048.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:42] T377958: Add helper script functionality to our php images - https://phabricator.wikimedia.org/T377958 [11:27:49] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1048.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:28:18] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti1049.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:28:35] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1049.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:29:02] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti1050.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:29:16] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1050.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:29:41] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti1051.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:29:55] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1051.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:30:17] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti1052.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:30:30] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1052.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:32:12] !log 
elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2036.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [11:32:29] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2036.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [11:33:05] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2037.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [11:33:14] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2037.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [11:34:45] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2038.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [11:34:55] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2038.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [11:35:17] (03PS1) 10Slyngshede: Blocklog: Show the username of the admin on the public log. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1084063 (https://phabricator.wikimedia.org/T376991) [11:35:20] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2039.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [11:35:26] !log cgoubert@deploy2002 Started scap sync-world: T377958 - full mediawiki image rebuild and deployment to add helper scripts for mwcron, mwscript [11:35:34] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2039.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [11:35:54] T377958: Add helper script functionality to our php images - https://phabricator.wikimedia.org/T377958 [11:36:04] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2040.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [11:36:17] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2040.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [11:36:36] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2041.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [11:36:47] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2041.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [11:37:16] (03PS16) 10Fabfur: haproxykafka: haproxykafka module [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) [11:38:50] (03PS2) 10Hamish: annwiki: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083952 (https://phabricator.wikimedia.org/T377535) [11:39:28] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2044.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [11:39:48] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2044.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [11:40:28] 10SRE-tools, 06DC-Ops, 
06Infrastructure-Foundations, 10Spicerack: Upload redfish licenses to supermicro hosts - https://phabricator.wikimedia.org/T376121#10271812 (10elukey) [11:41:19] (03PS3) 10Hamish: annwiki: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083952 (https://phabricator.wikimedia.org/T377535) [11:50:43] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack: Upload redfish licenses to supermicro hosts - https://phabricator.wikimedia.org/T376121#10271822 (10elukey) All the licenses are applied, the last steps are to run the provision cookbook on all nodes. [11:51:10] (03Abandoned) 10Elukey: [DO-NOT-MERGE] sre.hosts.provision: upload the Redfish license [cookbooks] - 10https://gerrit.wikimedia.org/r/1076975 (owner: 10Elukey) [11:56:26] (03PS17) 10Fabfur: haproxykafka: haproxykafka module [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) [11:56:49] (03PS27) 10Fabfur: haproxykafka: profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) [11:58:11] (03PS1) 10Hamish: rskwiki: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084069 [11:58:34] (03PS1) 10Máté Szabó: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084072 (https://phabricator.wikimedia.org/T375881) [11:59:53] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241029T1200) [12:02:43] (03CR) 10Ladsgroup: [C:03+1] Remove unused wgChronologyProtectorStash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083288 (https://phabricator.wikimedia.org/T336004) (owner: 10Krinkle) [12:04:19] !log cgoubert@deploy2002 Finished scap sync-world: T377958 - full mediawiki image rebuild and deployment to add helper scripts for mwcron, 
mwscript (duration: 29m 44s) [12:04:43] (03CR) 10Fabfur: "Sorry for the amount of patchsets, I think this should be ready for reviews" [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [12:04:56] T377958: Add helper script functionality to our php images - https://phabricator.wikimedia.org/T377958 [12:05:04] (03PS1) 10Mvolz: Use 0 workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084076 [12:08:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2015.codfw.wmnet [12:11:25] PROBLEM - Debian mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/debian is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors [12:11:33] (03PS1) 10Hamish: tddwiki: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084079 (https://phabricator.wikimedia.org/T377537) [12:12:29] (03PS2) 10Hamish: rskwiki: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084069 (https://phabricator.wikimedia.org/T377536) [12:13:21] (03PS3) 10Hamish: rskwiki: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084069 (https://phabricator.wikimedia.org/T377536) [12:19:01] PROBLEM - ganeti-confd running on ganeti2015 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [12:19:01] PROBLEM - ganeti-noded running on ganeti2015 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [12:19:13] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10272038 (10MoritzMuehlenhoff) [12:19:49] (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti role from ganeti2015 [puppet] - 10https://gerrit.wikimedia.org/r/1084046 
(https://phabricator.wikimedia.org/T376594) (owner: 10Muehlenhoff) [12:20:01] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10272033 (10MoritzMuehlenhoff) [12:21:31] FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:22:08] (03CR) 10Kamila Součková: [C:03+1] "low-confidence +1 from me -- I haven't been following the migration in detail, but it's probably fine? :D" [puppet] - 10https://gerrit.wikimedia.org/r/1083776 (https://phabricator.wikimedia.org/T378345) (owner: 10Elukey) [12:22:30] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete stub cert for config-master [labs/private] - 10https://gerrit.wikimedia.org/r/1084058 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [12:22:35] (03PS1) 10Hamish: ibawiki: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084088 (https://phabricator.wikimedia.org/T377538) [12:26:26] (03PS1) 10Hamish: moswiki: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084090 [12:27:35] !log andrewtavis-wmde@deploy2002 Started deploy [airflow-dags/wmde@d85a93c]: (no justification provided) [12:28:04] !log andrewtavis-wmde@deploy2002 Finished deploy [airflow-dags/wmde@d85a93c]: (no justification provided) (duration: 00m 30s) [12:32:14] (03PS1) 10Hamish: gorwikiquote: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084094 (https://phabricator.wikimedia.org/T377542) [12:33:00] (03PS2) 10Hamish: moswiki: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084090 (https://phabricator.wikimedia.org/T377539) [12:35:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 29 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082256 (https://phabricator.wikimedia.org/T372322) (owner: 10MacFan4000) [12:37:59] (03PS2) 10Hamish: gorwikiquote: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084094 (https://phabricator.wikimedia.org/T377542) [12:38:03] (03PS1) 10Zabe: ExtensionDistributor: Remove EOL 1.40 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084096 (https://phabricator.wikimedia.org/T364989) [12:41:07] (03PS1) 10Hamish: shnwikinews: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084097 (https://phabricator.wikimedia.org/T377543) [12:41:10] (03PS1) 10Gmodena: data-engineering: hdfs: alert on rate of rcp calls [alerts] - 10https://gerrit.wikimedia.org/r/1084098 [12:41:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084096 (https://phabricator.wikimedia.org/T364989) (owner: 10Zabe) [12:41:38] (03CR) 10Jforrester: "Thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084096 (https://phabricator.wikimedia.org/T364989) (owner: 10Zabe) [12:42:43] !log Killed dead and stacked import-wikitech.sh processes on wikitech-static - T374114 [12:42:44] (03PS2) 10Gmodena: data-engineering: hdfs: alert on rate of rcp calls [alerts] - 10https://gerrit.wikimedia.org/r/1084098 (https://phabricator.wikimedia.org/T376713) [12:42:44] (03CR) 10CI reject: [V:04-1] data-engineering: hdfs: alert on rate of rcp calls [alerts] - 10https://gerrit.wikimedia.org/r/1084098 (https://phabricator.wikimedia.org/T376713) (owner: 10Gmodena) [12:42:48] !log Manually relaunched import-wikitech.sh on wikitech-static - T374114 [12:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:06] T374114: Review/update wikitech-static syncing after wikitech moves to Kubernetes - https://phabricator.wikimedia.org/T374114 [12:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:12] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#10272184 (10MoritzMuehlenhoff) [12:43:55] (03CR) 10CI reject: [V:04-1] data-engineering: hdfs: alert on rate of rcp calls [alerts] - 10https://gerrit.wikimedia.org/r/1084098 (https://phabricator.wikimedia.org/T376713) (owner: 10Gmodena) [12:44:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 25%: post clone repool', diff saved to https://phabricator.wikimedia.org/P70606 and previous config saved to /var/cache/conftool/dbconfig/20241029-124440-arnaudb.json [12:44:53] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#10272186 (10MoritzMuehlenhoff) [12:45:20] (03PS11) 10Ayounsi: WIP: first scaffolding for JSON-RPC support [software/homer] - 10https://gerrit.wikimedia.org/r/939681 
(https://phabricator.wikimedia.org/T320638) [12:46:23] (03PS1) 10Hamish: kgewiki: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084100 (https://phabricator.wikimedia.org/T377075) [12:46:31] FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:47:26] (03PS1) 10Muehlenhoff: Remove obsolete stub certs [labs/private] - 10https://gerrit.wikimedia.org/r/1084101 (https://phabricator.wikimedia.org/T360636) [12:49:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083294 (https://phabricator.wikimedia.org/T377648) (owner: 10Hamish) [12:49:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083952 (https://phabricator.wikimedia.org/T377535) (owner: 10Hamish) [12:49:45] (03CR) 10CI reject: [V:04-1] WIP: first scaffolding for JSON-RPC support [software/homer] - 10https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T320638) (owner: 10Ayounsi) [12:49:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084100 (https://phabricator.wikimedia.org/T377075) (owner: 10Hamish) [12:50:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1084097 (https://phabricator.wikimedia.org/T377543) (owner: 10Hamish) [12:50:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084094 (https://phabricator.wikimedia.org/T377542) (owner: 10Hamish) [12:50:27] !log installing Apache security updates [12:50:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084090 (https://phabricator.wikimedia.org/T377539) (owner: 10Hamish) [12:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084088 (https://phabricator.wikimedia.org/T377538) (owner: 10Hamish) [12:50:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084069 (https://phabricator.wikimedia.org/T377536) (owner: 10Hamish) [12:50:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084079 (https://phabricator.wikimedia.org/T377537) (owner: 10Hamish) [12:51:48] (03PS1) 10Muehlenhoff: Remove wikitech stub cert [labs/private] - 10https://gerrit.wikimedia.org/r/1084104 (https://phabricator.wikimedia.org/T371878) [12:52:08] (03PS1) 10Ayounsi: WIP: wmf-netbox - expose 
interfaces in a SR-Linux format [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1084105 (https://phabricator.wikimedia.org/T371088) [12:52:29] (03PS12) 10Ayounsi: WIP: first scaffolding for JSON-RPC support [software/homer] - 10https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T371088) [12:53:27] (03CR) 10CI reject: [V:04-1] WIP: wmf-netbox - expose interfaces in a SR-Linux format [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1084105 (https://phabricator.wikimedia.org/T371088) (owner: 10Ayounsi) [12:53:27] 10SRE-Access-Requests: Access to ops mailing list - https://phabricator.wikimedia.org/T378484 (10zoe) 03NEW [12:54:10] (03PS2) 10Hamish: shnwikinews: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084097 (https://phabricator.wikimedia.org/T377543) [12:56:43] (03CR) 10CI reject: [V:04-1] WIP: first scaffolding for JSON-RPC support [software/homer] - 10https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T371088) (owner: 10Ayounsi) [12:57:07] (03PS1) 10Ayounsi: WIP: example config for Nokia SR-Linux [homer/public] - 10https://gerrit.wikimedia.org/r/1084107 (https://phabricator.wikimedia.org/T371088) [12:57:43] (03CR) 10CI reject: [V:04-1] WIP: example config for Nokia SR-Linux [homer/public] - 10https://gerrit.wikimedia.org/r/1084107 (https://phabricator.wikimedia.org/T371088) (owner: 10Ayounsi) [12:59:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 50%: post clone repool', diff saved to https://phabricator.wikimedia.org/P70607 and previous config saved to /var/cache/conftool/dbconfig/20241029-125945-arnaudb.json [13:00:04] Urbanecm and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241029T1300). 
[13:00:05] DreamRimmer, Superzerocool, James_F, and Hamishcz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:09] * James_F waves. [13:00:30] * Hamishcz says hi [13:01:21] (03CR) 10LMata: [C:03+1] icinga: Remove external monitoring rsa-2048 certs [puppet] - 10https://gerrit.wikimedia.org/r/1075615 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [13:02:35] I can deploy, I suppose. [13:03:11] Hamishcz: You've put a lot of commits. Normally the total limit is 6. [13:03:29] 10SRE-swift-storage, 06Commons, 10MediaWiki-extensions-Nuke, 06Moderator-Tools-Team: Double-deletion on Commons - https://phabricator.wikimedia.org/T173825#10272285 (10Samwalton9-WMF) [13:03:40] ah, im sorry as this is my first time to deploy mass patches here.... [13:03:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082256 (https://phabricator.wikimedia.org/T372322) (owner: 10MacFan4000) [13:03:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084096 (https://phabricator.wikimedia.org/T364989) (owner: 10Zabe) [13:03:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083294 (https://phabricator.wikimedia.org/T377648) (owner: 10Hamish) [13:04:09] I'll do mine and your first one. I'll do all the logos as a second mass one, I suppose. 
[13:04:27] no problem, appreciate [13:04:33] (03Merged) 10jenkins-bot: ExtensionDistributor: Mark 1.43 as beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082256 (https://phabricator.wikimedia.org/T372322) (owner: 10MacFan4000) [13:04:35] (03Merged) 10jenkins-bot: ExtensionDistributor: Remove EOL 1.40 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084096 (https://phabricator.wikimedia.org/T364989) (owner: 10Zabe) [13:04:38] (03Merged) 10jenkins-bot: enwiktionary: Enable mobile page tabs for non logged in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083294 (https://phabricator.wikimedia.org/T377648) (owner: 10Hamish) [13:05:04] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1082256|ExtensionDistributor: Mark 1.43 as beta (T372322)]], [[gerrit:1084096|ExtensionDistributor: Remove EOL 1.40 (T364989)]], [[gerrit:1083294|enwiktionary: Enable mobile page tabs for non logged in users (T377648)]] [13:05:41] T372322: Add REL1_43 to ExtensionDistributor as the development snapshot - https://phabricator.wikimedia.org/T372322 [13:05:41] T364989: Formally EOL 1.40 - https://phabricator.wikimedia.org/T364989 [13:05:42] T377648: Enable portlet links for logged-out users on the English Wiktionary - https://phabricator.wikimedia.org/T377648 [13:07:30] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet [13:10:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet [13:10:41] !log jforrester@deploy2002 zabe, macfan4000, hamishz, jforrester: Backport for [[gerrit:1082256|ExtensionDistributor: Mark 1.43 as beta (T372322)]], [[gerrit:1084096|ExtensionDistributor: Remove EOL 1.40 (T364989)]], [[gerrit:1083294|enwiktionary: Enable mobile page tabs for non logged in users (T377648)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:10:51] Yup, all working. 
[13:10:53] !log jforrester@deploy2002 zabe, macfan4000, hamishz, jforrester: Continuing with sync [13:11:21] T372322: Add REL1_43 to ExtensionDistributor as the development snapshot - https://phabricator.wikimedia.org/T372322 [13:11:21] T364989: Formally EOL 1.40 - https://phabricator.wikimedia.org/T364989 [13:11:22] T377648: Enable portlet links for logged-out users on the English Wiktionary - https://phabricator.wikimedia.org/T377648 [13:14:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 75%: post clone repool', diff saved to https://phabricator.wikimedia.org/P70610 and previous config saved to /var/cache/conftool/dbconfig/20241029-131451-arnaudb.json [13:15:05] (03CR) 10Tchanders: [C:03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084072 (https://phabricator.wikimedia.org/T375881) (owner: 10Máté Szabó) [13:16:07] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084072 (https://phabricator.wikimedia.org/T375881) (owner: 10Máté Szabó) [13:17:46] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082256|ExtensionDistributor: Mark 1.43 as beta (T372322)]], [[gerrit:1084096|ExtensionDistributor: Remove EOL 1.40 (T364989)]], [[gerrit:1083294|enwiktionary: Enable mobile page tabs for non logged in users (T377648)]] (duration: 12m 41s) [13:18:06] yes confirmed working [13:18:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083952 (https://phabricator.wikimedia.org/T377535) (owner: 10Hamish) [13:18:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084100 (https://phabricator.wikimedia.org/T377075) (owner: 10Hamish) [13:18:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1084097 (https://phabricator.wikimedia.org/T377543) (owner: 10Hamish) [13:18:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084094 (https://phabricator.wikimedia.org/T377542) (owner: 10Hamish) [13:18:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084090 (https://phabricator.wikimedia.org/T377539) (owner: 10Hamish) [13:18:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084088 (https://phabricator.wikimedia.org/T377538) (owner: 10Hamish) [13:18:14] T372322: Add REL1_43 to ExtensionDistributor as the development snapshot - https://phabricator.wikimedia.org/T372322 [13:18:15] T364989: Formally EOL 1.40 - https://phabricator.wikimedia.org/T364989 [13:18:15] T377648: Enable portlet links for logged-out users on the English Wiktionary - https://phabricator.wikimedia.org/T377648 [13:18:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084069 (https://phabricator.wikimedia.org/T377536) (owner: 10Hamish) [13:18:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084079 (https://phabricator.wikimedia.org/T377537) (owner: 10Hamish) [13:19:10] 8 in one go is a lot, even for me. 
[13:19:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet [13:19:28] (03Merged) 10jenkins-bot: moswiki: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084090 (https://phabricator.wikimedia.org/T377539) (owner: 10Hamish) [13:19:30] (03Merged) 10jenkins-bot: ibawiki: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084088 (https://phabricator.wikimedia.org/T377538) (owner: 10Hamish) [13:19:33] (03Merged) 10jenkins-bot: rskwiki: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084069 (https://phabricator.wikimedia.org/T377536) (owner: 10Hamish) [13:19:35] (03Merged) 10jenkins-bot: tddwiki: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084079 (https://phabricator.wikimedia.org/T377537) (owner: 10Hamish) [13:20:04] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1083952|annwiki: Add logo (T377535)]], [[gerrit:1084100|kgewiki: Add logo (T377075)]], [[gerrit:1084097|shnwikinews: Add logo (T377543)]], [[gerrit:1084094|gorwikiquote: Add logo (T377542)]], [[gerrit:1084090|moswiki: Add logo (T377539)]], [[gerrit:1084088|ibawiki: Add logo (T377538)]], [[gerrit:1084069|rskwiki: Add logo (T377536)]], [[gerrit:108407 [13:20:04] 9|tddwiki: Add logo (T377537)]] [13:20:43] T377535: Set logo for annwiki - https://phabricator.wikimedia.org/T377535 [13:20:44] T377075: Add logo, wordmark, and tagline for kgewiki - https://phabricator.wikimedia.org/T377075 [13:20:44] T377543: Set logo for shnwikinews - https://phabricator.wikimedia.org/T377543 [13:20:44] T377542: Set logo for gorwikiquote - https://phabricator.wikimedia.org/T377542 [13:20:45] T377539: Set logo for moswiki - https://phabricator.wikimedia.org/T377539 [13:20:45] T377538: Set logo for ibawiki - https://phabricator.wikimedia.org/T377538 [13:20:45] T377536: Set logo for rskwiki - https://phabricator.wikimedia.org/T377536 [13:20:46] T377537: Set logo for tddwiki - 
https://phabricator.wikimedia.org/T377537 [13:22:11] I will pay attention to the limit in the future...sorry for any inconvenience [13:22:31] !log jforrester@deploy2002 jforrester, hamishz: Backport for [[gerrit:1083952|annwiki: Add logo (T377535)]], [[gerrit:1084100|kgewiki: Add logo (T377075)]], [[gerrit:1084097|shnwikinews: Add logo (T377543)]], [[gerrit:1084094|gorwikiquote: Add logo (T377542)]], [[gerrit:1084090|moswiki: Add logo (T377539)]], [[gerrit:1084088|ibawiki: Add logo (T377538)]], [[gerrit:1084069|rskwiki: Add logo (T377536)]], [[gerrit:1084079|td [13:22:31] dwiki: Add logo (T377537)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:22:39] OK, they're live on mw-debug. [13:22:41] * James_F checks. [13:23:40] (03CR) 10CDanis: "one nit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084031 (https://phabricator.wikimedia.org/T378377) (owner: 10Brouberol) [13:23:44] I've checked a few and they're all fine. Will proceed. [13:23:46] !log jforrester@deploy2002 jforrester, hamishz: Continuing with sync [13:24:09] * Superzerocool waves ;( [13:24:21] (03CR) 10Brouberol: "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084031 (https://phabricator.wikimedia.org/T378377) (owner: 10Brouberol) [13:24:35] Hey Superzerocool, I'll do you after this block. 
[13:24:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet [13:26:30] !log mszabo@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [13:27:09] !log mszabo@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [13:28:02] !log mszabo@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [13:28:29] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1083952|annwiki: Add logo (T377535)]], [[gerrit:1084100|kgewiki: Add logo (T377075)]], [[gerrit:1084097|shnwikinews: Add logo (T377543)]], [[gerrit:1084094|gorwikiquote: Add logo (T377542)]], [[gerrit:1084090|moswiki: Add logo (T377539)]], [[gerrit:1084088|ibawiki: Add logo (T377538)]], [[gerrit:1084069|rskwiki: Add logo (T377536)]], [[gerrit:10840 [13:28:29] 79|tddwiki: Add logo (T377537)]] (duration: 08m 25s) [13:28:37] !log mszabo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [13:29:09] I've checked all and patches are likely to be all good [13:29:10] !log mszabo@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply [13:29:11] Okie-dokie, over to Superzerocool (and DreamRimmer). [13:29:15] Hamishcz: Thanks! [13:29:28] James_F: thank you so much :) [13:29:31] !log mszabo@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [13:29:33] Of course! 
[13:29:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082878 (https://phabricator.wikimedia.org/T378067) (owner: 10Dreamrimmer) [13:29:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082444 (https://phabricator.wikimedia.org/T377930) (owner: 10Superzerocool) [13:29:50] T377535: Set logo for annwiki - https://phabricator.wikimedia.org/T377535 [13:29:51] T377075: Add logo, wordmark, and tagline for kgewiki - https://phabricator.wikimedia.org/T377075 [13:29:51] T377543: Set logo for shnwikinews - https://phabricator.wikimedia.org/T377543 [13:29:51] T377542: Set logo for gorwikiquote - https://phabricator.wikimedia.org/T377542 [13:29:52] T377539: Set logo for moswiki - https://phabricator.wikimedia.org/T377539 [13:29:52] T377538: Set logo for ibawiki - https://phabricator.wikimedia.org/T377538 [13:29:52] T377536: Set logo for rskwiki - https://phabricator.wikimedia.org/T377536 [13:29:53] T377537: Set logo for tddwiki - https://phabricator.wikimedia.org/T377537 [13:29:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 100%: post clone repool', diff saved to https://phabricator.wikimedia.org/P70612 and previous config saved to /var/cache/conftool/dbconfig/20241029-132956-arnaudb.json [13:30:00] (03PS2) 10Jforrester: wikitech: Stop loading the i18n for LdapAuthentication, no longer used [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078105 (https://phabricator.wikimedia.org/T371592) [13:30:21] (03Merged) 10jenkins-bot: Allow admins on testwiki to grant and remove upwizcampeditors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082878 (https://phabricator.wikimedia.org/T378067) (owner: 10Dreamrimmer) [13:30:25] (03Merged) 10jenkins-bot: nlwiki, commonswiki, wikidata: lift IP cap for edit-a-thon [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1082444 (https://phabricator.wikimedia.org/T377930) (owner: 10Superzerocool) [13:30:51] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1082878|Allow admins on testwiki to grant and remove upwizcampeditors (T378067)]], [[gerrit:1082444|nlwiki, commonswiki, wikidata: lift IP cap for edit-a-thon (T377930)]] [13:31:16] ah.. found an error........... [13:32:24] T378067: Allow admins on testwiki to grant and remove upwizcampeditors - https://phabricator.wikimedia.org/T378067 [13:32:24] T377930: Lift of IP Cap Request (middle of November 2024) - https://phabricator.wikimedia.org/T377930 [13:33:12] !log jforrester@deploy2002 dreamrimmer, superzerocool, jforrester: Backport for [[gerrit:1082878|Allow admins on testwiki to grant and remove upwizcampeditors (T378067)]], [[gerrit:1082444|nlwiki, commonswiki, wikidata: lift IP cap for edit-a-thon (T377930)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:33:26] <3 [13:33:33] thanks Hamishcz :) [13:33:35] (03CR) 10Ssingh: [C:03+1] prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [13:34:04] maybe you want to thank James_F ? [13:34:06] lol [13:34:10] :-) [13:34:12] xD [13:34:15] !log jforrester@deploy2002 dreamrimmer, superzerocool, jforrester: Continuing with sync [13:34:35] well well... thanks James_F =) /me sending cookis :) [13:34:39] cookies* [13:36:14] (03CR) 10CDanis: [C:03+1] "+1 on the plan, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1083776 (https://phabricator.wikimedia.org/T378345) (owner: 10Elukey) [13:36:43] James_F: ibawiki's tagline cannot display due to path error, you want to revert the previous one and do a new one, or I just revise the path based on current patch? 
[13:37:16] Hmm, let me look [13:37:32] (03CR) 10Vgutierrez: [C:04-1] haproxykafka: haproxykafka module (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [13:37:49] How odd. [13:37:57] Do you have a fix? [13:38:03] yes [13:38:16] Please push it and I'll deploy. [13:38:22] ok [13:38:51] (03PS1) 10Hamish: fix ibawiki's tagline svg path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084110 [13:38:53] (03PS1) 10Kosta Harlan: AuthManagerStatsdHandler: Add label for wiki [extensions/WikimediaEvents] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084111 (https://phabricator.wikimedia.org/T375505) [13:38:54] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082878|Allow admins on testwiki to grant and remove upwizcampeditors (T378067)]], [[gerrit:1082444|nlwiki, commonswiki, wikidata: lift IP cap for edit-a-thon (T377930)]] (duration: 08m 03s) [13:39:07] ^ that one [13:39:09] Oh, right. 
[13:39:09] (03PS1) 10Kosta Harlan: AuthManagerStatsdHandler: Add label for wiki [extensions/WikimediaEvents] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1084112 (https://phabricator.wikimedia.org/T375505) [13:39:12] (03CR) 10Jforrester: [C:03+2] fix ibawiki's tagline svg path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084110 (owner: 10Hamish) [13:39:28] T378067: Allow admins on testwiki to grant and remove upwizcampeditors - https://phabricator.wikimedia.org/T378067 [13:39:28] T377930: Lift of IP Cap Request (middle of November 2024) - https://phabricator.wikimedia.org/T377930 [13:39:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084110 (owner: 10Hamish) [13:39:45] daylight confusion time, I thought the window started in 20 minutes :( [13:39:52] (03Merged) 10jenkins-bot: fix ibawiki's tagline svg path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084110 (owner: 10Hamish) [13:40:08] kostajh: I blame George Bush. [13:40:16] I'd like to self-serve two patches if there's still time when you're finished, James_F [13:40:17] lol [13:40:19] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1084110|fix ibawiki's tagline svg path]] [13:40:47] kostajh: Sure. [13:41:53] James_F: good now [13:42:08] and the other logos are good as well FYI [13:42:13] !log installing ghostscript security updates [13:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:31] James_F: thx. please ping me when you're finished. I've added my patches to the calendar [13:42:43] !log jforrester@deploy2002 jforrester, hamishz: Backport for [[gerrit:1084110|fix ibawiki's tagline svg path]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:42:48] kostajh: Did you want me to deploy? 
[13:43:22] !log jforrester@deploy2002 jforrester, hamishz: Continuing with sync [13:45:07] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 16347 [13:45:15] (03CR) 10CDanis: otelcol-contrib: add tail_sampling config for thanos-query (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1083892 (https://phabricator.wikimedia.org/T378190) (owner: 10Herron) [13:45:20] (03CR) 10Jforrester: [C:03+2] AuthManagerStatsdHandler: Add label for wiki [extensions/WikimediaEvents] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1084112 (https://phabricator.wikimedia.org/T375505) (owner: 10Kosta Harlan) [13:45:21] (03CR) 10Jforrester: [C:03+2] AuthManagerStatsdHandler: Add label for wiki [extensions/WikimediaEvents] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084111 (https://phabricator.wikimedia.org/T375505) (owner: 10Kosta Harlan) [13:45:34] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 16347 [13:45:43] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 16347 [13:45:47] kostajh: Over to you; my deploy sync will be done in a few minutes, well before your backports have finished CI. 
[13:46:11] (03CR) 10Vgutierrez: [C:04-1] haproxykafka: profile and hiera files (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [13:46:43] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 16347 [13:48:00] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1084110|fix ibawiki's tagline svg path]] (duration: 07m 41s) [13:48:15] (03PS1) 10Fabfur: haproxy: add ring support to configuration [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) [13:50:13] James_F: thx [13:50:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1084112 (https://phabricator.wikimedia.org/T375505) (owner: 10Kosta Harlan) [13:50:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084111 (https://phabricator.wikimedia.org/T375505) (owner: 10Kosta Harlan) [13:52:02] (03PS1) 10Stevemunene: Adjust spark-history rediness probe delay [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084114 (https://phabricator.wikimedia.org/T378497) [13:55:44] (03PS2) 10Stevemunene: Adjust spark-history rediness probe delay [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084114 (https://phabricator.wikimedia.org/T378497) [13:55:55] jouncebot: nowandnext [13:55:55] For the next 0 hour(s) and 4 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241029T1300) [13:55:55] In 1 hour(s) and 4 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241029T1500) [13:56:15] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G 
and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10272741 (10ssingh) Hi @RobH: thanks for writing this up. The instructions, hostnames (and serial numbers) all look good. The date/time also work for Traf... [13:56:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet [13:56:43] (03Merged) 10jenkins-bot: AuthManagerStatsdHandler: Add label for wiki [extensions/WikimediaEvents] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1084112 (https://phabricator.wikimedia.org/T375505) (owner: 10Kosta Harlan) [13:56:53] (03Merged) 10jenkins-bot: AuthManagerStatsdHandler: Add label for wiki [extensions/WikimediaEvents] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084111 (https://phabricator.wikimedia.org/T375505) (owner: 10Kosta Harlan) [13:57:25] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1084112|AuthManagerStatsdHandler: Add label for wiki (T375505)]], [[gerrit:1084111|AuthManagerStatsdHandler: Add label for wiki (T375505)]] [13:57:41] T375505: Temp accounts Grafana Dashboard: Rate of temporary account creation - https://phabricator.wikimedia.org/T375505 [13:57:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet [13:59:45] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1084112|AuthManagerStatsdHandler: Add label for wiki (T375505)]], [[gerrit:1084111|AuthManagerStatsdHandler: Add label for wiki (T375505)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:00:11] !log kharlan@deploy2002 kharlan: Continuing with sync [14:01:06] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [14:01:16] (03PS2) 10Majavah: P:mediawiki: Stop trying to run jobs on s11 [puppet] - 10https://gerrit.wikimedia.org/r/1083584 (https://phabricator.wikimedia.org/T378260) [14:01:18] 
(03CR) 10Ladsgroup: [C:03+2] P:mediawiki: Stop trying to run jobs on s11 [puppet] - 10https://gerrit.wikimedia.org/r/1083584 (https://phabricator.wikimedia.org/T378260) (owner: 10Majavah) [14:01:31] (03CR) 10Ladsgroup: [V:03+2 C:03+2] P:mediawiki: Stop trying to run jobs on s11 [puppet] - 10https://gerrit.wikimedia.org/r/1083584 (https://phabricator.wikimedia.org/T378260) (owner: 10Majavah) [14:02:08] (03CR) 10Ssingh: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1083778 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [14:02:59] PROBLEM - Host ml-lab1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:04:10] (03PS18) 10Fabfur: haproxykafka: haproxykafka module [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) [14:04:24] (03CR) 10Fabfur: haproxykafka: haproxykafka module (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [14:05:19] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1084112|AuthManagerStatsdHandler: Add label for wiki (T375505)]], [[gerrit:1084111|AuthManagerStatsdHandler: Add label for wiki (T375505)]] (duration: 07m 53s) [14:05:45] (03CR) 10Reedy: [C:03+2] SpecialGadgets: Replace deprecated SkinFactory::getSkinNames() call [extensions/Gadgets] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1083962 (https://phabricator.wikimedia.org/T377521) (owner: 10Reedy) [14:05:48] (03CR) 10Reedy: [C:03+2] ChangeSkinPref: Replace deprecated SkinFactory::getSkinNames() call [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1083963 (https://phabricator.wikimedia.org/T377521) (owner: 10Reedy) [14:05:55] T375505: Temp accounts Grafana Dashboard: Rate of temporary account creation - https://phabricator.wikimedia.org/T375505 [14:06:16] !log UTC afternoon deploys done [14:06:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:25] (03PS19) 10Fabfur: haproxykafka: haproxykafka module [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) [14:07:19] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [14:07:24] (03PS28) 10Fabfur: haproxykafka: profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) [14:08:27] RECOVERY - Host ml-lab1001 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [14:08:29] (03PS1) 10Elukey: sre.hosts.provision: wait for 5 mins after rebooting a Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1084119 [14:09:42] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2036.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:10:54] (03CR) 10Vgutierrez: [C:04-1] haproxykafka: haproxykafka module (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [14:11:29] PROBLEM - Host ganeti2036 is DOWN: PING CRITICAL - Packet loss = 100% [14:11:41] (03CR) 10Elukey: "Still test-cookbooking this, I noticed something weird namely that after the first chassis reset sometimes the new settings just applied (" [cookbooks] - 10https://gerrit.wikimedia.org/r/1084119 (owner: 10Elukey) [14:12:49] FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:13:57] (03CR) 10Vgutierrez: [C:04-1] haproxykafka: haproxykafka module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [14:13:59] RECOVERY - Host ganeti2036 is UP: PING OK - Packet loss = 
0%, RTA = 30.38 ms [14:14:55] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2036.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:15:12] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance [14:15:25] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance [14:15:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2116 (T376905)', diff saved to https://phabricator.wikimedia.org/P70614 and previous config saved to /var/cache/conftool/dbconfig/20241029-141532-ladsgroup.json [14:15:37] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ml-lab1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [14:16:25] (03PS5) 10Ssingh: durum: include throttling class, enable it on durum2001, accept/log only [puppet] - 10https://gerrit.wikimedia.org/r/1059156 (owner: 10Dzahn) [14:16:31] FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:16:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet [14:17:33] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4426/co" [puppet] - 10https://gerrit.wikimedia.org/r/1059156 (owner: 10Dzahn) [14:17:35] PROBLEM - Host ml-lab1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:18:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet [14:18:10] (03PS29) 10Fabfur: haproxykafka: profile and hiera files [puppet] - 
10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) [14:18:26] (03CR) 10Fabfur: haproxykafka: profile and hiera files (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [14:18:36] !log restart rsyslog on centrallog1002 - connection errors, failing prometheus probes [14:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:03] RECOVERY - Host ml-lab1002 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [14:20:52] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-lab1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [14:21:14] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [14:21:28] 06SRE, 10Wikimedia-Mailing-lists: Create a mail address for Russian Wikipedia oversighters - https://phabricator.wikimedia.org/T378069#10272859 (10Ladsgroup) Considering https://meta.wikimedia.org/wiki/Mailing_lists/Standardization it should be `wikipedia-ru-oversighters@lists.wikimedia.org`. I would set it n... 
[14:23:23] (03PS2) 10Elukey: sre.hosts.provision: wait for 5 mins after rebooting a Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1084119 [14:24:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T376905)', diff saved to https://phabricator.wikimedia.org/P70615 and previous config saved to /var/cache/conftool/dbconfig/20241029-142405-ladsgroup.json [14:24:14] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2037.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:25:27] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack: Upload redfish licenses to supermicro hosts - https://phabricator.wikimedia.org/T376121#10272871 (10elukey) [14:25:49] !log T372337 clearing dangling database-records for link suggestions by running `mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki=eswiki --db-table --force` [14:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:13] PROBLEM - Host ganeti2037 is DOWN: PING CRITICAL - Packet loss = 100% [14:26:18] T372337: High number of dangling search index results at fr.wikipedia or it.wikipedia - https://phabricator.wikimedia.org/T372337 [14:26:23] PROBLEM - Host ml-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [14:26:38] (03PS20) 10Fabfur: haproxykafka: haproxykafka module [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) [14:26:55] (03CR) 10Fabfur: haproxykafka: haproxykafka module (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [14:27:12] (03PS30) 10Fabfur: haproxykafka: profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) [14:27:49] FIRING: [6x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:28:25] RECOVERY - Host ganeti2037 is UP: PING OK - Packet loss = 0%, RTA = 30.46 ms [14:29:28] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2037.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:29:44] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2037.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:30:02] (03CR) 10Elukey: [V:03+1 C:03+2] role::aux_k8s::{master,worker}: add support for containerd [puppet] - 10https://gerrit.wikimedia.org/r/1083776 (https://phabricator.wikimedia.org/T378345) (owner: 10Elukey) [14:30:41] (03CR) 10Herron: otelcol-contrib: add tail_sampling config for thanos-query (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1083892 (https://phabricator.wikimedia.org/T378190) (owner: 10Herron) [14:31:22] (03Merged) 10jenkins-bot: SpecialGadgets: Replace deprecated SkinFactory::getSkinNames() call [extensions/Gadgets] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1083962 (https://phabricator.wikimedia.org/T377521) (owner: 10Reedy) [14:31:22] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [14:31:24] (03PS2) 10Herron: otelcol-contrib: add tail_sampling config for thanos-query [puppet] - 10https://gerrit.wikimedia.org/r/1083892 (https://phabricator.wikimedia.org/T378190) [14:31:24] (03Merged) 10jenkins-bot: ChangeSkinPref: Replace deprecated SkinFactory::getSkinNames() call [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1083963 (https://phabricator.wikimedia.org/T377521) (owner: 10Reedy) [14:31:25] PROBLEM - Host ganeti2037 is DOWN: PING CRITICAL - Packet loss = 100% [14:31:31] FIRING: [6x] ProbeDown: Service centrallog1002:6514 
has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:31:34] (03CR) 10CDanis: [C:03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1083892 (https://phabricator.wikimedia.org/T378190) (owner: 10Herron) [14:31:51] (03CR) 10Herron: otelcol-contrib: add tail_sampling config for thanos-query (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1083892 (https://phabricator.wikimedia.org/T378190) (owner: 10Herron) [14:32:49] FIRING: [6x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:32:52] !log reedy@deploy2002 Started scap sync-world: 1.44.0-wmf.1 backports to fix deprecated logspam T375660 T377521 [14:33:25] T375660: 1.44.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T375660 [14:33:25] T377521: Remove deprecated skin methods, hard deprecate soft deprecated methods - https://phabricator.wikimedia.org/T377521 [14:34:01] RECOVERY - Host ganeti2037 is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms [14:34:03] (03CR) 10Herron: [C:03+2] otelcol-contrib: add tail_sampling config for thanos-query [puppet] - 10https://gerrit.wikimedia.org/r/1083892 (https://phabricator.wikimedia.org/T378190) (owner: 10Herron) [14:34:56] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1003.eqiad.wmnet with OS bookworm [14:34:58] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2037.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:35:52] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2038.codfw.wmnet [14:35:53] RECOVERY - Host ml-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 30.68 
ms [14:36:31] FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:37:16] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:34] (03CR) 10Vgutierrez: [C:03+1] haproxykafka: profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [14:38:53] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [14:38:55] !log centrallog1002:~# systemctl restart rsyslogd [14:39:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P70616 and previous config saved to /var/cache/conftool/dbconfig/20241029-143912-ladsgroup.json [14:39:17] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64610/IPv6: Connect - aux-k8s-eqiad, AS64610/IPv4: Connect - aux-k8s-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:21] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64610/IPv6: Connect - aux-k8s-eqiad, AS64610/IPv4: Active - aux-k8s-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:39:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2038.codfw.wmnet [14:40:13] !log reedy@deploy2002 Finished scap sync-world: 1.44.0-wmf.1 backports to fix deprecated logspam T375660 T377521 (duration: 
07m 21s) [14:40:40] T375660: 1.44.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T375660 [14:40:40] T377521: Remove deprecated skin methods, hard deprecate soft deprecated methods - https://phabricator.wikimedia.org/T377521 [14:40:57] herron: o/ I did it as well 10 mins ago :D [14:41:03] PROBLEM - Host dse-k8s-worker1009 is DOWN: PING CRITICAL - Packet loss = 100% [14:41:03] was it still broken? [14:41:20] (03CR) 10Vgutierrez: haproxykafka: haproxykafka module (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [14:43:31] (03CR) 10Elukey: [C:03+2] sre.hosts.provision: wait for 5 mins after rebooting a Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1084119 (owner: 10Elukey) [14:43:37] RECOVERY - Host dse-k8s-worker1009 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [14:44:10] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [14:44:33] (03PS1) 10Ayounsi: Add RIPE RIS sessions to Homer [homer/public] - 10https://gerrit.wikimedia.org/r/1084125 [14:45:57] elukey: thanks! hmm it looks better now at any rate, I saw it firing a few times [14:46:30] (03PS21) 10Fabfur: haproxykafka: haproxykafka module [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) [14:46:43] (03CR) 10Ayounsi: "https://phabricator.wikimedia.org/P70618" [homer/public] - 10https://gerrit.wikimedia.org/r/1084125 (owner: 10Ayounsi) [14:47:09] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti2038.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:47:56] 06SRE, 10Wikimedia-Mailing-lists: Create a mail address for Russian Wikipedia oversighters - https://phabricator.wikimedia.org/T378069#10272995 (10Ladsgroup) @MBH Can you send me an email with list of email addresses you want as admin? 
I create the mailing list once that's done. [14:49:29] PROBLEM - Host ganeti2038 is DOWN: PING CRITICAL - Packet loss = 100% [14:50:28] (03CR) 10CI reject: [V:04-1] Add RIPE RIS sessions to Homer [homer/public] - 10https://gerrit.wikimedia.org/r/1084125 (owner: 10Ayounsi) [14:50:50] (03PS22) 10Fabfur: haproxykafka: haproxykafka module [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) [14:51:21] (03CR) 10Vgutierrez: [C:03+1] "BTW, a spec test here would be pretty useful as it would prevent from merging a wrong configuration on hiera" [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [14:51:31] FIRING: [6x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:51:59] RECOVERY - Host ganeti2038 is UP: PING OK - Packet loss = 0%, RTA = 30.46 ms [14:52:17] (03CR) 10Fabfur: haproxykafka: haproxykafka module (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [14:52:24] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2038.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:52:27] (03PS31) 10Fabfur: haproxykafka: profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) [14:52:31] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1084063 (https://phabricator.wikimedia.org/T376991) (owner: 10Slyngshede) [14:52:35] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker1003.eqiad.wmnet with reason: host reimage [14:52:49] FIRING: [6x] ProbeDown: Service centrallog1002:6514 has failed probes 
(tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:53:28] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2039.codfw.wmnet [14:54:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P70619 and previous config saved to /var/cache/conftool/dbconfig/20241029-145419-ladsgroup.json [14:55:01] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker1003.eqiad.wmnet with reason: host reimage [14:55:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2039.codfw.wmnet [14:55:12] (03PS2) 10Majavah: Drop labtestwikitech name [dns] - 10https://gerrit.wikimedia.org/r/1083306 (https://phabricator.wikimedia.org/T378260) [14:56:42] (03CR) 10Majavah: [C:03+2] Drop labtestwikitech name [dns] - 10https://gerrit.wikimedia.org/r/1083306 (https://phabricator.wikimedia.org/T378260) (owner: 10Majavah) [14:56:43] jouncebot: nowandnext [14:56:43] No deployments scheduled for the next 0 hour(s) and 3 minute(s) [14:56:43] In 0 hour(s) and 3 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241029T1500) [14:57:09] (03PS1) 10Arnaudb: mariadb: productionize db2235 [puppet] - 10https://gerrit.wikimedia.org/r/1084128 (https://phabricator.wikimedia.org/T373579) [14:57:09] (03CR) 10Arnaudb: "Please note that T378503 has been opened while preparing this CR" [puppet] - 10https://gerrit.wikimedia.org/r/1084128 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [14:59:38] messing about on mwdebug1001 again, puppet will be disabled and there might be some brief mesh issues [15:00:05] eoghan, jelto, arnoldokoth, and mutante: It is that lovely time of the 
day again! You are hereby commanded to deploy SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241029T1500). [15:00:19] !log Running php maintenance/deleteArchivedFiles.php --delete on wikitech-static - T374114 [15:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:57] T374114: Review/update wikitech-static syncing after wikitech moves to Kubernetes - https://phabricator.wikimedia.org/T374114 [15:02:16] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:04:57] (03PS2) 10Ayounsi: Add RIPE RIS sessions to Homer [homer/public] - 10https://gerrit.wikimedia.org/r/1084125 [15:04:58] (03PS1) 10Ayounsi: Add BGP.tools sessions [homer/public] - 10https://gerrit.wikimedia.org/r/1084129 [15:05:36] (03CR) 10Vgutierrez: [C:03+1] haproxykafka: haproxykafka module [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [15:05:36] (03CR) 10CI reject: [V:04-1] Add BGP.tools sessions [homer/public] - 10https://gerrit.wikimedia.org/r/1084129 (owner: 10Ayounsi) [15:06:23] (03PS2) 10Ayounsi: Add BGP.tools sessions [homer/public] - 10https://gerrit.wikimedia.org/r/1084129 [15:06:37] (03CR) 10Brouberol: [C:03+1] Adjust spark-history rediness probe delay [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084114 (https://phabricator.wikimedia.org/T378497) (owner: 10Stevemunene) [15:07:07] (03CR) 10Ayounsi: "https://phabricator.wikimedia.org/P70620" [homer/public] - 10https://gerrit.wikimedia.org/r/1084129 (owner: 10Ayounsi) [15:08:32] !log Running `find /srv/mediawiki/images/wikitech/archive -type f | xargs rm` on wikitech-static - T374114 T348503 [15:08:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:53] T374114: Review/update wikitech-static syncing after wikitech moves to Kubernetes - https://phabricator.wikimedia.org/T374114 [15:08:53] T348503: wikitech-static is out of disk - https://phabricator.wikimedia.org/T348503 [15:09:15] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:09:17] (03PS2) 10Arnaudb: mysql_legacy: fix _list_host_instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/1084132 (https://phabricator.wikimedia.org/T374191) [15:09:21] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:09:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T376905)', diff saved to https://phabricator.wikimedia.org/P70621 and previous config saved to /var/cache/conftool/dbconfig/20241029-150926-ladsgroup.json [15:09:33] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [15:09:41] (03CR) 10Arnaudb: "here is my hotfix to list_host_instances" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1084132 (https://phabricator.wikimedia.org/T374191) (owner: 10Arnaudb) [15:09:46] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [15:09:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2130 (T376905)', diff saved to https://phabricator.wikimedia.org/P70622 and previous config saved to /var/cache/conftool/dbconfig/20241029-150953-ladsgroup.json [15:10:19] (03PS3) 10Brouberol: mesh.service: introduce a way to further 
specify the service label selectors [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084031 (https://phabricator.wikimedia.org/T378377) [15:10:21] !log Running `/usr/bin/systemd-cat -t "import-wikitech.sh" /wikitech-static/wikitechsync/import-wikitech.sh &` on wikitech-static - T348503 [15:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:43] (03PS7) 10Brouberol: airflow: enable Kerberos auth on the API backend [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) [15:12:08] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti2039.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:12:29] (03PS8) 10Brouberol: airflow: enable Kerberos auth on the API backend [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) [15:13:06] (03CR) 10Clément Goubert: [C:03+1] Remove obsolete stub certs [labs/private] - 10https://gerrit.wikimedia.org/r/1084101 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [15:13:27] (03CR) 10CI reject: [V:04-1] airflow: enable Kerberos auth on the API backend [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) (owner: 10Brouberol) [15:13:57] PROBLEM - Host kubestagemaster2004 is DOWN: PING CRITICAL - Packet loss = 100% [15:13:57] (03CR) 10Ladsgroup: [C:03+1] Drop labtestwiki config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083304 (https://phabricator.wikimedia.org/T378260) (owner: 10Majavah) [15:14:16] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker1003.eqiad.wmnet with OS bookworm [15:14:22] (03CR) 10Majavah: Drop labtestwiki config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083304 (https://phabricator.wikimedia.org/T378260) (owner: 10Majavah) [15:14:35] PROBLEM - Host ganeti2039 is DOWN: PING 
CRITICAL - Packet loss = 100% [15:15:43] (03CR) 10Elukey: "Done" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1081989 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [15:15:46] (03CR) 10Ladsgroup: Drop 'nonglobal' dblist (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083493 (owner: 10Majavah) [15:16:02] (03CR) 10Ladsgroup: [C:03+1] Drop labtestwiki config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083304 (https://phabricator.wikimedia.org/T378260) (owner: 10Majavah) [15:16:19] RECOVERY - Host ganeti2039 is UP: PING OK - Packet loss = 0%, RTA = 33.24 ms [15:16:31] FIRING: [7x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:17:13] (03CR) 10Elukey: [C:03+1] "LGTM! Before merging please try to run a local build with docker-pkg (ping me if you never done it) so we can verify that everything build" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1081989 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [15:17:23] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2039.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:17:49] FIRING: [7x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:20:27] RECOVERY - Host kubestagemaster2004 is UP: PING OK - Packet loss = 0%, RTA = 33.62 ms [15:20:38] FIRING: KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:20:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T376905)', diff saved to https://phabricator.wikimedia.org/P70623 and previous config saved to /var/cache/conftool/dbconfig/20241029-152047-ladsgroup.json [15:22:50] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2040.codfw.wmnet [15:23:39] (03PS1) 10Majavah: hieradata: Update Striker to 2024-10-29-151446-production [puppet] - 10https://gerrit.wikimedia.org/r/1084141 [15:24:30] (03CR) 10Scott French: "Thanks, Luca! Yes, indeed - I built locally before posting, with no issues encountered (and did some basic smoke testing of the resulting " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1081989 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [15:25:24] !log test prefering lumen-ATT path in eqiad [15:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:38] RESOLVED: KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:25:42] (03CR) 10Majavah: [C:03+2] hieradata: Update Striker to 2024-10-29-151446-production [puppet] - 10https://gerrit.wikimedia.org/r/1084141 (owner: 10Majavah) [15:26:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2040.codfw.wmnet [15:26:31] FIRING: [7x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:26:46] 06SRE, 10AbuseFilter, 06cloud-services-team, 06Data Products, and 8 others: Public wiki replicas contain abuse filter logs for filters that are private or protected - https://phabricator.wikimedia.org/T375751#10273184 (10sbassett) 05Open→03Resolved a:03Dreamy_Jazz >>! In T375751#10271250, @kostaj... [15:31:17] 06SRE, 10AbuseFilter, 06cloud-services-team, 06Data Products, and 8 others: Public wiki replicas contain abuse filter logs for filters that are private or protected - https://phabricator.wikimedia.org/T375751#10273261 (10Dreamy_Jazz) Are we sure that the replicas have been fully updated? My last unders... [15:32:24] (03PS1) 10Ayounsi: Prefer Lumen to reach ATT [homer/public] - 10https://gerrit.wikimedia.org/r/1084143 (https://phabricator.wikimedia.org/T377844) [15:34:00] (03CR) 10Alexandros Kosiaris: [C:03+1] Use 0 workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084076 (owner: 10Mvolz) [15:34:16] 06SRE, 10AbuseFilter, 06Data Products, 10Data-Services, and 8 others: Public wiki replicas contain abuse filter logs for filters that are private or protected - https://phabricator.wikimedia.org/T375751#10273204 (10sbassett) [15:34:27] RECOVERY - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is OK: wikitech-static OK - wikitech and wikitech-static in sync (55489 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [15:35:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P70624 and previous config saved to /var/cache/conftool/dbconfig/20241029-153554-ladsgroup.json [15:35:59] 06SRE, 10AbuseFilter, 06Data Products, 10Data-Services, and 8 others: Public wiki replicas contain abuse filter logs for filters that are private or protected - 
https://phabricator.wikimedia.org/T375751#10273332 (10fnegri) [15:37:13] 06SRE, 10AbuseFilter, 06Data Products, 10Data-Services, and 8 others: Public wiki replicas contain abuse filter logs for filters that are private or protected - https://phabricator.wikimedia.org/T375751#10273352 (10sbassett) If it's just the analytics replicas that were (potentially) remaining, I'd classif... [15:38:03] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete stub certs [labs/private] - 10https://gerrit.wikimedia.org/r/1084101 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [15:38:44] 06SRE, 10AbuseFilter, 06Data Products, 10Data-Services, and 8 others: Public wiki replicas contain abuse filter logs for filters that are private or protected - https://phabricator.wikimedia.org/T375751#10273359 (10Dreamy_Jazz) >>! In T375751#10273311, @fnegri wrote: > @Dreamy_Jazz Ben is out this week, I... [15:39:02] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove wikitech stub cert [labs/private] - 10https://gerrit.wikimedia.org/r/1084104 (https://phabricator.wikimedia.org/T371878) (owner: 10Muehlenhoff) [15:42:31] 06SRE, 10AbuseFilter, 06Data Products, 10Data-Services, and 8 others: Public wiki replicas contain abuse filter logs for filters that are private or protected - https://phabricator.wikimedia.org/T375751#10273311 (10fnegri) 05Resolved→03In progress a:05Dreamy_Jazz→03fnegri @Dreamy_Jazz Ben is out th... 
[15:44:06] (03PS9) 10Brouberol: airflow: enable Kerberos auth on the API backend [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) [15:44:14] (03CR) 10Stevemunene: [C:03+2] Adjust spark-history rediness probe delay [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084114 (https://phabricator.wikimedia.org/T378497) (owner: 10Stevemunene) [15:44:59] (03CR) 10CI reject: [V:04-1] airflow: enable Kerberos auth on the API backend [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) (owner: 10Brouberol) [15:44:59] 06SRE, 10AbuseFilter, 06Data Products, 10Data-Services, and 8 others: Public wiki replicas contain abuse filter logs for filters that are private or protected - https://phabricator.wikimedia.org/T375751#10273354 (10fnegri) p:05High→03Medium [15:45:15] (03Merged) 10jenkins-bot: Adjust spark-history rediness probe delay [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084114 (https://phabricator.wikimedia.org/T378497) (owner: 10Stevemunene) [15:46:44] (03PS10) 10Brouberol: airflow: enable Kerberos auth on the API backend [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) [15:46:56] (03PS1) 10Arnaudb: mariadb: productionize db2239 [puppet] - 10https://gerrit.wikimedia.org/r/1084145 (https://phabricator.wikimedia.org/T373579) [15:46:56] (03CR) 10Arnaudb: "This CR is to help and test https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1084132 → we need to identify systemd output " [puppet] - 10https://gerrit.wikimedia.org/r/1084145 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [15:47:53] !log installing libheif security updates [15:47:54] (03CR) 10CI reject: [V:04-1] airflow: enable Kerberos auth on the API backend [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) (owner: 10Brouberol) [15:47:57] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:38] (03PS1) 10Muehlenhoff: Remove obsolete Icinga stub certs [labs/private] - 10https://gerrit.wikimedia.org/r/1084148 [15:48:38] (03PS1) 10Muehlenhoff: Remove obsolete rendering stub certs [labs/private] - 10https://gerrit.wikimedia.org/r/1084149 (https://phabricator.wikimedia.org/T357750) [15:48:40] (03PS1) 10Muehlenhoff: Remove stub certs for ms-fe [labs/private] - 10https://gerrit.wikimedia.org/r/1084150 (https://phabricator.wikimedia.org/T357750) [15:49:10] (03PS11) 10Brouberol: airflow: enable Kerberos auth on the API backend [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) [15:49:32] (03PS12) 10Brouberol: airflow: enable Kerberos auth on the API backend [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) [15:50:20] (03PS1) 10Dreamy Jazz: [GlobalBlocking] Enable global autoblocks on all WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084152 (https://phabricator.wikimedia.org/T377760) [15:50:29] (03CR) 10CI reject: [V:04-1] airflow: enable Kerberos auth on the API backend [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) (owner: 10Brouberol) [15:50:54] (03PS1) 10Muehlenhoff: Add library hint for libheif [puppet] - 10https://gerrit.wikimedia.org/r/1084154 [15:51:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P70625 and previous config saved to /var/cache/conftool/dbconfig/20241029-155101-ladsgroup.json [15:52:59] (03PS13) 10Brouberol: airflow: enable Kerberos auth on the API backend [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) [15:54:08] (03CR) 10CI reject: [V:04-1] airflow: enable Kerberos auth on the API backend [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) (owner: 10Brouberol) [15:54:31] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [15:54:37] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [15:54:57] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [15:55:18] (03PS14) 10Brouberol: airflow: enable Kerberos auth on the API backend [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) [15:55:25] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti2040.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:55:52] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [15:56:15] (03CR) 10Muehlenhoff: [C:03+2] Add library hint for libheif [puppet] - 10https://gerrit.wikimedia.org/r/1084154 (owner: 10Muehlenhoff) [15:56:26] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [15:56:41] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [15:57:01] PROBLEM - Host ganeti2040 is DOWN: PING CRITICAL - Packet loss = 100% [15:57:53] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete Icinga stub certs [labs/private] - 10https://gerrit.wikimedia.org/r/1084148 (owner: 10Muehlenhoff) [15:58:12] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete rendering stub certs [labs/private] - 10https://gerrit.wikimedia.org/r/1084149 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [15:59:51] RECOVERY - Host ganeti2040 is UP: PING OK - Packet loss = 0%, RTA = 33.31 ms [16:00:04] jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241029T1600). [16:00:05] Pppery: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:08] here [16:00:13] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack: Upload redfish licenses to supermicro hosts - https://phabricator.wikimedia.org/T376121#10273476 (10elukey) [16:00:36] Pppery: hi! let's do it :) [16:00:40] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2040.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:00:45] I assume if I deploy to mwdebug you can test there? [16:00:58] I think so [16:01:04] This is my first time doing anything with puppet, though [16:01:17] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:01:17] but I do have the WikimediaDebug extension installed from config deploys [16:01:17] no worries, I've got the puppet part [16:01:21] do you have the-- okay perfect [16:01:28] same thing on your end [16:01:31] FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:02:32] (03CR) 10RLazarus: [C:03+2] Remove als redirects [puppet] - 10https://gerrit.wikimedia.org/r/1079056 (https://phabricator.wikimedia.org/T376923) (owner: 10Pppery) [16:02:38] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:02:47] Pppery: merging now, it'll be ready to test in a moment [16:02:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): an-worker1165: Broken RAM - https://phabricator.wikimedia.org/T378454#10273484 (10Gehel) 
p:05Triage→03High [16:03:13] (03PS2) 10Scott French: Add JobQueueLowTrafficProcessingRateTooHigh alert [alerts] - 10https://gerrit.wikimedia.org/r/1083904 [16:03:19] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [16:05:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2041.codfw.wmnet [16:05:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2041.codfw.wmnet [16:06:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T376905)', diff saved to https://phabricator.wikimedia.org/P70626 and previous config saved to /var/cache/conftool/dbconfig/20241029-160607-ladsgroup.json [16:06:15] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [16:06:29] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [16:06:32] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:07:56] Is this ready to test yet? 
[16:08:29] still getting there [16:08:54] (if you're curious, it's merged at the puppetserver and puppet is now running on the deploy host, then it'll start rolling out with helmfile) [16:09:07] oop second stage just finished, here comes scap [16:09:35] I know some of what you are talking about, but the amount of time proved longer than my definition of "moment" [16:10:43] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [16:10:49] haha [16:10:57] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [16:11:02] some people get annoyed when this whole apparatus takes time to run, but I think it just makes a big deploy feel more momentous [16:11:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T376905)', diff saved to https://phabricator.wikimedia.org/P70627 and previous config saved to /var/cache/conftool/dbconfig/20241029-161103-ladsgroup.json [16:11:08] okay diffs look good, rolling to the testservers [16:11:10] !log rzl@deploy2002 Started scap sync-world: 1079056 T376923 [16:11:33] I wouldn't be annoyed per se, just it took longer than your comments suggested it would take [16:13:17] !log rzl@deploy2002 rzl: 1079056 T376923 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:13:22] T376923: Setup missing.php layer redirects for wikipedia hosting the other projects too - https://phabricator.wikimedia.org/T376923 [16:13:24] testing [16:13:34] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti2044.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:15:02] Seems to work [16:15:22] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:15:56] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [16:16:36] Pppery: proceeding, 
thanks [16:16:38] !log rzl@deploy2002 rzl: Continuing with sync [16:18:49] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2044.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:19:54] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti2041.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:21:08] (03CR) 10Ladsgroup: tables-catalog: Add translate_message_group_subscriptions table (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1082549 (https://phabricator.wikimedia.org/T372287) (owner: 10Abijeet Patro) [16:21:19] !log rzl@deploy2002 Finished scap sync-world: 1079056 T376923 (duration: 11m 47s) [16:21:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T376905)', diff saved to https://phabricator.wikimedia.org/P70629 and previous config saved to /var/cache/conftool/dbconfig/20241029-162136-ladsgroup.json [16:22:03] PROBLEM - Host ganeti2041 is DOWN: PING CRITICAL - Packet loss = 100% [16:22:06] T376923: Setup missing.php layer redirects for wikipedia hosting the other projects too - https://phabricator.wikimedia.org/T376923 [16:22:41] (03CR) 10Ladsgroup: tables-catalog: Add translate_cache table (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1082546 (https://phabricator.wikimedia.org/T370265) (owner: 10Abijeet Patro) [16:23:06] Pppery: synced everywhere, still look good with the debug extension turned off? [16:23:08] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack: Upload redfish licenses to supermicro hosts - https://phabricator.wikimedia.org/T376121#10273631 (10elukey) [16:23:45] yep, or at least the one part I tested does [16:24:05] RECOVERY - Host ganeti2041 is UP: PING OK - Packet loss = 0%, RTA = 33.35 ms [16:24:20] Thanks [16:24:28] thank you! 
[16:24:43] puppet window's done ✅ [16:25:09] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2041.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:25:19] rzl: I have another patch wait! [16:25:21] * elukey runs away [16:25:35] elukey: for you I have all day [16:25:44] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1084045 (https://phabricator.wikimedia.org/T377728) (owner: 10Slyngshede) [16:26:03] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:26:16] rzl: <3 [16:26:17] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:26:31] FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:26:39] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1039.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:30:57] (03PS3) 10Majavah: dynamicproxy: Allow creating proxy at zone apex [puppet] - 10https://gerrit.wikimedia.org/r/1083862 (https://phabricator.wikimedia.org/T342398) [16:30:57] (03PS2) 10Majavah: dynamicproxy: Allow zones not managed in Designate [puppet] - 10https://gerrit.wikimedia.org/r/1083868 (https://phabricator.wikimedia.org/T342398) [16:30:58] (03PS1) 10Majavah: openstack: wmcs-webproxy: Support arbitrary domains [puppet] - 10https://gerrit.wikimedia.org/r/1084167 [16:31:54] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1039.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:31:57] (03CR) 10CI reject: [V:04-1] dynamicproxy: Allow zones not managed in Designate [puppet] - 
10https://gerrit.wikimedia.org/r/1083868 (https://phabricator.wikimedia.org/T342398) (owner: 10Majavah) [16:32:15] (03CR) 10Majavah: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1083868 (https://phabricator.wikimedia.org/T342398) (owner: 10Majavah) [16:35:43] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1040.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:36:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P70630 and previous config saved to /var/cache/conftool/dbconfig/20241029-163643-ladsgroup.json [16:39:37] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4427/co" [puppet] - 10https://gerrit.wikimedia.org/r/1075615 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [16:39:53] 06SRE, 10SRE-Access-Requests: Requesting access to the analytics cluster for CDobbins - https://phabricator.wikimedia.org/T378517 (10CDobbins) 03NEW [16:40:02] (03PS1) 10Volans: orchestrator: do not retry on 500s [software/spicerack] - 10https://gerrit.wikimedia.org/r/1084170 [16:40:02] (03PS1) 10Volans: mysql_legacy: accept any exit code for status [software/spicerack] - 10https://gerrit.wikimedia.org/r/1084171 [16:40:58] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1040.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:41:25] RECOVERY - Debian mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/debian is over 0 hours old. 
https://wikitech.wikimedia.org/wiki/Mirrors [16:42:10] (03CR) 10Majavah: [C:03+2] openstack: wmcs-webproxy: Support arbitrary domains [puppet] - 10https://gerrit.wikimedia.org/r/1084167 (owner: 10Majavah) [16:42:15] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:45:16] (03PS4) 10Majavah: dynamicproxy: Allow creating proxy at zone apex [puppet] - 10https://gerrit.wikimedia.org/r/1083862 (https://phabricator.wikimedia.org/T342398) [16:45:17] (03PS3) 10Majavah: dynamicproxy: Allow zones not managed in Designate [puppet] - 10https://gerrit.wikimedia.org/r/1083868 (https://phabricator.wikimedia.org/T342398) [16:46:56] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4429/co" [puppet] - 10https://gerrit.wikimedia.org/r/1083868 (https://phabricator.wikimedia.org/T342398) (owner: 10Majavah) [16:47:09] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1016.eqiad.wmnet with OS bullseye [16:47:30] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:47:55] (03PS5) 10Majavah: dynamicproxy: Allow creating proxy at zone apex [puppet] - 10https://gerrit.wikimedia.org/r/1083862 (https://phabricator.wikimedia.org/T342398) [16:47:55] (03PS4) 10Majavah: dynamicproxy: Allow zones not managed in Designate [puppet] - 10https://gerrit.wikimedia.org/r/1083868 (https://phabricator.wikimedia.org/T342398) [16:48:39] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4430/co" [puppet] - 10https://gerrit.wikimedia.org/r/1083868 (https://phabricator.wikimedia.org/T342398) (owner: 10Majavah) [16:49:09] 06SRE, 10SRE-Access-Requests: 
Requesting access to 'deployment' for 'Joely Rooke WMDE' - https://phabricator.wikimedia.org/T378082#10273746 (10thcipriani) >>! In T378082#10271688, @tappof wrote: > @thcipriani your approval is needed for the deployment group. > Thanks. Thanks! > Reason for access: I will be w... [16:49:22] (03CR) 10Majavah: [C:03+2] dynamicproxy: Allow creating proxy at zone apex [puppet] - 10https://gerrit.wikimedia.org/r/1083862 (https://phabricator.wikimedia.org/T342398) (owner: 10Majavah) [16:49:32] (03CR) 10Majavah: [V:03+1 C:03+2] dynamicproxy: Allow zones not managed in Designate [puppet] - 10https://gerrit.wikimedia.org/r/1083868 (https://phabricator.wikimedia.org/T342398) (owner: 10Majavah) [16:49:51] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1042.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:50:57] (03PS23) 10Fabfur: haproxykafka: haproxykafka module [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) [16:50:57] (03PS32) 10Fabfur: haproxykafka: profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) [16:51:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P70631 and previous config saved to /var/cache/conftool/dbconfig/20241029-165150-ladsgroup.json [16:51:52] (03CR) 10CI reject: [V:04-1] haproxykafka: profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [16:52:34] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10273775 (10wiki_willy) Hi @jcrespo - thanks for your feedback on this. My apologies that these Config J servers have been causing a lot of headaches.... 
[16:53:28] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [16:54:19] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [16:54:59] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [16:55:07] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1042.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:55:43] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [16:56:14] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [16:57:01] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [16:58:04] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [16:58:52] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [16:59:49] (03PS1) 10MVernon: Scrape the cephadm cluster endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1084174 (https://phabricator.wikimedia.org/T279621) [17:00:05] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084174 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241029T1700) [17:00:34] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1043.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [17:00:38] (03CR) 10Ladsgroup: [C:03+1] mariadb: productionize db2239 [puppet] - 10https://gerrit.wikimedia.org/r/1084145 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [17:00:51] (03PS1) 10Majavah: hieradata: Bump horizon in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1084175 [17:01:19] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack: Upload redfish licenses to 
supermicro hosts - https://phabricator.wikimedia.org/T376121#10273818 (10elukey) [17:01:45] (03CR) 10Majavah: [C:03+2] hieradata: Bump horizon in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1084175 (owner: 10Majavah) [17:02:01] 06SRE, 10SRE-Access-Requests: Requesting access to the analytics cluster for CDobbins - https://phabricator.wikimedia.org/T378517#10273819 (10ssingh) [17:03:51] (03CR) 10Ladsgroup: "It'd be easier to review and do them if we do them one by one to make sure we avoid mistakes." [puppet] - 10https://gerrit.wikimedia.org/r/1083758 (https://phabricator.wikimedia.org/T378143) (owner: 10Arnaudb) [17:04:09] (03PS24) 10Fabfur: haproxykafka: haproxykafka module [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) [17:04:09] (03PS33) 10Fabfur: haproxykafka: profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) [17:05:06] (03CR) 10CI reject: [V:04-1] haproxykafka: profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [17:05:17] (03CR) 10Hnowlan: [C:04-1] "This approach will not work - from a config syntax perspective it's entirely correct to state upstream_connection_options on the cluster f" [puppet] - 10https://gerrit.wikimedia.org/r/1083207 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [17:05:49] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1043.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [17:06:46] (03CR) 10Elukey: [C:03+1] orchestrator: do not retry on 500s [software/spicerack] - 10https://gerrit.wikimedia.org/r/1084170 (owner: 10Volans) [17:06:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T376905)', diff saved to https://phabricator.wikimedia.org/P70632 and previous config saved to 
/var/cache/conftool/dbconfig/20241029-170657-ladsgroup.json [17:07:04] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance [17:07:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance [17:07:28] (03PS34) 10Fabfur: haproxykafka: profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) [17:08:10] (03CR) 10BCornwall: [C:03+2] librenms: Remove rsa-2048 certs from Apache config [puppet] - 10https://gerrit.wikimedia.org/r/1075616 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [17:08:25] (03CR) 10BCornwall: [V:03+1 C:03+2] icinga: Remove external monitoring rsa-2048 certs [puppet] - 10https://gerrit.wikimedia.org/r/1075615 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [17:08:58] (03CR) 10Elukey: [C:03+1] mysql_legacy: accept any exit code for status [software/spicerack] - 10https://gerrit.wikimedia.org/r/1084171 (owner: 10Volans) [17:09:32] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10273829 (10Ladsgroup) >>! In T378143#10266787, @ABran-WMF wrote: > I've tried to reproduce what's been done in T355269 which is quite close to what we'... 
[17:11:37] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [17:12:38] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [17:12:52] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [17:12:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T376905)', diff saved to https://phabricator.wikimedia.org/P70633 and previous config saved to /var/cache/conftool/dbconfig/20241029-171258-ladsgroup.json [17:14:24] (03PS1) 10Majavah: hieradata: Bump codfw1dev horizon to 2024-10-29-170800 [puppet] - 10https://gerrit.wikimedia.org/r/1084178 [17:15:12] (03CR) 10Majavah: [C:03+2] hieradata: Bump codfw1dev horizon to 2024-10-29-170800 [puppet] - 10https://gerrit.wikimedia.org/r/1084178 (owner: 10Majavah) [17:17:53] (03CR) 10Hnowlan: [V:03+2 C:03+2] "Never mind - build_envoy_config.py will pick up this change, I misunderstood." [puppet] - 10https://gerrit.wikimedia.org/r/1083207 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [17:17:59] !log Running `foreachwiki userOptions.php --delete --old=A --old=D --old=C --old=null --old=imagerecommendation --old=linkrecommendation growthexperiments-homepage-variant` [17:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:42] (03CR) 10Ladsgroup: mariadb: pii cleaner cookbook (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146) (owner: 10Arnaudb) [17:21:57] (03CR) 10Majavah: [C:03+2] hieradata: Update eqiad1 Horizon in 2024-10-29-170800 [puppet] - 10https://gerrit.wikimedia.org/r/1084180 (owner: 10Majavah) [17:22:06] (03CR) 10Jforrester: "Dupe of I13b1354538495fe5a6df562958fd1c9f960e5df4. Shouldn't need to wait for the config to land first?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083305 (https://phabricator.wikimedia.org/T371592) (owner: 10Majavah) [17:22:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T376905)', diff saved to https://phabricator.wikimedia.org/P70634 and previous config saved to /var/cache/conftool/dbconfig/20241029-172228-ladsgroup.json [17:22:30] (03PS1) 10Sergio Gimeno: HomepageHooks: do not store assigned variant on account creation [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084181 (https://phabricator.wikimedia.org/T377713) [17:22:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081099 (https://phabricator.wikimedia.org/T377233) (owner: 10Sergio Gimeno) [17:29:06] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply [17:29:55] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [17:30:26] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [17:30:44] 06SRE, 10SRE-Access-Requests: Requesting access to the analytics cluster for CDobbins - https://phabricator.wikimedia.org/T378517#10273943 (10KOfori) This is approved. [17:30:52] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [17:31:25] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [17:32:01] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [17:32:18] (03CR) 10Ladsgroup: "Something to note for the expansion: Depool part can't be used currently on any other automation. 
We need to build a way to make sure it's" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T377738) (owner: 10Volans) [17:33:09] (03PS25) 10Fabfur: haproxykafka: haproxykafka module [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) [17:34:10] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [17:35:47] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10273979 (10Dwisehaupt) @cmooney @Jclark-ctr Got confirmation that the date shift is good. We are all set to do the network upd... [17:35:48] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: ProbeDown (instance centrallog2002:6514) - https://phabricator.wikimedia.org/T377703#10273981 (10herron) [17:35:48] 07sre-alert-triage, 06SRE Observability: Alert in need of triage: ProbeDown (instance centrallog1002:6514) - https://phabricator.wikimedia.org/T359293#10273980 (10herron) [17:36:44] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [17:37:12] (03PS35) 10Fabfur: haproxykafka: profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) [17:37:19] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [17:37:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P70635 and previous config saved to /var/cache/conftool/dbconfig/20241029-173735-ladsgroup.json [17:37:40] (03PS1) 10Kosta Harlan: QuickSurveys: Undeploy safety survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084183 (https://phabricator.wikimedia.org/T376517) [17:38:23] (03CR) 
10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084183 (https://phabricator.wikimedia.org/T376517) (owner: 10Kosta Harlan) [17:39:00] 07sre-alert-triage, 06SRE Observability: Alert in need of triage: ProbeDown (instance centrallog1002:6514) - https://phabricator.wikimedia.org/T359293#10273990 (10herron) Looked into this a bit since the silence expired, the service being probed is up but looks like the related prometheus blackbox exporter exp... [17:41:31] (03CR) 10CDobbins: [C:03+2] prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [17:42:41] (03CR) 10CI reject: [V:04-1] HomepageHooks: do not store assigned variant on account creation [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084181 (https://phabricator.wikimedia.org/T377713) (owner: 10Sergio Gimeno) [17:42:43] 06SRE, 10AbuseFilter, 06Data Products, 10Data-Services, and 8 others: Public wiki replicas contain abuse filter logs for filters that are private or protected - https://phabricator.wikimedia.org/T375751#10274025 (10kostajh) >>! In T375751#10273261, @Dreamy_Jazz wrote: > Are we sure that the replicas have b... 
[17:43:04] (03PS1) 10Umherirrender: build: Suppress phan issue with null for Message::numParams [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084185 [17:43:37] (03PS2) 10Sergio Gimeno: HomepageHooks: do not store assigned variant on account creation [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084181 (https://phabricator.wikimedia.org/T377713) [17:44:10] 06SRE, 10SRE-Access-Requests: Requesting access to the analytics cluster for CDobbins - https://phabricator.wikimedia.org/T378517#10274047 (10ssingh) [17:44:13] (03PS26) 10Fabfur: haproxykafka: haproxykafka module [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) [17:45:21] (03PS36) 10Fabfur: haproxykafka: profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) [17:45:22] 06SRE, 10SRE-Access-Requests: Requesting access to the analytics cluster for CDobbins - https://phabricator.wikimedia.org/T378517#10274057 (10ssingh) Kerberos required. [17:45:35] (03CR) 10Umherirrender: "Noop from production view, but this was not merged before the train cut due to unrelated CI error and breaks CI for this branch" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084185 (owner: 10Umherirrender) [17:45:54] 06SRE, 10SRE-Access-Requests: Requesting access to the analytics cluster for CDobbins - https://phabricator.wikimedia.org/T378517#10274056 (10ssingh) @Ottomata: requires your approval please and thank you! 
[17:46:02] (03PS2) 10BCornwall: librenms: Remove rsa-2048 certs from Apache config [puppet] - 10https://gerrit.wikimedia.org/r/1075616 (https://phabricator.wikimedia.org/T375569) [17:46:11] (03PS2) 10BCornwall: icinga: Remove external monitoring rsa-2048 certs [puppet] - 10https://gerrit.wikimedia.org/r/1075615 (https://phabricator.wikimedia.org/T375569) [17:46:17] (03CR) 10Ssingh: [C:03+1] haproxykafka: haproxykafka module [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [17:46:47] (03CR) 10Scott French: [C:03+2] shellbox-syntaxhighlight: upgrade to 2024-10-15-214239 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082318 (https://phabricator.wikimedia.org/T375243) (owner: 10Scott French) [17:46:51] (03CR) 10BCornwall: [V:03+2 C:03+2] librenms: Remove rsa-2048 certs from Apache config [puppet] - 10https://gerrit.wikimedia.org/r/1075616 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [17:46:58] (03CR) 10BCornwall: [V:03+2 C:03+2] icinga: Remove external monitoring rsa-2048 certs [puppet] - 10https://gerrit.wikimedia.org/r/1075615 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [17:47:40] (03CR) 10CDobbins: [C:03+2] "This was just merged. What are the next steps needed to ensure this becomes an actual metric in Prometheus?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [17:47:53] (03CR) 10Michael Große: [C:03+1] build: Suppress phan issue with null for Message::numParams [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084185 (owner: 10Umherirrender) [17:47:53] (03Merged) 10jenkins-bot: shellbox-syntaxhighlight: upgrade to 2024-10-15-214239 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082318 (https://phabricator.wikimedia.org/T375243) (owner: 10Scott French) [17:48:56] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [17:49:31] !log Remove RSA cert support from Icinga, librenms (T375569) [17:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:47] T375569: Remove RSA certificates from puppet - https://phabricator.wikimedia.org/T375569 [17:50:51] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:50:56] (03PS2) 10BCornwall: dynamicproxy: Remove rsa-2048 certs [puppet] - 10https://gerrit.wikimedia.org/r/1075606 (https://phabricator.wikimedia.org/T375569) [17:51:44] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:52:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P70636 and previous config saved to /var/cache/conftool/dbconfig/20241029-175243-ladsgroup.json [17:53:05] (03CR) 10Majavah: [C:03+2] dynamicproxy: Remove rsa-2048 certs [puppet] - 10https://gerrit.wikimedia.org/r/1075606 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [17:53:34] 06SRE, 10SRE-Access-Requests: Requesting access to the analytics cluster for CDobbins - https://phabricator.wikimedia.org/T378517#10274119 (10Ottomata) Approved [17:54:05] (03CR) 
10Ottomata: admin - explicit approval not needed for analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1082826 (https://phabricator.wikimedia.org/T370424) (owner: 10Ottomata) [17:54:52] (03CR) 10Urbanecm: [C:03+1] build: Suppress phan issue with null for Message::numParams [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084185 (owner: 10Umherirrender) [17:55:11] (03PS3) 10Ottomata: admin - explicit approval not needed for analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1082826 (https://phabricator.wikimedia.org/T370424) [17:55:31] (03CR) 10Ottomata: admin - explicit approval not needed for analytics-privatedata-users (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1082826 (https://phabricator.wikimedia.org/T370424) (owner: 10Ottomata) [17:56:09] (03CR) 10Ottomata: [C:03+2] admin - explicit approval not needed for analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1082826 (https://phabricator.wikimedia.org/T370424) (owner: 10Ottomata) [17:58:48] 06SRE, 06Data-Platform-SRE, 10Data-Engineering (Q2 2024 October 1st - December 31th), 13Patch-For-Review: Streamline Data Platform access approvals for WMF staff - https://phabricator.wikimedia.org/T370424#10274131 (10Ottomata) Merged! [18:00:05] dduvall and dancy: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241029T1800). 
[18:00:52] (03CR) 10Ssingh: [C:03+1] haproxykafka: profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [18:01:14] 06SRE, 10SRE-Access-Requests: Requesting access to the analytics cluster for CDobbins - https://phabricator.wikimedia.org/T378517#10274146 (10ssingh) [18:01:15] o/ [18:01:38] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084190 (https://phabricator.wikimedia.org/T375660) [18:01:40] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084190 (https://phabricator.wikimedia.org/T375660) (owner: 10TrainBranchBot) [18:02:26] (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084190 (https://phabricator.wikimedia.org/T375660) (owner: 10TrainBranchBot) [18:03:25] (03CR) 10Dzahn: [C:03+2] durum: include throttling class, enable it on durum2001, accept/log only [puppet] - 10https://gerrit.wikimedia.org/r/1059156 (owner: 10Dzahn) [18:03:38] (03PS6) 10Fabfur: haproxy: add ring support to haproxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) [18:05:09] (03PS7) 10Fabfur: haproxy: add ring support to haproxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) [18:07:42] (03CR) 10Cwhite: Scrape the cephadm cluster endpoint (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1084174 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [18:07:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T376905)', diff saved to https://phabricator.wikimedia.org/P70637 and previous config saved to /var/cache/conftool/dbconfig/20241029-180750-ladsgroup.json [18:07:56] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 
db2170.codfw.wmnet with reason: Maintenance [18:08:09] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [18:08:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T376905)', diff saved to https://phabricator.wikimedia.org/P70638 and previous config saved to /var/cache/conftool/dbconfig/20241029-180816-ladsgroup.json [18:08:58] (03CR) 10CI reject: [V:04-1] build: Suppress phan issue with null for Message::numParams [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084185 (owner: 10Umherirrender) [18:09:27] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [18:09:32] (03CR) 10CI reject: [V:04-1] HomepageHooks: do not store assigned variant on account creation [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084181 (https://phabricator.wikimedia.org/T377713) (owner: 10Sergio Gimeno) [18:10:20] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [18:10:58] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.1 refs T375660 [18:11:25] T375660: 1.44.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T375660 [18:11:56] (03PS1) 10DCausse: wdqs: add graph_name in query logs [puppet] - 10https://gerrit.wikimedia.org/r/1084193 (https://phabricator.wikimedia.org/T376134) [18:12:34] (03CR) 10CI reject: [V:04-1] wdqs: add graph_name in query logs [puppet] - 10https://gerrit.wikimedia.org/r/1084193 (https://phabricator.wikimedia.org/T376134) (owner: 10DCausse) [18:15:23] (03CR) 10Umherirrender: "recheck" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084185 (owner: 10Umherirrender) [18:15:38] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - 
https://phabricator.wikimedia.org/T376737#10274253 (10RobH) @papaul: As the point of contact between DC Ops and #netops, did you want to handle the router rules/ACLs to allow us to reinstall all o... [18:18:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T376905)', diff saved to https://phabricator.wikimedia.org/P70639 and previous config saved to /var/cache/conftool/dbconfig/20241029-181838-ladsgroup.json [18:22:56] (03CR) 10Volans: [C:03+2] "What checks do you currently do?" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T377738) (owner: 10Volans) [18:23:56] (03PS1) 10Dzahn: durum: ensure nftables DENY sets are present (does not mean drop) [puppet] - 10https://gerrit.wikimedia.org/r/1084196 [18:26:09] (03CR) 10Ssingh: [C:03+1] durum: ensure nftables DENY sets are present (does not mean drop) [puppet] - 10https://gerrit.wikimedia.org/r/1084196 (owner: 10Dzahn) [18:27:05] (03CR) 10Dzahn: [C:03+2] durum: ensure nftables DENY sets are present (does not mean drop) [puppet] - 10https://gerrit.wikimedia.org/r/1084196 (owner: 10Dzahn) [18:27:17] (03CR) 10Volans: mariadb: pii cleaner cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146) (owner: 10Arnaudb) [18:27:32] (03PS1) 10Hnowlan: TimedMediaHandler: use shellbox globally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084200 (https://phabricator.wikimedia.org/T357309) [18:29:41] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [18:31:37] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [18:32:11] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [18:32:23] (03CR) 10TAndic: [C:03+1] "Looks correct on my end :)" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1084183 (https://phabricator.wikimedia.org/T376517) (owner: 10Kosta Harlan) [18:33:29] (03PS2) 10Herron: prometheus-blackbox-exporter: override default user [puppet] - 10https://gerrit.wikimedia.org/r/1084199 (https://phabricator.wikimedia.org/T359293) [18:33:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P70640 and previous config saved to /var/cache/conftool/dbconfig/20241029-183345-ladsgroup.json [18:37:44] !log shellbox-syntaxhighlight updated to shellbox 2024-10-15-214239 - T375243 [18:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:49] T375243: Turn up PHP 8.1 Shellbox deployments - https://phabricator.wikimedia.org/T375243 [18:40:20] Reedy: thanks for triaging https://phabricator.wikimedia.org/T378531 ! [18:41:41] 06SRE, 06Data-Platform-SRE, 10Data-Engineering (Q2 2024 October 1st - December 31th): Streamline Data Platform access approvals for WMF staff - https://phabricator.wikimedia.org/T370424#10274381 (10Ottomata) Also updated docs here: https://wikitech.wikimedia.org/w/index.php?title=SRE%2FClinic_Duty%2FAccess_r... [18:44:16] (03PS1) 10Dzahn: durum: nftables: set max_connections to 25 [puppet] - 10https://gerrit.wikimedia.org/r/1084207 [18:45:41] (03CR) 10Ssingh: [C:03+1] "Looks good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1084207 (owner: 10Dzahn) [18:46:09] (03CR) 10Ssingh: [C:03+1] "I am OK with the duration I think and we can always revisit this, so no big deal." 
[puppet] - 10https://gerrit.wikimedia.org/r/1084207 (owner: 10Dzahn) [18:48:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P70641 and previous config saved to /var/cache/conftool/dbconfig/20241029-184852-ladsgroup.json [18:49:36] (03CR) 10Dzahn: [C:03+2] durum: nftables: set max_connections to 25 [puppet] - 10https://gerrit.wikimedia.org/r/1084207 (owner: 10Dzahn) [18:49:51] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): an-worker1165: Broken RAM - https://phabricator.wikimedia.org/T378454#10274408 (10bking) DC Ops, this host is hard down, feel free to replace RAM or take any other actions to restore it to working condition at your convenience (th... [18:51:04] (03CR) 10Legoktm: "I know in T370837 there are statistics about RSA certs being barely used, does that also apply to mail traffic (exim)? If not, should we b" [puppet] - 10https://gerrit.wikimedia.org/r/1075604 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [18:55:31] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): an-worker1165: Broken RAM - https://phabricator.wikimedia.org/T378454#10274429 (10bking) a:03VRiley-WMF [19:00:40] (03CR) 10MVernon: Scrape the cephadm cluster endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1084174 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [19:02:09] (03PS2) 10MVernon: Scrape the cephadm cluster endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1084174 (https://phabricator.wikimedia.org/T279621) [19:02:29] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084174 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [19:03:20] (03CR) 10MVernon: "Thanks, fixed in v2." 
[puppet] - 10https://gerrit.wikimedia.org/r/1084174 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [19:04:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T376905)', diff saved to https://phabricator.wikimedia.org/P70642 and previous config saved to /var/cache/conftool/dbconfig/20241029-190359-ladsgroup.json [19:04:06] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [19:04:20] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [19:04:21] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [19:04:35] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [19:04:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T376905)', diff saved to https://phabricator.wikimedia.org/P70643 and previous config saved to /var/cache/conftool/dbconfig/20241029-190442-ladsgroup.json [19:04:58] (03CR) 10Umherirrender: "The last failure is via GlobalPreferences, that extension is a dependency since some hours blocking this patch set - Idd26d441e240fe58886" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084185 (owner: 10Umherirrender) [19:08:52] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1084199 (https://phabricator.wikimedia.org/T359293) (owner: 10Herron) [19:13:12] (03CR) 10Hamish: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932376 (https://phabricator.wikimedia.org/T340134) (owner: 10Isabelle Hurbain-Palatin) [19:13:31] (03PS2) 10DCausse: wdqs: add graph_name in query logs [puppet] - 10https://gerrit.wikimedia.org/r/1084193 (https://phabricator.wikimedia.org/T376134) [19:14:10] (03CR) 10CI reject: [V:04-1] wdqs: add graph_name in query logs [puppet] - 10https://gerrit.wikimedia.org/r/1084193 (https://phabricator.wikimedia.org/T376134) (owner: 10DCausse) [19:15:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T376905)', diff saved to https://phabricator.wikimedia.org/P70644 and previous config saved to /var/cache/conftool/dbconfig/20241029-191508-ladsgroup.json [19:16:58] (03CR) 10Ladsgroup: "from show processlist, there shouldn't be anything with wikiuser/wikiadmin users (the users change, so the check should be wikiuser in the" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T377738) (owner: 10Volans) [19:20:52] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): an-worker1165: Broken RAM - https://phabricator.wikimedia.org/T378454#10274502 (10VRiley-WMF) Currently, Dell has this open. We should receive it tomorrow. Service request number: 200116199 Work order number: 455501551 Replacemen... 
[19:22:28] (03PS3) 10DCausse: wdqs: add graph_name in query logs [puppet] - 10https://gerrit.wikimedia.org/r/1084193 (https://phabricator.wikimedia.org/T376134) [19:23:05] (03CR) 10CI reject: [V:04-1] wdqs: add graph_name in query logs [puppet] - 10https://gerrit.wikimedia.org/r/1084193 (https://phabricator.wikimedia.org/T376134) (owner: 10DCausse) [19:24:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082875 (owner: 10Pppery) [19:25:48] (03PS1) 10CDobbins: admin: add cdobbins to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1084218 (https://phabricator.wikimedia.org/T378517) [19:26:13] 06SRE, 10Wikimedia-Mailing-lists: Create a mail address for Russian Wikipedia oversighters - https://phabricator.wikimedia.org/T378069#10274524 (10Ladsgroup) 05Stalled→03Resolved a:03Ladsgroup https://lists.wikimedia.org/postorius/lists/wikipedia-ru-oversighters.lists.wikimedia.org I got the email o... 
[19:28:50] (CR) Ssingh: admin: add cdobbins to analytics-privatedata-users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1084218 (https://phabricator.wikimedia.org/T378517) (owner: CDobbins)
[19:30:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P70645 and previous config saved to /var/cache/conftool/dbconfig/20241029-193015-ladsgroup.json
[19:32:53] (PS2) CDobbins: admin: add cdobbins to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1084218 (https://phabricator.wikimedia.org/T378517)
[19:34:18] (CR) CDobbins: admin: add cdobbins to analytics-privatedata-users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1084218 (https://phabricator.wikimedia.org/T378517) (owner: CDobbins)
[19:36:02] (CR) Cyndywikime: [C:+1] Growth [test2wiki]: enable community updates module [mediawiki-config] - https://gerrit.wikimedia.org/r/1084036 (https://phabricator.wikimedia.org/T376952) (owner: Sergio Gimeno)
[19:45:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P70646 and previous config saved to /var/cache/conftool/dbconfig/20241029-194522-ladsgroup.json
[19:45:43] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10274577 (Jclark-ctr) @cmooney fyi i have 10x of the 100g green handled optics
[19:55:45] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 6 days, 0:00:00 on an-worker1165.eqiad.wmnet with reason: T378454
[19:55:50] T378454: an-worker1165: Broken RAM - https://phabricator.wikimedia.org/T378454
[19:56:01] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on an-worker1165.eqiad.wmnet with reason: T378454
[20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241029T2000).
[20:00:05] kostajh and Pppery: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:08] here
[20:00:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T376905)', diff saved to https://phabricator.wikimedia.org/P70647 and previous config saved to /var/cache/conftool/dbconfig/20241029-200029-ladsgroup.json
[20:00:36] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance
[20:00:38] greetings
[20:00:50] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance
[20:00:52] I can deploy
[20:00:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T376905)', diff saved to https://phabricator.wikimedia.org/P70648 and previous config saved to /var/cache/conftool/dbconfig/20241029-200056-ladsgroup.json
[20:01:16] * cjming thanks kostajh
[20:02:13] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1084183 (https://phabricator.wikimedia.org/T376517) (owner: Kosta Harlan)
[20:02:13] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1082875 (owner: Pppery)
[20:02:57] (Merged) jenkins-bot: QuickSurveys: Undeploy safety survey [mediawiki-config] - https://gerrit.wikimedia.org/r/1084183 (https://phabricator.wikimedia.org/T376517) (owner: Kosta Harlan)
[20:03:00] (Merged) jenkins-bot: Missing.php: redirect wikisources to localized main page [mediawiki-config] - https://gerrit.wikimedia.org/r/1082875 (owner: Pppery)
[20:03:29] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1084183|QuickSurveys: Undeploy safety survey (T376517)]], [[gerrit:1082875|Missing.php: redirect wikisources to localized main page]]
[20:03:42] T376517: First test, then launch the new Safety Survey - https://phabricator.wikimedia.org/T376517
[20:05:55] !log kharlan@deploy2002 pppery, kharlan: Backport for [[gerrit:1084183|QuickSurveys: Undeploy safety survey (T376517)]], [[gerrit:1082875|Missing.php: redirect wikisources to localized main page]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:06:42] Pppery: can you test your change please?
[20:06:48] was already testing
[20:07:03] Seems to work as intended
[20:07:10] cool
[20:08:02] !log kharlan@deploy2002 pppery, kharlan: Continuing with sync
[20:11:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T376905)', diff saved to https://phabricator.wikimedia.org/P70649 and previous config saved to /var/cache/conftool/dbconfig/20241029-201131-ladsgroup.json
[20:11:42] (CR) Pppery: "General tip: You can probably get these puppet patches merged by listing them for a Puppet Requests window. See documentation at https://w" [puppet] - https://gerrit.wikimedia.org/r/527917 (https://phabricator.wikimedia.org/T17988) (owner: Fomafix)
[20:12:45] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1084183|QuickSurveys: Undeploy safety survey (T376517)]], [[gerrit:1082875|Missing.php: redirect wikisources to localized main page]] (duration: 09m 16s)
[20:12:50] T376517: First test, then launch the new Safety Survey - https://phabricator.wikimedia.org/T376517
[20:13:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[20:14:06] all done
[20:14:17] thanks
[20:14:25] !log UTC late deploys done
[20:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:07] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[20:20:13] ops-magru, SRE, Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10274656 (Papaul) @RobH yes i can take care of that
[20:22:06] (CR) Cwhite: [C:-1] "I'm not confident it's best to make blackbox-exporter run as root. Is it impossible to make the tls public key something the prometheus u" [puppet] - https://gerrit.wikimedia.org/r/1084199 (https://phabricator.wikimedia.org/T359293) (owner: Herron)
[20:26:31] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:26:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P70650 and previous config saved to /var/cache/conftool/dbconfig/20241029-202638-ladsgroup.json
[20:28:18] (CR) Cwhite: [C:+2] Scrape the cephadm cluster endpoint [puppet] - https://gerrit.wikimedia.org/r/1084174 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[20:41:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P70651 and previous config saved to /var/cache/conftool/dbconfig/20241029-204145-ladsgroup.json
[20:42:16] FIRING: JobUnavailable: Reduced availability for job cephadm in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:56:46] SRE-swift-storage, Patch-For-Review: Set up new S3-level replicated storage cluster "apus" - https://phabricator.wikimedia.org/T279621#10274769 (colewhite) @MatthewVernon cephadm clusters are now being scraped, however the ones in codfw (moss-be200[123]) don't appear to have anything listening to port 9283
[20:56:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T376905)', diff saved to https://phabricator.wikimedia.org/P70652 and previous config saved to /var/cache/conftool/dbconfig/20241029-205652-ladsgroup.json
[20:56:59] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance
[20:57:12] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance
[20:57:13] (CR) Jcrespo: "As long as you don't touch existing hosts (which would require more operations for depooling, and some dependencies due to upgrades), sett" [puppet] - https://gerrit.wikimedia.org/r/1084145 (https://phabricator.wikimedia.org/T373579) (owner: Arnaudb)
[20:57:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T376905)', diff saved to https://phabricator.wikimedia.org/P70653 and previous config saved to /var/cache/conftool/dbconfig/20241029-205718-ladsgroup.json
[20:59:45] ops-magru, SRE, Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10274774 (RobH)
[21:08:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T376905)', diff saved to https://phabricator.wikimedia.org/P70654 and previous config saved to /var/cache/conftool/dbconfig/20241029-210855-ladsgroup.json
[21:22:33] SRE, SRE-Access-Requests: Access to ops mailing list - https://phabricator.wikimedia.org/T378484#10274826 (Dzahn) The easiest way is to go to https://lists.wikimedia.org/postorius/lists/ops.lists.wikimedia.org/ and fill out the "Subscribe" form at the bottom. That will notify the list admins for approval.
[21:24:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P70655 and previous config saved to /var/cache/conftool/dbconfig/20241029-212402-ladsgroup.json
[21:26:34] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Backup, DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10274838 (jcrespo) I can wait for that a few extra days. But I need certainty on dates or know that a new testing period is in front of us that could...
[21:36:13] (PS3) Gmodena: data-engineering: hdfs: alert on rate of rcp calls [alerts] - https://gerrit.wikimedia.org/r/1084098
[21:37:35] (PS4) Gmodena: data-engineering: hdfs: alert on rate of rcp calls [alerts] - https://gerrit.wikimedia.org/r/1084098 (https://phabricator.wikimedia.org/T376713)
[21:39:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P70656 and previous config saved to /var/cache/conftool/dbconfig/20241029-213910-ladsgroup.json
[21:44:46] SRE, LDAP-Access-Requests: Grant Access to ldap/nda for Deepesha Burse WMDE - https://phabricator.wikimedia.org/T378182#10274885 (Dzahn) Almost certainly this request should be for the following: - LDAP group nda (after NDA is signed) - LDAP group wmde (normal for all other WMDE staff) - WMF-NDA group...
[21:44:47] SRE, SRE-Access-Requests: Requesting access to 'deployment' for 'Joely Rooke WMDE' - https://phabricator.wikimedia.org/T378082#10274886 (Dzahn) a:JoelyRooke-WMDE
[21:45:20] (CR) Dzahn: [C:+1] admin: add cdobbins to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1084218 (https://phabricator.wikimedia.org/T378517) (owner: CDobbins)
[21:50:58] (PS1) Bking: search platform: add config for new search platform hosts [puppet] - https://gerrit.wikimedia.org/r/1084253 (https://phabricator.wikimedia.org/T378031)
[21:54:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T376905)', diff saved to https://phabricator.wikimedia.org/P70657 and previous config saved to /var/cache/conftool/dbconfig/20241029-215417-ladsgroup.json
[21:54:23] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2188.codfw.wmnet with reason: Maintenance
[21:54:37] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2188.codfw.wmnet with reason: Maintenance
[21:54:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T376905)', diff saved to https://phabricator.wikimedia.org/P70658 and previous config saved to /var/cache/conftool/dbconfig/20241029-215443-ladsgroup.json
[21:56:41] (CR) Ryan Kemper: [C:+1] search platform: add config for new search platform hosts [puppet] - https://gerrit.wikimedia.org/r/1084253 (https://phabricator.wikimedia.org/T378031) (owner: Bking)
[21:57:01] (CR) Bking: [C:+2] search platform: add config for new search platform hosts [puppet] - https://gerrit.wikimedia.org/r/1084253 (https://phabricator.wikimedia.org/T378031) (owner: Bking)
[22:01:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T376905)', diff saved to https://phabricator.wikimedia.org/P70659 and previous config saved to /var/cache/conftool/dbconfig/20241029-220156-ladsgroup.json
[22:15:28] (PS1) Zabe: Revert "Skin: [BREAKING CHANGE] Remove support for rendering outside body element" [core] (wmf/1.44.0-wmf.1) - https://gerrit.wikimedia.org/r/1084255 (https://phabricator.wikimedia.org/T378531)
[22:16:49] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:17:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P70660 and previous config saved to /var/cache/conftool/dbconfig/20241029-221703-ladsgroup.json
[22:17:39] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:32:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P70661 and previous config saved to /var/cache/conftool/dbconfig/20241029-223210-ladsgroup.json
[22:41:00] (CR) Jdlrobson: "I'm not convinced this is the source - since the error is occurring in Mustache template parser and this change didn't touch that..." [core] (wmf/1.44.0-wmf.1) - https://gerrit.wikimedia.org/r/1084255 (https://phabricator.wikimedia.org/T378531) (owner: Zabe)
[22:47:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T376905)', diff saved to https://phabricator.wikimedia.org/P70662 and previous config saved to /var/cache/conftool/dbconfig/20241029-224717-ladsgroup.json
[22:47:23] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2202.codfw.wmnet with reason: Maintenance
[22:47:37] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2202.codfw.wmnet with reason: Maintenance
[22:49:07] ops-eqiad, SRE, Data-Persistence, Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10275130 (jcrespo) T262388 is the bug, but I couldn't fix it because I couldn't reproduce it at the time. I highly recommend double checking data afte...
[23:00:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T376905)', diff saved to https://phabricator.wikimedia.org/P70664 and previous config saved to /var/cache/conftool/dbconfig/20241029-230020-ladsgroup.json
[23:02:06] (PS1) Ladsgroup: Add UAE user group mobile domain [dns] - https://gerrit.wikimedia.org/r/1084263 (https://phabricator.wikimedia.org/T152882)
[23:15:08] (CR) Ladsgroup: [V:+2 C:+2] Add UAE user group mobile domain [dns] - https://gerrit.wikimedia.org/r/1084263 (https://phabricator.wikimedia.org/T152882) (owner: Ladsgroup)
[23:15:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P70665 and previous config saved to /var/cache/conftool/dbconfig/20241029-231527-ladsgroup.json
[23:17:29] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is e61f3a33a9e51aee245c2b294a738f7c460a3f42, dns.git is 3f5945bb0b6f61c43857b638da7c5e0696e3addd) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[23:17:31] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is e61f3a33a9e51aee245c2b294a738f7c460a3f42, dns.git is 3f5945bb0b6f61c43857b638da7c5e0696e3addd) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[23:17:31] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is e61f3a33a9e51aee245c2b294a738f7c460a3f42, dns.git is 3f5945bb0b6f61c43857b638da7c5e0696e3addd) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[23:17:31] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is e61f3a33a9e51aee245c2b294a738f7c460a3f42, dns.git is 3f5945bb0b6f61c43857b638da7c5e0696e3addd) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[23:17:33] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is e61f3a33a9e51aee245c2b294a738f7c460a3f42, dns.git is 3f5945bb0b6f61c43857b638da7c5e0696e3addd) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[23:18:08] Amir1: ^ wow that new alert doesn't mess around :)
[23:18:09] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is e61f3a33a9e51aee245c2b294a738f7c460a3f42, dns.git is 3f5945bb0b6f61c43857b638da7c5e0696e3addd) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[23:18:09] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is e61f3a33a9e51aee245c2b294a738f7c460a3f42, dns.git is 3f5945bb0b6f61c43857b638da7c5e0696e3addd) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[23:18:11] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is e61f3a33a9e51aee245c2b294a738f7c460a3f42, dns.git is 3f5945bb0b6f61c43857b638da7c5e0696e3addd) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[23:18:11] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is e61f3a33a9e51aee245c2b294a738f7c460a3f42, dns.git is 3f5945bb0b6f61c43857b638da7c5e0696e3addd) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[23:18:11] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is e61f3a33a9e51aee245c2b294a738f7c460a3f42, dns.git is 3f5945bb0b6f61c43857b638da7c5e0696e3addd) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[23:18:11] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is e61f3a33a9e51aee245c2b294a738f7c460a3f42, dns.git is 3f5945bb0b6f61c43857b638da7c5e0696e3addd) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[23:18:28] wow
[23:18:45] rzl: I just did the sync now, it should resolve in a sec
[23:19:12] that was fast
[23:22:29] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[23:22:29] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[23:22:31] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[23:22:31] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[23:22:33] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[23:23:09] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[23:23:09] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[23:23:11] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[23:23:11] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[23:23:11] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[23:23:11] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[23:23:16] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Backup, DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10275168 (wiki_willy) Thanks for the context, Jaime. Based on your current needs and with the time constraints, it sounds like it'll be better havin...
[23:28:39] rzl: recovered, sorry for the noise
[23:30:09] (PS1) Zabe: Initial configuration for tcywiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1084266 (https://phabricator.wikimedia.org/T377922)
[23:30:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P70666 and previous config saved to /var/cache/conftool/dbconfig/20241029-233034-ladsgroup.json
[23:30:35] Amir1: haha no worries, I just hadn't seen it fire yet, I was impressed by how prompt it was
[23:35:35] (CR) Zabe: [C:+2] Initial configuration for tcywiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1084266 (https://phabricator.wikimedia.org/T377922) (owner: Zabe)
[23:36:16] (Merged) jenkins-bot: Initial configuration for tcywiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1084266 (https://phabricator.wikimedia.org/T377922) (owner: Zabe)
[23:37:59] sorry folks :)
[23:38:04] the idea for it is to be prompt
[23:38:17] if you think we should tone it down a bit please let me know
[23:38:22] check_interval is 5, retry is 1
[23:39:59] maybe we can bump the retry a bit -- open to suggestions
[23:41:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at codfw: 11.53% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[23:41:18] !log zabe@deploy2002 Started scap sync-world: Creating tcywiktionary (T377922)
[23:41:26] T377922: Create Wiktionary Tulu - https://phabricator.wikimedia.org/T377922
[23:43:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[23:44:43] (CR) Aleksandar Mastilovic: "I can't really "review" this since reading the changes was mostly educational for me and I have to take the changes for granted, but I do " [deployment-charts] - https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) (owner: Brouberol)
[23:45:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T376905)', diff saved to https://phabricator.wikimedia.org/P70667 and previous config saved to /var/cache/conftool/dbconfig/20241029-234541-ladsgroup.json
[23:45:47] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2216.codfw.wmnet with reason: Maintenance
[23:46:01] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2216.codfw.wmnet with reason: Maintenance
[23:46:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T376905)', diff saved to https://phabricator.wikimedia.org/P70668 and previous config saved to /var/cache/conftool/dbconfig/20241029-234608-ladsgroup.json
[23:46:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at codfw: 24.33% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[23:48:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[23:48:45] !log zabe@deploy2002 Finished scap sync-world: Creating tcywiktionary (T377922) (duration: 07m 26s)
[23:48:49] T377922: Create Wiktionary Tulu - https://phabricator.wikimedia.org/T377922
[23:53:18] !log zabe@mwmaint2002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=tcywiktionary --cluster=all 2>&1 | tee /tmp/tcywiktionary.UpdateSearchIndexConfig.log # T377922
[23:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:53:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T376905)', diff saved to https://phabricator.wikimedia.org/P70669 and previous config saved to /var/cache/conftool/dbconfig/20241029-235326-ladsgroup.json
[23:55:08] (CR) Cwhite: [C:-1] "What about the certificates in `/etc/prometheus/ssl`?" [puppet] - https://gerrit.wikimedia.org/r/1084199 (https://phabricator.wikimedia.org/T359293) (owner: Herron)
[23:59:58] (PS1) Zabe: Initial configuration for tcywikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/1084269 (https://phabricator.wikimedia.org/T377919)