[00:00:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2215', diff saved to https://phabricator.wikimedia.org/P70731 and previous config saved to /var/cache/conftool/dbconfig/20241031-000000-ladsgroup.json [00:04:54] (03CR) 10Bking: [C:03+1] "+1, you already addressed my concern in Slack (https://wikimedia.slack.com/archives/C055QGPTC69/p1730218263581889)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) (owner: 10Brouberol) [00:09:38] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10279554 (10Papaul) @MatthewVernon hey even after fixing the disk issue, i am still getting the same error. Is there anything that i am missing here? Than... [00:15:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2215', diff saved to https://phabricator.wikimedia.org/P70732 and previous config saved to /var/cache/conftool/dbconfig/20241031-001507-ladsgroup.json [00:18:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: CPU 2 machine check error detected for rdb1014.eqiad.wmnet - https://phabricator.wikimedia.org/T376961#10279568 (10Jclark-ctr) 05Open→03Resolved Cpu2 and main board replaced today by tech [00:30:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2215 (T376905)', diff saved to https://phabricator.wikimedia.org/P70733 and previous config saved to /var/cache/conftool/dbconfig/20241031-003014-ladsgroup.json [00:38:39] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1084928 [00:38:39] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1084928 (owner: 10TrainBranchBot) [00:43:14] (03CR) 10Ssingh: [C:03+1] "[fwiw, I think it's 
acceptable to simply rm -rf the directory since it's just a single host but I see why you might want to do that throug" [puppet] - 10https://gerrit.wikimedia.org/r/1084893 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [00:43:56] (03CR) 10Ssingh: [C:03+1] "(I don't think you need it here but I wanted to also point out https://www.puppet.com/docs/puppet/7/types/file.html#file-attribute-recurse" [puppet] - 10https://gerrit.wikimedia.org/r/1084893 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [01:00:04] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1084281 (owner: 10TrainBranchBot) [01:09:02] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1084930 [01:09:02] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1084930 (owner: 10TrainBranchBot) [01:09:30] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1084928 (owner: 10TrainBranchBot) [01:38:27] !log krinkle@deploy2002 Started deploy [integration/docroot@a2c044c]: T378542 [01:38:33] T378542: ASC/DESC sorting on docroot index - https://phabricator.wikimedia.org/T378542 [01:38:36] !log krinkle@deploy2002 Finished deploy [integration/docroot@a2c044c]: T378542 (duration: 00m 23s) [01:39:35] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1084930 (owner: 10TrainBranchBot) [01:42:41] !log krinkle@mwmaint2001$ Purge https://doc.wikimedia.org/lib/wmui-page.css via `mwscript extensions/WikimediaMaintenance/purgeUrls.php`, T257188 T378542 [01:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:42:47] T257188: Shorten unconditional 1 hour cache from doc.wikimedia.org (outdated coverage report, JS error, mismatching styles) - 
https://phabricator.wikimedia.org/T257188 [01:45:48] !log krinkle@deploy2002 Started deploy [integration/docroot@0b03488]: (no justification provided) [01:45:57] !log krinkle@deploy2002 Finished deploy [integration/docroot@0b03488]: (no justification provided) (duration: 00m 10s) [01:51:49] FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [01:55:54] (03PS1) 10Pppery: Configure new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084932 (https://phabricator.wikimedia.org/T378463) [01:57:33] (03PS2) 10Pppery: Configure new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084932 (https://phabricator.wikimedia.org/T378463) [02:37:36] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:46:01] (03CR) 10Cwhite: Profiler: centralize metrics send to a function (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081460 (owner: 10Cwhite) [03:02:36] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:17:36] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:51:55] (03PS5) 10Anzx: tcywikisource: add logo [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1084306 (https://phabricator.wikimedia.org/T378555) [04:52:39] (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084306 (https://phabricator.wikimedia.org/T378555) (owner: 10Anzx) [04:54:00] (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084306 (https://phabricator.wikimedia.org/T378555) (owner: 10Anzx) [04:54:48] (03PS7) 10Anzx: tcywiktionary: add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084307 (https://phabricator.wikimedia.org/T378556) [04:57:33] (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084307 (https://phabricator.wikimedia.org/T378556) (owner: 10Anzx) [05:46:37] (03PS2) 10Anzx: tcywiktionary: add logos, SITENAME and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084940 (https://phabricator.wikimedia.org/T378556) [05:46:44] (03PS3) 10Anzx: tcywikisource: Add logos, namespaces, SITENAME and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084939 (https://phabricator.wikimedia.org/T378555) [05:48:59] (03PS4) 10Anzx: tcywikisource: Add logos, namespaces, SITENAME and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084939 (https://phabricator.wikimedia.org/T378555) [05:51:49] FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241031T0600) [06:00:05] marostegui, Amir1, and arnaudb: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241031T0600). 
[06:10:04] (03PS5) 10Anzx: tcywiktionary: add logos, SITENAME and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084940 (https://phabricator.wikimedia.org/T378556) [06:13:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 31 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084940 (https://phabricator.wikimedia.org/T378556) (owner: 10Anzx) [06:15:38] (03PS6) 10Anzx: tcywiktionary: add SITENAME and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084940 (https://phabricator.wikimedia.org/T378556) [06:16:05] (03PS5) 10Anzx: tcywikisource: Add namespaces, SITENAME and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084939 (https://phabricator.wikimedia.org/T378555) [06:17:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 31 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084939 (https://phabricator.wikimedia.org/T378555) (owner: 10Anzx) [06:17:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 31 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084307 (https://phabricator.wikimedia.org/T378556) (owner: 10Anzx) [06:18:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 31 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084306 (https://phabricator.wikimedia.org/T378555) (owner: 10Anzx) [07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - 
https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:17:36] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:18:42] (03CR) 10Fabfur: "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1084893 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [07:18:44] (03CR) 10Fabfur: [C:03+2] haproxykafka: ensure directories are removed when ensure=>absent [puppet] - 10https://gerrit.wikimedia.org/r/1084893 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [07:26:56] (03PS1) 10Fabfur: hiera: enable haproxykafka on cp3066 and cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1085308 (https://phabricator.wikimedia.org/T377614) [07:28:06] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1085308 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [07:41:48] (03CR) 10Fabfur: [C:04-1] "This needs to be applied after https://gitlab.wikimedia.org/repos/sre/haproxykafka/-/merge_requests/75" [puppet] - 10https://gerrit.wikimedia.org/r/1085308 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [07:44:32] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1234 crashed - faulty memory stick on A6 (0x4E42) - https://phabricator.wikimedia.org/T378267#10279786 (10ABran-WMF) 05Resolved→03Open a:05VRiley-WMF→03ABran-WMF I'm 
unable to connect to the server, whether on ipmi or on SSH, reopening to troubleshoot [07:48:15] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1234 crashed - faulty memory stick on A6 (0x4E42) - https://phabricator.wikimedia.org/T378267#10279794 (10VRiley-WMF) My apologies. Please try again. It seemed that the ethernet wasn't fully seated. [08:00:04] Amir1, Urbanecm, and awight: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241031T0800). [08:00:04] anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:36] * anzx o/ [08:01:05] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-presto1020.eqiad.wmnet with OS bullseye [08:03:11] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1234 crashed - faulty memory stick on A6 (0x4E42) - https://phabricator.wikimedia.org/T378267#10279812 (10ABran-WMF) ah! this will simplify the debugging indeed, db1234 is up and alive, thanks @VRiley-WMF ! 
[08:12:50] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 56258 [08:13:34] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'configure' for AS: 56258 [08:21:35] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1019.eqiad.wmnet with OS bullseye [08:23:07] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1016.eqiad.wmnet with OS bullseye [08:23:33] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-presto1016.eqiad.wmnet with OS bullseye [08:35:19] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10279828 (10elukey) It seems that https://puppetboard.wikimedia.org/node/ms-be2083.codfw.wmnet shows 4 entries for `accounts` under the `swift_disks` fact... [09:07:25] !log importing haproxykafka 0.3 package into apt repository (T377613) [09:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:33] T377613: Provide Debian packetization - https://phabricator.wikimedia.org/T377613 [09:08:44] (03CR) 10Fabfur: "haproxykafka 0.3 package built and imported successfully into apt repos" [puppet] - 10https://gerrit.wikimedia.org/r/1085308 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [09:32:57] !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=aux-k8s-ctrl1003.eqiad.wmnet [09:34:05] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=aux-k8s-worker1003.eqiad.wmnet [09:34:26] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [09:34:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [09:34:47] !log 
ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1179 (T376905)', diff saved to https://phabricator.wikimedia.org/P70734 and previous config saved to /var/cache/conftool/dbconfig/20241031-093446-ladsgroup.json [09:35:22] !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-ctrl1003.eqiad.wmnet [09:35:23] !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-ctrl1003.eqiad.wmnet [09:35:49] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-ctrl1003.eqiad.wmnet with OS bookworm [09:36:05] (03CR) 10Giuseppe Lavagetto: [C:03+1] "LGTM; however I was wondering if we shouldn't make this a more general mechanism to send users to our "next" release on k8s, so make the c" [puppet] - 10https://gerrit.wikimedia.org/r/1072821 (https://phabricator.wikimedia.org/T377042) (owner: 10Scott French) [09:37:17] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64610/IPv4: Active - aux-k8s-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:38:17] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64610/IPv4: Active - aux-k8s-eqiad, AS64610/IPv6: Connect - aux-k8s-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:38:34] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack: Upload redfish licenses to supermicro hosts - https://phabricator.wikimedia.org/T376121#10280005 (10elukey) The mac address that Supermicro provided to us on the server's label is not correct, the last digit that we have is 8 meanwhile the M... [09:39:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10280007 (10elukey) Found the issue with 1044 (see T376121#10280005), I'll post an update as soon as Supermicro replies with the correct license. 
[09:43:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T376905)', diff saved to https://phabricator.wikimedia.org/P70735 and previous config saved to /var/cache/conftool/dbconfig/20241031-094314-ladsgroup.json [09:47:23] (03PS6) 10Arnaudb: mysql_legacy: fix _list_host_instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/1084132 (https://phabricator.wikimedia.org/T374191) [09:47:36] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-ctrl1003.eqiad.wmnet with reason: host reimage [09:49:59] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-ctrl1003.eqiad.wmnet with reason: host reimage [09:51:49] FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [09:57:40] (03CR) 10CI reject: [V:04-1] mysql_legacy: fix _list_host_instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/1084132 (https://phabricator.wikimedia.org/T374191) (owner: 10Arnaudb) [09:58:18] (03PS1) 10Sergio Gimeno: Set username in user mock and reset state after test [extensions/Wikibase] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085339 (https://phabricator.wikimedia.org/T378573) [09:58:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P70736 and previous config saved to /var/cache/conftool/dbconfig/20241031-095821-ladsgroup.json [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241031T1000) [10:02:15] (03PS1) 10Slyngshede: Docker image: Include missing dependencies, allow SQLite support. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1085341 [10:02:43] (03PS1) 10Sergio Gimeno: Set username in user mock and reset state after test [extensions/Wikibase] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085342 (https://phabricator.wikimedia.org/T378573) [10:03:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db1232 in db1234 for T378267', diff saved to https://phabricator.wikimedia.org/P70737 and previous config saved to /var/cache/conftool/dbconfig/20241031-100301-arnaudb.json [10:03:07] T378267: db1234 crashed - faulty memory stick on A6 (0x4E42) - https://phabricator.wikimedia.org/T378267 [10:03:20] (03Abandoned) 10Sergio Gimeno: Set username in user mock and reset state after test [extensions/Wikibase] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085339 (https://phabricator.wikimedia.org/T378573) (owner: 10Sergio Gimeno) [10:04:08] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [10:04:12] (03PS1) 10Sergio Gimeno: Fix and re-enable selenium test [extensions/Wikibase] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085343 (https://phabricator.wikimedia.org/T378581) [10:04:16] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db1232.eqiad.wmnet onto db1234.eqiad.wmnet [10:04:48] (03PS1) 10Sergio Gimeno: Fix selenium test loading the wrong talk page [extensions/Wikibase] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085344 [10:05:13] (03PS1) 10Sergio Gimeno: build: Suppress phan issue with null for Message::numParams [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085345 [10:05:35] (03PS1) 10Sergio Gimeno: HomepageHooks: do not store assigned variant on account creation [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085346 
(https://phabricator.wikimedia.org/T377713) [10:06:02] (03PS1) 10Sergio Gimeno: SpecialHomepage: show community update module based on variant [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085347 (https://phabricator.wikimedia.org/T377233) [10:06:31] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-ctrl1003.eqiad.wmnet with OS bookworm [10:09:08] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [10:09:49] (03PS2) 10Sergio Gimeno: HomepageHooks: do not store assigned variant on account creation [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085346 (https://phabricator.wikimedia.org/T377713) [10:10:27] (03PS3) 10Sergio Gimeno: HomepageHooks: do not store assigned variant on account creation [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085346 (https://phabricator.wikimedia.org/T377713) [10:10:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 31 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/Wikibase] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085342 (https://phabricator.wikimedia.org/T378573) (owner: 10Sergio Gimeno) [10:11:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 31 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/Wikibase] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085343 (https://phabricator.wikimedia.org/T378581) (owner: 10Sergio Gimeno) [10:12:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 31 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" 
[extensions/Wikibase] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085344 (owner: 10Sergio Gimeno) [10:12:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 31 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085345 (owner: 10Sergio Gimeno) [10:13:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 31 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085347 (https://phabricator.wikimedia.org/T377233) (owner: 10Sergio Gimeno) [10:13:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P70738 and previous config saved to /var/cache/conftool/dbconfig/20241031-101328-ladsgroup.json [10:13:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 31 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085346 (https://phabricator.wikimedia.org/T377713) (owner: 10Sergio Gimeno) [10:15:29] (03PS2) 10Slyngshede: Docker image: Include missing dependencies, allow SQLite support. [software/bitu] - 10https://gerrit.wikimedia.org/r/1085341 [10:21:32] (03CR) 10Slyngshede: [C:03+2] Docker image: Include missing dependencies, allow SQLite support. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1085341 (owner: 10Slyngshede) [10:22:57] (03CR) 10CI reject: [V:04-1] Fix and re-enable selenium test [extensions/Wikibase] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085343 (https://phabricator.wikimedia.org/T378581) (owner: 10Sergio Gimeno) [10:24:12] (03PS2) 10Sergio Gimeno: Fix and re-enable selenium test [extensions/Wikibase] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085343 (https://phabricator.wikimedia.org/T378581) [10:24:18] (03CR) 10CI reject: [V:04-1] HomepageHooks: do not store assigned variant on account creation [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085346 (https://phabricator.wikimedia.org/T377713) (owner: 10Sergio Gimeno) [10:24:32] (03Merged) 10jenkins-bot: Docker image: Include missing dependencies, allow SQLite support. [software/bitu] - 10https://gerrit.wikimedia.org/r/1085341 (owner: 10Slyngshede) [10:24:33] (03PS2) 10Sergio Gimeno: Fix selenium test loading the wrong talk page [extensions/Wikibase] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085344 [10:26:32] (03CR) 10CI reject: [V:04-1] build: Suppress phan issue with null for Message::numParams [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085345 (owner: 10Sergio Gimeno) [10:28:08] (03CR) 10CI reject: [V:04-1] SpecialHomepage: show community update module based on variant [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085347 (https://phabricator.wikimedia.org/T377233) (owner: 10Sergio Gimeno) [10:28:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T376905)', diff saved to https://phabricator.wikimedia.org/P70739 and previous config saved to /var/cache/conftool/dbconfig/20241031-102835-ladsgroup.json [10:28:40] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance 
[10:28:43] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [10:33:47] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance [10:34:00] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance [10:34:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1224 (T376905)', diff saved to https://phabricator.wikimedia.org/P70740 and previous config saved to /var/cache/conftool/dbconfig/20241031-103406-ladsgroup.json [10:39:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377942#10280225 (10phaultfinder) [10:40:39] (03Abandoned) 10Umherirrender: build: Suppress phan issue with null for Message::numParams [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085345 (owner: 10Sergio Gimeno) [10:40:48] (03PS4) 10Sergio Gimeno: HomepageHooks: do not store assigned variant on account creation [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085346 (https://phabricator.wikimedia.org/T377713) [10:44:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T376905)', diff saved to https://phabricator.wikimedia.org/P70741 and previous config saved to /var/cache/conftool/dbconfig/20241031-104404-ladsgroup.json [10:44:52] (03CR) 10Hnowlan: [C:03+1] shellbox-syntaxhighlight: add "migration" in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081266 (https://phabricator.wikimedia.org/T375243) (owner: 10Scott French) [10:52:14] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db[1232,1234].eqiad.wmnet with reason: hosts in cloning, avoiding alerts [10:52:18] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
4:00:00 on db[1232,1234].eqiad.wmnet with reason: hosts in cloning, avoiding alerts [10:56:55] !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-ctrl1003.eqiad.wmnet [10:56:56] !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-ctrl1003.eqiad.wmnet [10:58:31] !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-worker1002.eqiad.wmnet [10:58:32] !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-worker1002.eqiad.wmnet [10:59:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P70742 and previous config saved to /var/cache/conftool/dbconfig/20241031-105910-ladsgroup.json [10:59:46] (03PS7) 10Volans: mysql_legacy: fix list_host_instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/1084132 (https://phabricator.wikimedia.org/T374191) (owner: 10Arnaudb) [11:00:08] !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker1002.eqiad.wmnet [11:01:25] !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-worker1002.eqiad.wmnet [11:02:46] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1002.eqiad.wmnet with OS bookworm [11:04:17] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64610/IPv4: Active - aux-k8s-eqiad, AS64610/IPv6: Connect - aux-k8s-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:04:19] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64610/IPv4: Active - aux-k8s-eqiad, AS64610/IPv6: Active - aux-k8s-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:05:11] this is me --^ [11:07:24] (03CR) 10Vgutierrez: [C:03+1] hiera: enable haproxykafka on cp3066 and cp4037 [puppet] - 
10https://gerrit.wikimedia.org/r/1085308 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [11:14:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P70743 and previous config saved to /var/cache/conftool/dbconfig/20241031-111417-ladsgroup.json [11:16:59] (03CR) 10Fabfur: [C:03+2] hiera: enable haproxykafka on cp3066 and cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1085308 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [11:17:36] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:17:37] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker1002.eqiad.wmnet with reason: host reimage [11:17:50] !log install haproxykafka on cp4037 and cp3066 (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1085308) (T378578) [11:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:54] T378578: Rollout haproxykafka on all hosts - https://phabricator.wikimedia.org/T378578 [11:20:59] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker1002.eqiad.wmnet with reason: host reimage [11:21:43] FIRING: OtelCollectorEnqueuedSpans: Some spans have been enqueued by exporter otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorEnqueuedSpans [11:24:46] (03PS1) 10Fabfur: Revert "hiera: enable haproxykafka on cp3066 and cp4037" [puppet] - 10https://gerrit.wikimedia.org/r/1085355 [11:25:28] (03CR) 10Fabfur: [C:03+2] Revert "hiera: enable haproxykafka on cp3066 and cp4037" [puppet] - 10https://gerrit.wikimedia.org/r/1085355 (owner: 10Fabfur) [11:26:06] !log 
reverted previous action (T378578) [11:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:11] T378578: Rollout haproxykafka on all hosts - https://phabricator.wikimedia.org/T378578 [11:29:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T376905)', diff saved to https://phabricator.wikimedia.org/P70744 and previous config saved to /var/cache/conftool/dbconfig/20241031-112924-ladsgroup.json [11:29:29] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [11:29:43] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [11:33:45] RECOVERY - mysqld processes on db1234 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [11:33:56] FIRING: JobUnavailable: Reduced availability for job haproxykafka in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:34:01] RECOVERY - MariaDB Replica SQL: s1 on db1234 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:34:15] RECOVERY - MariaDB Replica IO: s1 on db1234 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:34:37] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1237.eqiad.wmnet with reason: Maintenance [11:34:50] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1237.eqiad.wmnet with reason: Maintenance [11:34:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1237 (T376905)', diff saved to 
https://phabricator.wikimedia.org/P70746 and previous config saved to /var/cache/conftool/dbconfig/20241031-113456-ladsgroup.json [11:36:41] 10ops-eqiad, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T378692 (10phaultfinder) 03NEW [11:37:36] FIRING: [2x] JobUnavailable: Reduced availability for job haproxykafka in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:37:40] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10280310 (10VRiley-WMF) ms-be1083 Rack A4 U 37 Port 10 CableID: 202431 ms-be1085 Rack: B2 U 11 Port 7 CableID 5073 ms-be1084 Rack: B4 U 12 Port 18 Cab... [11:38:43] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker1002.eqiad.wmnet with OS bookworm [11:41:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1237 (T376905)', diff saved to https://phabricator.wikimedia.org/P70747 and previous config saved to /var/cache/conftool/dbconfig/20241031-114158-ladsgroup.json [11:42:03] (03PS1) 10Stevemunene: netboot: create dedicated partman recipe for certain presto workers [puppet] - 10https://gerrit.wikimedia.org/r/1085357 (https://phabricator.wikimedia.org/T374924) [11:44:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377942#10280330 (10phaultfinder) [11:45:22] (03PS2) 10Sergio Gimeno: SpecialHomepage: show community update module based on variant [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085347 (https://phabricator.wikimedia.org/T377233) [11:47:16] (03PS1) 10Sergio Gimeno: [DNM] Test CI [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085358 [11:47:51] !log 
arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1232.eqiad.wmnet onto db1234.eqiad.wmnet [11:54:22] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database rskwiki (T375016) [11:54:27] T375016: Prepare and check storage layer for rskwiki - https://phabricator.wikimedia.org/T375016 [11:57:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1237', diff saved to https://phabricator.wikimedia.org/P70750 and previous config saved to /var/cache/conftool/dbconfig/20241031-115705-ladsgroup.json [11:59:47] !log fnegri@cumin1002 END (ERROR) - Cookbook sre.wikireplicas.add-wiki (exit_code=97) for database rskwiki (T375016) [11:59:52] T375016: Prepare and check storage layer for rskwiki - https://phabricator.wikimedia.org/T375016 [11:59:56] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database rskwiki (T375016) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241031T1200) [12:00:06] !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database rskwiki (T375016) [12:00:14] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database tddwiki (T375016) [12:00:24] !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database tddwiki (T375016) [12:01:04] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database annwiki (T377118) [12:01:14] !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database annwiki (T377118) [12:01:31] T377118: Prepare and check storage layer for annwiki - https://phabricator.wikimedia.org/T377118 [12:01:43] FIRING: [2x] OtelCollectorEnqueuedSpans: Some spans have been enqueued by exporter otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - 
https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorEnqueuedSpans [12:02:36] RESOLVED: [2x] JobUnavailable: Reduced availability for job haproxykafka in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:06:17] !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-worker1002.eqiad.wmnet [12:06:18] !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-worker1002.eqiad.wmnet [12:09:00] (03PS6) 10Anzx: tcywikisource: Add namespaces, SITENAME and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084939 (https://phabricator.wikimedia.org/T378555) [12:10:19] (03PS1) 10Fabfur: haproxykafka: create user with puppet (and not with deb package) [puppet] - 10https://gerrit.wikimedia.org/r/1085363 (https://phabricator.wikimedia.org/T377614) [12:12:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1237', diff saved to https://phabricator.wikimedia.org/P70751 and previous config saved to /var/cache/conftool/dbconfig/20241031-121212-ladsgroup.json [12:12:49] (03CR) 10CI reject: [V:04-1] [DNM] Test CI [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085358 (owner: 10Sergio Gimeno) [12:16:19] (03PS1) 10Slyngshede: Fix unblock bug. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1085366 (https://phabricator.wikimedia.org/T378693) [12:27:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1237 (T376905)', diff saved to https://phabricator.wikimedia.org/P70752 and previous config saved to /var/cache/conftool/dbconfig/20241031-122719-ladsgroup.json [12:46:35] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1085363 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [12:47:33] (03CR) 10Zabe: [C:03+1] wikitech: Stop loading the i18n for LdapAuthentication, no longer used [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078105 (https://phabricator.wikimedia.org/T371592) (owner: 10Jforrester) [12:53:17] (03CR) 10Arnaudb: [C:03+1] "thanks 🙏" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1084132 (https://phabricator.wikimedia.org/T374191) (owner: 10Arnaudb) [12:56:18] (03PS2) 10Fabfur: haproxykafka: create user with puppet (and not with deb package) [puppet] - 10https://gerrit.wikimedia.org/r/1085363 (https://phabricator.wikimedia.org/T377614) [12:57:24] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1085363 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [13:00:04] Urbanecm and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241031T1300). [13:00:04] hnowlan, anzx, and sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:13] i can deploy today [13:00:15] * anzx 👋 [13:00:16] o/ [13:00:17] hi everyone! 
[13:00:18] o/ [13:00:48] just a heads-up, one of my changes is a noop cleanup and the other only takes effect once it hits the prod jobrunners [13:01:07] (03CR) 10Urbanecm: [C:03+2] Remove RunSingleJobStdin script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078700 (https://phabricator.wikimedia.org/T369048) (owner: 10Hnowlan) [13:01:18] hnowlan: ack, this is good to know [13:01:23] (03PS2) 10Hnowlan: TimedMediaHandler: use shellbox globally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084200 (https://phabricator.wikimedia.org/T357309) [13:01:45] (03Merged) 10jenkins-bot: Remove RunSingleJobStdin script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078700 (https://phabricator.wikimedia.org/T369048) (owner: 10Hnowlan) [13:01:52] (03CR) 10Urbanecm: [C:03+2] TimedMediaHandler: use shellbox globally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084200 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [13:02:12] (03CR) 10Urbanecm: [C:03+2] Set username in user mock and reset state after test [extensions/Wikibase] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085342 (https://phabricator.wikimedia.org/T378573) (owner: 10Sergio Gimeno) [13:02:12] (03CR) 10Urbanecm: [C:03+2] Fix and re-enable selenium test [extensions/Wikibase] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085343 (https://phabricator.wikimedia.org/T378581) (owner: 10Sergio Gimeno) [13:02:14] (03CR) 10Urbanecm: [C:03+2] Fix selenium test loading the wrong talk page [extensions/Wikibase] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085344 (owner: 10Sergio Gimeno) [13:02:18] (03CR) 10Urbanecm: [C:03+2] HomepageHooks: do not store assigned variant on account creation [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085346 (https://phabricator.wikimedia.org/T377713) (owner: 10Sergio Gimeno) [13:02:24] (03CR) 10Urbanecm: [C:03+2] SpecialHomepage: show community update module based on variant 
[extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085347 (https://phabricator.wikimedia.org/T377233) (owner: 10Sergio Gimeno) [13:03:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084200 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [13:03:10] (03Merged) 10jenkins-bot: TimedMediaHandler: use shellbox globally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084200 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [13:03:13] (03PS3) 10Fabfur: haproxykafka: create user with puppet (and not with deb package) [puppet] - 10https://gerrit.wikimedia.org/r/1085363 (https://phabricator.wikimedia.org/T377614) [13:03:15] (03CR) 10Herron: [C:03+2] profile::syslog::centralserver: use prometheus cert for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1084199 (https://phabricator.wikimedia.org/T359293) (owner: 10Herron) [13:04:44] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1084200|TimedMediaHandler: use shellbox globally (T357309)]], [[gerrit:1078700|Remove RunSingleJobStdin script (T369048)]] [13:04:53] T357309: Create a deployment for `shellbox-timedmedia` - https://phabricator.wikimedia.org/T357309 [13:04:54] T369048: Create maintenance script to execute jobs provided in json format from standard input - https://phabricator.wikimedia.org/T369048 [13:05:16] anzx: i see that https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1084940 also adds some import sources. is that intended? i don't see that mentioned in the commit message, hence double checking [13:05:50] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1085363 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [13:06:22] urbanecm: it was mentioned in task [13:07:08] anzx: ah, i see. thanks for confirming. 
[13:08:54] !log urbanecm@deploy2002 urbanecm, hnowlan: Backport for [[gerrit:1084200|TimedMediaHandler: use shellbox globally (T357309)]], [[gerrit:1078700|Remove RunSingleJobStdin script (T369048)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:08:58] finally [13:09:33] proceeding, given it only takes effect once at jobrunners [13:09:34] !log urbanecm@deploy2002 urbanecm, hnowlan: Continuing with sync [13:09:36] thanks [13:10:46] (03PS7) 10Anzx: tcywikisource: Add namespaces, SITENAME and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084939 (https://phabricator.wikimedia.org/T378555) [13:10:50] (03CR) 10Urbanecm: [C:03+2] tcywikisource: Add namespaces, SITENAME and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084939 (https://phabricator.wikimedia.org/T378555) (owner: 10Anzx) [13:10:56] (03PS7) 10Anzx: tcywiktionary: add SITENAME and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084940 (https://phabricator.wikimedia.org/T378556) [13:10:59] (03CR) 10Urbanecm: [C:03+2] tcywiktionary: add SITENAME and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084940 (https://phabricator.wikimedia.org/T378556) (owner: 10Anzx) [13:11:28] (03PS8) 10Anzx: tcywiktionary: add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084307 (https://phabricator.wikimedia.org/T378556) [13:11:29] (03CR) 10Urbanecm: [C:03+2] tcywiktionary: add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084307 (https://phabricator.wikimedia.org/T378556) (owner: 10Anzx) [13:11:34] (03Merged) 10jenkins-bot: tcywikisource: Add namespaces, SITENAME and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084939 (https://phabricator.wikimedia.org/T378555) (owner: 10Anzx) [13:11:42] (03Merged) 10jenkins-bot: tcywiktionary: add SITENAME and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084940 (https://phabricator.wikimedia.org/T378556) (owner: 10Anzx) [13:12:11] 
(03Merged) 10jenkins-bot: tcywiktionary: add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084307 (https://phabricator.wikimedia.org/T378556) (owner: 10Anzx) [13:14:27] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1084200|TimedMediaHandler: use shellbox globally (T357309)]], [[gerrit:1078700|Remove RunSingleJobStdin script (T369048)]] (duration: 09m 43s) [13:14:33] T357309: Create a deployment for `shellbox-timedmedia` - https://phabricator.wikimedia.org/T357309 [13:14:33] T369048: Create maintenance script to execute jobs provided in json format from standard input - https://phabricator.wikimedia.org/T369048 [13:15:53] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1084939|tcywikisource: Add namespaces, SITENAME and timezone (T378555)]], [[gerrit:1084940|tcywiktionary: add SITENAME and timezone (T378556)]], [[gerrit:1084307|tcywiktionary: add logo (T378556)]] [13:15:59] T378555: Add logos, namespaces, SITENAME and timezone for Tulu Wikisource - https://phabricator.wikimedia.org/T378555 [13:15:59] T378556: Add logos, SITENAME and timezone for Tulu Wiktionary - https://phabricator.wikimedia.org/T378556 [13:18:07] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync [13:18:29] !log urbanecm@deploy2002 anzx, urbanecm: Backport for [[gerrit:1084939|tcywikisource: Add namespaces, SITENAME and timezone (T378555)]], [[gerrit:1084940|tcywiktionary: add SITENAME and timezone (T378556)]], [[gerrit:1084307|tcywiktionary: add logo (T378556)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:18:57] urbanecm: checking [13:19:01] thanks [13:19:04] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync [13:20:48] urbanecm: checked looks good [13:20:52] great! 
[13:20:54] !log urbanecm@deploy2002 anzx, urbanecm: Continuing with sync [13:20:56] proceeding [13:22:38] urbanecm: I've realised there was an issue with my change, I have a change that I might need to get in [13:22:54] (03Merged) 10jenkins-bot: Set username in user mock and reset state after test [extensions/Wikibase] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085342 (https://phabricator.wikimedia.org/T378573) (owner: 10Sergio Gimeno) [13:22:55] hnowlan: ack, do you want me to revert the commit? or just wait for the follow-up? [13:23:08] urbanecm: the follow-up should be ready in a second, thanks [13:23:14] ack [13:23:16] (03Merged) 10jenkins-bot: Fix and re-enable selenium test [extensions/Wikibase] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085343 (https://phabricator.wikimedia.org/T378581) (owner: 10Sergio Gimeno) [13:23:19] (03Merged) 10jenkins-bot: Fix selenium test loading the wrong talk page [extensions/Wikibase] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085344 (owner: 10Sergio Gimeno) [13:24:58] (03PS1) 10Hnowlan: commons: revert accidental wmgUsePdfHandlerShellbox change, enable TMH [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085383 (https://phabricator.wikimedia.org/T357309) [13:25:32] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1084939|tcywikisource: Add namespaces, SITENAME and timezone (T378555)]], [[gerrit:1084940|tcywiktionary: add SITENAME and timezone (T378556)]], [[gerrit:1084307|tcywiktionary: add logo (T378556)]] (duration: 09m 39s) [13:25:34] urbanecm: this is the fix, thanks https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1085383 [13:25:45] (03CR) 10Urbanecm: [C:03+2] commons: revert accidental wmgUsePdfHandlerShellbox change, enable TMH [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085383 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [13:25:46] T378555: Add logos, namespaces, SITENAME and timezone for Tulu Wikisource - 
https://phabricator.wikimedia.org/T378555 [13:25:46] T378556: Add logos, SITENAME and timezone for Tulu Wiktionary - https://phabricator.wikimedia.org/T378556 [13:25:56] hnowlan: ack, proceeding. i presume that is also testable only at jobrunners? [13:26:07] urbanecm: yes, unfortunately [13:26:08] (03Merged) 10jenkins-bot: HomepageHooks: do not store assigned variant on account creation [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085346 (https://phabricator.wikimedia.org/T377713) (owner: 10Sergio Gimeno) [13:26:13] okay [13:26:36] (03Merged) 10jenkins-bot: commons: revert accidental wmgUsePdfHandlerShellbox change, enable TMH [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085383 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [13:26:49] (03Merged) 10jenkins-bot: SpecialHomepage: show community update module based on variant [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1085347 (https://phabricator.wikimedia.org/T377233) (owner: 10Sergio Gimeno) [13:28:03] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1085342|Set username in user mock and reset state after test (T378573)]], [[gerrit:1085343|Fix and re-enable selenium test (T378581)]], [[gerrit:1085344|Fix selenium test loading the wrong talk page]], [[gerrit:1085346|HomepageHooks: do not store assigned variant on account creation (T377713)]], [[gerrit:1085347|SpecialHomepage: show community update [13:28:04] module based on variant (T377233)]], [[gerrit:1085383|commons: revert accidental wmgUsePdfHandlerShellbox change, enable TMH (T357309)]] [13:28:11] T378573: Wikibase CI blocked by errors in SkinAfterPortletHandlerTest and ChangesListSpecialPageHookHandlerTest - https://phabricator.wikimedia.org/T378573 [13:28:11] T378581: Re-enable browser test in repo/tests/selenium/specs/item.js - https://phabricator.wikimedia.org/T378581 [13:28:11] T377713: Do not call 
ExperimentUserManager::setVariant on all newly registered accounts - https://phabricator.wikimedia.org/T377713 [13:28:12] T377233: Show Community updates module based on experiment variant - https://phabricator.wikimedia.org/T377233 [13:28:12] T357309: Create a deployment for `shellbox-timedmedia` - https://phabricator.wikimedia.org/T357309 [13:29:59] sergi0: backports will be at mwdebug shortly [13:30:09] ack [13:30:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.546s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:30:31] that...doesn't sound great [13:30:36] !log urbanecm@deploy2002 hnowlan, sgimeno, urbanecm: Backport for [[gerrit:1085342|Set username in user mock and reset state after test (T378573)]], [[gerrit:1085343|Fix and re-enable selenium test (T378581)]], [[gerrit:1085344|Fix selenium test loading the wrong talk page]], [[gerrit:1085346|HomepageHooks: do not store assigned variant on account creation (T377713)]], [[gerrit:1085347|SpecialHomepage: show community upda [13:30:36] te module based on variant (T377233)]], [[gerrit:1085383|commons: revert accidental wmgUsePdfHandlerShellbox change, enable TMH (T357309)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:30:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377942#10280802 (10phaultfinder) [13:31:00] urbanecm: this patch was not deployed yet https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1084306 [13:31:02] sergi0: please test [13:31:07] checking [13:31:12] looking at the parsoid issue [13:31:31] anzx: yes, i know. 
i'm processing the patches in a different order than at the calendar [13:31:41] ok [13:31:47] i'll get to that soon [13:32:51] urbanecm: OK on my end [13:32:58] thanks! [13:34:08] !log urbanecm@deploy2002 hnowlan, sgimeno, urbanecm: Continuing with sync [13:35:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.23s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:38:10] (03PS6) 10Anzx: tcywikisource: add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084306 (https://phabricator.wikimedia.org/T378555) [13:38:47] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1085342|Set username in user mock and reset state after test (T378573)]], [[gerrit:1085343|Fix and re-enable selenium test (T378581)]], [[gerrit:1085344|Fix selenium test loading the wrong talk page]], [[gerrit:1085346|HomepageHooks: do not store assigned variant on account creation (T377713)]], [[gerrit:1085347|SpecialHomepage: show community update [13:38:47] module based on variant (T377233)]], [[gerrit:1085383|commons: revert accidental wmgUsePdfHandlerShellbox change, enable TMH (T357309)]] (duration: 10m 43s) [13:38:54] T378573: Wikibase CI blocked by errors in SkinAfterPortletHandlerTest and ChangesListSpecialPageHookHandlerTest - https://phabricator.wikimedia.org/T378573 [13:38:55] T378581: Re-enable browser test in repo/tests/selenium/specs/item.js - https://phabricator.wikimedia.org/T378581 [13:38:55] T377713: Do not call ExperimentUserManager::setVariant on all newly registered accounts - https://phabricator.wikimedia.org/T377713 [13:38:55] T377233: Show Community updates module based on experiment variant - https://phabricator.wikimedia.org/T377233 [13:38:56] T357309: 
Create a deployment for `shellbox-timedmedia` - https://phabricator.wikimedia.org/T357309 [13:39:14] hnowlan: the fix should be now in production [13:39:20] thank you very much! [13:39:31] behaviour looks like what I'd expect [13:39:34] great! [13:39:51] sergi0: and `ge.utils.getUserVariant()` now produces `community-updates-module`. i guess that means all works? [13:40:14] (03CR) 10Urbanecm: [C:03+2] tcywikisource: add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084306 (https://phabricator.wikimedia.org/T378555) (owner: 10Anzx) [13:40:49] (03Merged) 10jenkins-bot: tcywikisource: add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084306 (https://phabricator.wikimedia.org/T378555) (owner: 10Anzx) [13:40:59] urbanecm: Indeed, now rows created. But the miss-match between config and code for the last 12h has produced some extra "control" rows :/ I will cleanup them later with userOptions [13:41:06] *no rows [13:41:50] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1084306|tcywikisource: add logo (T378555)]] [13:41:57] well, what can we do :/ fixing shouldn't hopefully take as long as last time [13:42:08] T378555: Add logos, namespaces, SITENAME and timezone for Tulu Wikisource - https://phabricator.wikimedia.org/T378555 [13:44:21] !log urbanecm@deploy2002 urbanecm, anzx: Backport for [[gerrit:1084306|tcywikisource: add logo (T378555)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:44:21] sergi0: should we also enable the campaigns now? or does that need to wait for...something? [13:44:27] urbanecm: checking [13:44:29] thanks [13:45:09] urbanecm: I think we're good to enable campaigns if we want. 
re cleanup: yes, only 4 wikis affected and few rows [13:45:34] urbanecm: looks good [13:46:03] !log urbanecm@deploy2002 urbanecm, anzx: Continuing with sync [13:46:22] anzx: thanks, proceeding [13:47:04] sergi0: yeah, i'm not 100% sure we want to do that now (whether deployment was the only blocker or not) [13:47:20] but if you don't know, i can check with Kirsten when she gets online [13:47:53] let's check. Thank you for the assistance :) [13:48:01] sounds good [13:50:46] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1084306|tcywikisource: add logo (T378555)]] (duration: 08m 56s) [13:50:52] T378555: Add logos, namespaces, SITENAME and timezone for Tulu Wikisource - https://phabricator.wikimedia.org/T378555 [13:50:52] anzx: and live [13:50:54] anything else? [13:51:08] urbanecm: nothing else, Thank you for deploying [13:51:12] no problem [13:51:49] FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [13:52:28] (03CR) 10Vgutierrez: [C:03+1] "looks good, but as mentioned in other CRs this would be better split in two CRs (one for user creation and another one with the socket pat" [puppet] - 10https://gerrit.wikimedia.org/r/1085363 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [14:03:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1232 (re)pooling @ 25%: post db1234.eqiad.wmnet clone', diff saved to https://phabricator.wikimedia.org/P70753 and previous config saved to /var/cache/conftool/dbconfig/20241031-140345-arnaudb.json [14:04:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [14:04:42] (03PS1) 10Elukey: role::aux_k8s: clean up after containerd migration [puppet] - 10https://gerrit.wikimedia.org/r/1085391 (https://phabricator.wikimedia.org/T378345) [14:05:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1234 (re)pooling @ 1%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70754 and previous config saved to /var/cache/conftool/dbconfig/20241031-140459-arnaudb.json [14:05:17] T378267: db1234 crashed - faulty memory stick on A6 (0x4E42) - https://phabricator.wikimedia.org/T378267 [14:06:13] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4436/co" [puppet] - 10https://gerrit.wikimedia.org/r/1085391 (https://phabricator.wikimedia.org/T378345) (owner: 10Elukey) [14:06:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 5%: post maintenance', diff saved to https://phabricator.wikimedia.org/P70755 and previous config saved to /var/cache/conftool/dbconfig/20241031-140653-arnaudb.json [14:08:42] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1234 crashed - faulty memory stick on A6 (0x4E42) - https://phabricator.wikimedia.org/T378267#10280958 (10ABran-WMF) 05Open→03Resolved host is repooling after a reclone [14:09:07] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [14:10:27] (03PS4) 10Fabfur: haproxykafka: create user with puppet (and not with deb package) [puppet] - 10https://gerrit.wikimedia.org/r/1085363 (https://phabricator.wikimedia.org/T377614) [14:11:25] !log eswiki, arwiki, cswiki, frwiki running `mwscript userOptions.php --wiki=frwiki --delete-defaults 
growthexperiments-homepage-variant` (T374664) [14:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:30] T374664: T374577: Community Updates module: Release to Growth Pilot Wikipedias - https://phabricator.wikimedia.org/T374664 [14:12:01] (03CR) 10Ssingh: haproxykafka: create user with puppet (and not with deb package) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1085363 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [14:12:12] (03CR) 10Ssingh: haproxykafka: create user with puppet (and not with deb package) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1085363 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [14:12:30] (03PS1) 10Fabfur: hiera: fix path for haproxykafka socket [puppet] - 10https://gerrit.wikimedia.org/r/1085395 (https://phabricator.wikimedia.org/T377614) [14:12:54] (03PS5) 10Fabfur: haproxykafka: create user with puppet (and not with deb package) [puppet] - 10https://gerrit.wikimedia.org/r/1085363 (https://phabricator.wikimedia.org/T377614) [14:13:03] (03CR) 10Fabfur: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1085363 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [14:13:37] (03CR) 10Elukey: [V:03+1] "Something is weird with pcc and aux-k8s-ctrl1002, namely it thinks that profile::docker::engine is added but cumin says the opposite:" [puppet] - 10https://gerrit.wikimedia.org/r/1085391 (https://phabricator.wikimedia.org/T378345) (owner: 10Elukey) [14:14:00] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1085395 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [14:14:06] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1085363 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [14:14:27] !log Running `foreachwiki userOptions.php --delete --old=sectionlevelimages growthexperiments-homepage-variant` (T375753) [14:14:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:38] T375753: Drop unnecessary growthexperiments-homepage-variant entries from user_properties at Wikimedia wikis - https://phabricator.wikimedia.org/T375753 [14:16:32] (03CR) 10Ssingh: [C:03+1] hiera: fix path for haproxykafka socket [puppet] - 10https://gerrit.wikimedia.org/r/1085395 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [14:17:49] (03CR) 10Fabfur: [C:03+2] hiera: fix path for haproxykafka socket [puppet] - 10https://gerrit.wikimedia.org/r/1085395 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [14:17:55] (03PS1) 10Cwhite: phatality: use package provider to install phatality [puppet] - 10https://gerrit.wikimedia.org/r/1085397 (https://phabricator.wikimedia.org/T342476) [14:18:31] (03CR) 10CI reject: [V:04-1] phatality: use package provider to install phatality [puppet] - 10https://gerrit.wikimedia.org/r/1085397 (https://phabricator.wikimedia.org/T342476) (owner: 10Cwhite) [14:18:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1232 (re)pooling @ 50%: post db1234.eqiad.wmnet clone', diff saved to https://phabricator.wikimedia.org/P70756 and previous config saved to /var/cache/conftool/dbconfig/20241031-141851-arnaudb.json [14:19:20] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database bclwikisource (T377087) [14:19:31] !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database bclwikisource (T377087) [14:19:31] T377087: Prepare and check storage layer for bclwikisource - https://phabricator.wikimedia.org/T377087 [14:20:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1234 (re)pooling @ 2%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70757 and previous config saved to /var/cache/conftool/dbconfig/20241031-142004-arnaudb.json [14:20:19] T378267: db1234 crashed - faulty memory stick on A6 (0x4E42) - https://phabricator.wikimedia.org/T378267 [14:21:12] !log fnegri@cumin1002 
START - Cookbook sre.wikireplicas.add-wiki for database ibawiki (T376571) [14:21:22] !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database ibawiki (T376571) [14:21:25] T376571: Prepare and check storage layer for ibawiki - https://phabricator.wikimedia.org/T376571 [14:21:52] (03PS2) 10Cwhite: phatality: use package provider to install phatality [puppet] - 10https://gerrit.wikimedia.org/r/1085397 (https://phabricator.wikimedia.org/T342476) [14:21:55] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database tcywiktionary (T378462) [14:21:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 10%: post maintenance', diff saved to https://phabricator.wikimedia.org/P70758 and previous config saved to /var/cache/conftool/dbconfig/20241031-142158-arnaudb.json [14:22:00] T378462: Prepare and check storage layer for tcywiktionary - https://phabricator.wikimedia.org/T378462 [14:22:05] !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database tcywiktionary (T378462) [14:22:27] (03CR) 10CI reject: [V:04-1] phatality: use package provider to install phatality [puppet] - 10https://gerrit.wikimedia.org/r/1085397 (https://phabricator.wikimedia.org/T342476) (owner: 10Cwhite) [14:23:20] (03PS3) 10Cwhite: phatality: use package provider to install phatality [puppet] - 10https://gerrit.wikimedia.org/r/1085397 (https://phabricator.wikimedia.org/T342476) [14:23:58] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database tcywikisource (T378469) [14:24:03] T378469: Prepare and check storage layer for tcywikisource - https://phabricator.wikimedia.org/T378469 [14:24:08] !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database tcywikisource (T378469) [14:26:11] (03CR) 10Cwhite: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1085397 (https://phabricator.wikimedia.org/T342476) (owner: 10Cwhite) 
[14:32:46] PROBLEM - Host db2190 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:32:52] on it [14:32:56] o/ [14:33:00] ack [14:33:03] thanks arnaudb, lemme know if you need help [14:33:06] !incidents [14:33:06] 5360 (UNACKED) Host db2190 (paged) - PING - Packet loss = 100% [14:33:13] !ack 5360 [14:33:14] 5360 (ACKED) Host db2190 (paged) - PING - Packet loss = 100% [14:33:14] host is unresponsive 100%, will check its topology [14:33:55] if you need help, shout [14:33:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1232 (re)pooling @ 75%: post db1234.eqiad.wmnet clone', diff saved to https://phabricator.wikimedia.org/P70759 and previous config saved to /var/cache/conftool/dbconfig/20241031-143356-arnaudb.json [14:34:06] "Seen 24h ago" Says orchestrator [14:34:40] yeah it is not pooled [14:34:43] T378628 [14:34:43] looks like a maint host [14:34:43] T378628: db2190 is not coming back online - https://phabricator.wikimedia.org/T378628 [14:34:46] ah :) [14:34:53] 24 downtime expired ? [14:34:56] so downtime expired? [14:34:59] that would be my guess [14:34:59] <-- too slow [14:35:00] yeah [14:35:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1234 (re)pooling @ 4%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70760 and previous config saved to /var/cache/conftool/dbconfig/20241031-143510-arnaudb.json [14:35:16] T378267: db1234 crashed - faulty memory stick on A6 (0x4E42) - https://phabricator.wikimedia.org/T378267 [14:35:20] will deal with it, sorry for the noise! 
elukey you can recover the incident :) [14:35:42] * Emperor resolved [14:36:11] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1084857 yet it has notifications disabled 🤔 [14:36:25] arnaudb: that's what I was discussing before [14:36:35] yeah I remember, this adds to the lead [14:36:38] AFAICT you need a real downtime, not "just" notifications disabled [14:36:43] it seems something on puppet/obs stack is missing [14:37:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 25%: post maintenance', diff saved to https://phabricator.wikimedia.org/P70761 and previous config saved to /var/cache/conftool/dbconfig/20241031-143704-arnaudb.json [14:37:10] Emperor: but on puppet notifications for the host are disabled- that may be needed, but not what I would expect [14:37:27] or at least it is a regression of how icinga worked [14:37:36] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:38] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on db2190.codfw.wmnet with reason: host has hardware issues T378628 [14:37:51] downtimed until monday (cc Amir1 ↑) [14:37:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on db2190.codfw.wmnet with reason: host has hardware issues T378628 [14:38:16] (03CR) 10Fabfur: [C:03+2] haproxykafka: create user with puppet (and not with deb package) [puppet] - 10https://gerrit.wikimedia.org/r/1085363 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [14:39:55] arnaudb: can I give a second try to the reboot? 
I don't think it will work, but we won't lose anything [14:41:05] go for it jynus it's downtimed anyway [14:43:07] I will try to do a hard reset and then a racreset, but it is not looking responsive at all [14:43:48] (03PS1) 10Fabfur: hiera: enable haproxykafka on cp3066 and cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1085399 (https://phabricator.wikimedia.org/T377614) [14:44:00] PSU issue jynus ? [14:44:48] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1085399 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [14:44:58] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2190 is not coming back online - https://phabricator.wikimedia.org/T378628#10281131 (10Ladsgroup) I disabled notifications. Downtime should have not been needed but clearly didn't work :/ [14:45:51] arnaudb: no idea, but it is very unresponsive [14:46:36] sad server is sad [14:47:14] Yeah. I tried different ways to boot it up. Via IPMI too [14:47:19] (03CR) 10Scott French: [C:03+2] shellbox-syntaxhighlight: add "migration" in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081266 (https://phabricator.wikimedia.org/T375243) (owner: 10Scott French) [14:48:08] (03CR) 10Ssingh: [C:03+1] hiera: enable haproxykafka on cp3066 and cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1085399 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [14:48:18] (03Merged) 10jenkins-bot: shellbox-syntaxhighlight: add "migration" in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081266 (https://phabricator.wikimedia.org/T375243) (owner: 10Scott French) [14:48:51] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2190 is not coming back online - https://phabricator.wikimedia.org/T378628#10281138 (10ABran-WMF) p:05Triage→03Medium [14:49:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1232 (re)pooling @ 100%: post db1234.eqiad.wmnet clone', diff saved to https://phabricator.wikimedia.org/P70762 and previous config saved to 
/var/cache/conftool/dbconfig/20241031-144902-arnaudb.json [14:50:12] arnaudb: even resetting it, it still doesn't work [14:50:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1234 (re)pooling @ 5%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70763 and previous config saved to /var/cache/conftool/dbconfig/20241031-145015-arnaudb.json [14:50:20] it isn't necessarily something toasted [14:50:21] T378267: db1234 crashed - faulty memory stick on A6 (0x4E42) - https://phabricator.wikimedia.org/T378267 [14:50:31] sometimes it just needs a power drain and it comes back [14:50:38] but it could have some part toasted [14:52:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 50%: post maintenance', diff saved to https://phabricator.wikimedia.org/P70764 and previous config saved to /var/cache/conftool/dbconfig/20241031-145209-arnaudb.json [14:53:11] (03CR) 10Tiziano Fogli: "Comments inline." [alerts] - 10https://gerrit.wikimedia.org/r/1084758 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [14:54:52] (03CR) 10Fabfur: [C:03+2] hiera: enable haproxykafka on cp3066 and cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1085399 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [14:56:59] (03PS1) 10Cwhite: beta-logs: set phatality version to 2.7.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/1085409 (https://phabricator.wikimedia.org/T342476) [14:57:36] FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:57:39] 06SRE, 06Traffic: Create provisioning and post-provisioning checks for Traffic hosts to confirm validity of varying hardware configurations - https://phabricator.wikimedia.org/T378724 (10ssingh) 03NEW [14:58:56] FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:58:58] 06SRE,
06Traffic: Create provisioning and post-provisioning checks for Traffic hosts to confirm validity of varying hardware configurations - https://phabricator.wikimedia.org/T378724#10281202 (10ssingh) [14:59:30] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [15:00:05] dduvall and dancy: #bothumor My software never has bugs. It just develops random features. Rise for Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241031T1500). [15:00:39] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [15:02:11] 06SRE, 06Traffic: Create provisioning and post-provisioning checks for Traffic hosts to confirm validity of varying hardware configurations - https://phabricator.wikimedia.org/T378724#10281206 (10ssingh) p:05Triage→03Medium [15:02:36] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:04:07] (03CR) 10Brouberol: [C:03+1] netboot: create dedicated partman recipe for certain presto workers [puppet] - 10https://gerrit.wikimedia.org/r/1085357 (https://phabricator.wikimedia.org/T374924) (owner: 10Stevemunene) [15:04:21] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2190 is not coming back online - https://phabricator.wikimedia.org/T378628#10281209 (10jcrespo) {F57663857} [15:05:04] (03PS1) 10Giuseppe Lavagetto: New deployment [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1085413 [15:05:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1234 (re)pooling @ 10%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70765 and previous config saved to /var/cache/conftool/dbconfig/20241031-150521-arnaudb.json [15:05:26] T378267: db1234 crashed - faulty memory stick on A6 (0x4E42) - 
https://phabricator.wikimedia.org/T378267 [15:05:31] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] New deployment [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1085413 (owner: 10Giuseppe Lavagetto) [15:06:53] !log oblivian@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Add tooltips to expressions - oblivian@cumin1002" [15:06:57] !log oblivian@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Add tooltips to expressions - oblivian@cumin1002 [15:06:57] (03CR) 10Volans: [C:03+2] mysql_legacy: fix list_host_instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/1084132 (https://phabricator.wikimedia.org/T374191) (owner: 10Arnaudb) [15:07:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 75%: post maintenance', diff saved to https://phabricator.wikimedia.org/P70766 and previous config saved to /var/cache/conftool/dbconfig/20241031-150714-arnaudb.json [15:07:32] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Add tooltips to expressions - oblivian@cumin1002 [15:07:33] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Add tooltips to expressions - oblivian@cumin1002" [15:07:36] FIRING: [2x] JobUnavailable: Reduced availability for job haproxykafka in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:35] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [15:08:46] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [15:15:45] !log eevans@cumin1002 START - 
Cookbook sre.dns.netbox [15:16:42] (03Merged) 10jenkins-bot: mysql_legacy: fix list_host_instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/1084132 (https://phabricator.wikimedia.org/T374191) (owner: 10Arnaudb) [15:17:44] (03PS56) 10Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T377129) [15:17:44] (03CR) 10Arnaudb: "all review points accounted for 👍" [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T377129) (owner: 10Arnaudb) [15:20:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1234 (re)pooling @ 25%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70767 and previous config saved to /var/cache/conftool/dbconfig/20241031-152026-arnaudb.json [15:20:32] T378267: db1234 crashed - faulty memory stick on A6 (0x4E42) - https://phabricator.wikimedia.org/T378267 [15:20:40] memory holds on so far :D [15:20:54] (03PS1) 10Fabfur: Revert "hiera: enable haproxykafka on cp3066 and cp4037" [puppet] - 10https://gerrit.wikimedia.org/r/1085419 [15:21:43] RESOLVED: [2x] OtelCollectorEnqueuedSpans: Some spans have been enqueued by exporter otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorEnqueuedSpans [15:22:17] (03CR) 10Fabfur: [C:03+2] Revert "hiera: enable haproxykafka on cp3066 and cp4037" [puppet] - 10https://gerrit.wikimedia.org/r/1085419 (owner: 10Fabfur) [15:22:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 100%: post maintenance', diff saved to https://phabricator.wikimedia.org/P70768 and previous config saved to /var/cache/conftool/dbconfig/20241031-152220-arnaudb.json [15:24:46] (03PS4) 10Scott French: shellbox: add migration release (all) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082572 (https://phabricator.wikimedia.org/T375243) [15:31:39] 06SRE: Netbox script for 
adding secondary IPs starts sequencing at 'b' - https://phabricator.wikimedia.org/T378730 (10Eevans) 03NEW [15:32:29] (03PS1) 10Fabfur: haproxykafka: use user function due to bug in systemd::sysusers [puppet] - 10https://gerrit.wikimedia.org/r/1085424 (https://phabricator.wikimedia.org/T377614) [15:32:36] RESOLVED: JobUnavailable: Reduced availability for job haproxykafka in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:33:30] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1085424 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [15:35:24] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T378646#10281377 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated. alert cleared. [15:35:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1234 (re)pooling @ 50%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70769 and previous config saved to /var/cache/conftool/dbconfig/20241031-153531-arnaudb.json [15:35:54] T378267: db1234 crashed - faulty memory stick on A6 (0x4E42) - https://phabricator.wikimedia.org/T378267 [15:35:56] !log eevans@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:41:34] 06SRE: Netbox script for adding secondary IPs - https://phabricator.wikimedia.org/T378730#10281408 (10Eevans) p:05Triage→03Medium [15:42:10] 06SRE, 06Infrastructure-Foundations, 10netbox: Netbox script for adding secondary IPs - https://phabricator.wikimedia.org/T378730#10281414 (10taavi) [15:43:35] (03CR) 10Samtar: [C:03+1] "lgtm, can deploy whenever :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084853 (https://phabricator.wikimedia.org/T366194) (owner: 10MusikAnimal) [15:43:51] jouncebot: nowandnext [15:43:51] For the next 0 hour(s) and 16 minute(s): Train log triage 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241031T1500) [15:43:51] In 0 hour(s) and 16 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241031T1600) [15:44:48] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2190'] [15:45:12] RECOVERY - Host db2190 #page is UP: PING OK - Packet loss = 0%, RTA = 30.38 ms [15:45:15] going to deploy a config change (1084853) unless anyone yells [15:45:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2190'] [15:45:23] 06SRE, 06Infrastructure-Foundations, 10netbox: Netbox script for adding secondary IPs - https://phabricator.wikimedia.org/T378730#10281435 (10Eevans) Also, (I just noticed) one of the secondaries is 10.64.0.0/12...? [15:45:33] nice [15:46:05] (03CR) 10Vgutierrez: [C:04-1] "please re-add the user resource check on the spec test" [puppet] - 10https://gerrit.wikimedia.org/r/1085424 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [15:46:07] (03PS4) 10DCausse: wdqs: add graph_name in query logs [puppet] - 10https://gerrit.wikimedia.org/r/1084193 (https://phabricator.wikimedia.org/T376134) [15:46:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084853 (https://phabricator.wikimedia.org/T366194) (owner: 10MusikAnimal) [15:46:35] TheresNoTime: you're fine to run over into the puppet window in 14m if you need, there's nothing planned [15:46:42] (03CR) 10Ssingh: [C:03+1] "Looks good, doc string needs an update." 
[puppet] - 10https://gerrit.wikimedia.org/r/1085424 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [15:46:53] (03Merged) 10jenkins-bot: [CommunityRequests] disable wgCommunityRequestsEnable by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084853 (https://phabricator.wikimedia.org/T366194) (owner: 10MusikAnimal) [15:47:01] ack :) should* only be a quick one ^ [15:47:16] jouncebot: nowandnext [15:47:16] For the next 0 hour(s) and 12 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241031T1500) [15:47:16] In 0 hour(s) and 12 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241031T1600) [15:47:23] !log samtar@deploy2002 Started scap sync-world: Backport for [[gerrit:1084853|[CommunityRequests] disable wgCommunityRequestsEnable by default (T366194)]] [15:47:28] T366194: Migrate Community Wishlist to CommunityRequests extension - https://phabricator.wikimedia.org/T366194 [15:47:33] may all your shoulds come true [15:49:14] rzl: mind if I go a bit more into your window? 
[15:49:49] !log samtar@deploy2002 samtar, musikanimal: Backport for [[gerrit:1084853|[CommunityRequests] disable wgCommunityRequestsEnable by default (T366194)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:49:55] * TheresNoTime looks [15:50:36] !log samtar@deploy2002 samtar, musikanimal: Continuing with sync [15:50:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1234 (re)pooling @ 75%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70770 and previous config saved to /var/cache/conftool/dbconfig/20241031-155037-arnaudb.json [15:50:46] T378267: db1234 crashed - faulty memory stick on A6 (0x4E42) - https://phabricator.wikimedia.org/T378267 [15:52:39] (03PS3) 10Majavah: Drop labtestwiki config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083304 (https://phabricator.wikimedia.org/T378260) [15:52:39] (03PS3) 10Majavah: Stop building LdapAuthentication i10n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083305 (https://phabricator.wikimedia.org/T371592) [15:52:39] (03PS2) 10Majavah: Drop 'nonglobal' dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083493 [15:53:47] (03CR) 10Jforrester: [C:04-1] Stop building LdapAuthentication i10n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083305 (https://phabricator.wikimedia.org/T371592) (owner: 10Majavah) [15:55:15] !log samtar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1084853|[CommunityRequests] disable wgCommunityRequestsEnable by default (T366194)]] (duration: 07m 51s) [15:55:19] (03Abandoned) 10Majavah: Stop building LdapAuthentication i10n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083305 (https://phabricator.wikimedia.org/T371592) (owner: 10Majavah) [15:55:22] T366194: Migrate Community Wishlist to CommunityRequests extension - https://phabricator.wikimedia.org/T366194 [15:55:43] (03PS3) 10Majavah: Drop 'nonglobal' dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083493 [15:58:02] 
Taavi: Can I deploy https://gerrit.wikimedia.org/r/1083493 for you when it is ready? I need to test a scap thingy [15:58:43] dancy: sure, but I need to deploy 1083304 first. currently still waiting confirmation that I can go over the Puppet window [15:59:13] OK Thanks. 1083304 works for me too [15:59:17] (03PS2) 10Fabfur: haproxykafka: use user function due to bug in systemd::sysusers [puppet] - 10https://gerrit.wikimedia.org/r/1085424 (https://phabricator.wikimedia.org/T377614) [15:59:39] (03CR) 10Fabfur: haproxykafka: use user function due to bug in systemd::sysusers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1085424 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [15:59:46] !log eevans@cumin1002 START - Cookbook sre.dns.netbox [15:59:53] I'm fine with both :-) [16:00:02] ok. I'll be around. [16:00:05] jhathaway and rzl: OwO what's this, a deployment window?? Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241031T1600). nyaa~ [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:02:31] jhathaway: rzl: hi! since there does not seem to be anything for the puppet window, can we please steal your window for some mediawiki patches? [16:02:48] be our guest! [16:03:00] (03CR) 10Scott French: "The equivalent patch for syntaxhighlight applies cleanly and does what's expected - i.e., deployments exist, 0 pods, serves no traffic. If" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082572 (https://phabricator.wikimedia.org/T375243) (owner: 10Scott French) [16:03:00] thx [16:03:03] dancy: go for it :-) [16:03:10] ok.. 
[16:03:29] !log eevans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Additional IPs for Cassandra — aqs1022 - eevans@cumin1002" [16:03:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083304 (https://phabricator.wikimedia.org/T378260) (owner: 10Majavah) [16:04:18] (03Merged) 10jenkins-bot: Drop labtestwiki config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083304 (https://phabricator.wikimedia.org/T378260) (owner: 10Majavah) [16:04:48] !log dancy@deploy2002 Started scap sync-world: Backport for [[gerrit:1083304|Drop labtestwiki config (T378260)]] [16:04:49] !log dancy@deploy2002 scap failed: Command '['sudo', '-u', 'mwbuilder', '-n', '--', '/home/dancy/src/venvs/scap/bin/scap', 'mwshell', '--no-local-config', '--directory', '/srv/mediawiki-staging', '--user', 'www-data', '--', 'rm -f /srv/mediawiki-staging/php-1.43.0-wmf.28/cache/l10n/*.tmp.*']' returned non-zero exit status 1. (scap version: 4.118.0) (duration: 00m 01s) [16:04:53] T378260: Retire labtestwiki - https://phabricator.wikimedia.org/T378260 [16:05:10] that's not promising [16:05:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1234 (re)pooling @ 100%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70772 and previous config saved to /var/cache/conftool/dbconfig/20241031-160542-arnaudb.json [16:05:47] T378267: db1234 crashed - faulty memory stick on A6 (0x4E42) - https://phabricator.wikimedia.org/T378267 [16:06:05] It's ok. sudo rules not set up to allow me to sudo my test scap. Makes sense. The test did "work" in some sense. [16:06:12] OK. Handing back over to you. Nothing deployed. [16:06:29] ack. and a normal scap backport should work fine now? 
[16:06:33] yeah [16:06:38] thanks [16:06:45] !log [archiva] Freed up space on `archiva1002.wikimedia.org` like so: `sudo rm -rfv /var/cache/archiva/temp* && sudo systemctl restart archiva`. We're down to 31% usage now [16:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:56] Thanks for letting me test w/ your change! [16:07:10] !log taavi@deploy2002 Started scap sync-world: Backport for [[gerrit:1083304|Drop labtestwiki config (T378260)]] [16:07:14] !log eevans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Additional IPs for Cassandra — aqs1022 - eevans@cumin1002" [16:07:14] !log eevans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:07:32] np :-) if that breaks all the wikis at least I'd had an excuse :P [16:09:38] !log taavi@deploy2002 taavi: Backport for [[gerrit:1083304|Drop labtestwiki config (T378260)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:10:02] (03CR) 10CDanis: [C:03+1] "not sure what's up with pcc but patch lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1085391 (https://phabricator.wikimedia.org/T378345) (owner: 10Elukey) [16:12:08] !log taavi@deploy2002 taavi: Continuing with sync [16:16:49] !log taavi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1083304|Drop labtestwiki config (T378260)]] (duration: 09m 39s) [16:17:03] T378260: Retire labtestwiki - https://phabricator.wikimedia.org/T378260 [16:18:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083493 (owner: 10Majavah) [16:18:49] (03Merged) 10jenkins-bot: Drop 'nonglobal' dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083493 (owner: 10Majavah) [16:19:14] !log taavi@deploy2002 Started scap sync-world: Backport for [[gerrit:1083493|Drop 'nonglobal' dblist]] [16:20:59] 06SRE, 07Epic, 
07Kubernetes: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams - https://phabricator.wikimedia.org/T378742 (10CDanis) 03NEW [16:21:05] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10281698 (10MatthewVernon) That's //not ideal// :( On a Dell system: ` mvernon@ms-be1081:~$ ls /dev/disk/by-path/*-part4 /dev/disk/by-path/pci-0000:18:00.... [16:21:45] !log taavi@deploy2002 taavi: Backport for [[gerrit:1083493|Drop 'nonglobal' dblist]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:22:35] 06SRE, 07Epic, 07Kubernetes: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams - https://phabricator.wikimedia.org/T378742#10281702 (10CDanis) [16:23:05] !log taavi@deploy2002 taavi: Continuing with sync [16:24:03] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10281707 (10MatthewVernon) >>! In T371400#10279452, @Papaul wrote: > It looks like my number 3 conclusion was the solution for the other missing 12 disk.... 
[16:24:26] RECOVERY - Disk space on archiva1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [16:26:21] (03PS1) 10Eevans: aqs1022: Provision new Cassandra host [puppet] - 10https://gerrit.wikimedia.org/r/1085430 (https://phabricator.wikimedia.org/T378725) [16:27:58] !log taavi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1083493|Drop 'nonglobal' dblist]] (duration: 08m 44s) [16:28:14] (03CR) 10Stevemunene: [C:03+2] netboot: create dedicated partman recipe for certain presto workers [puppet] - 10https://gerrit.wikimedia.org/r/1085357 (https://phabricator.wikimedia.org/T374924) (owner: 10Stevemunene) [16:31:10] (03CR) 10Arnaudb: [C:03+1] "everything looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1085430 (https://phabricator.wikimedia.org/T378725) (owner: 10Eevans) [16:31:56] (03CR) 10Eevans: [C:03+2] aqs1022: Provision new Cassandra host [puppet] - 10https://gerrit.wikimedia.org/r/1085430 (https://phabricator.wikimedia.org/T378725) (owner: 10Eevans) [16:32:23] (03CR) 10Majavah: "Yeah, labtestwiki is apparently borked enough that this could go in before the wiki is dropped. I'll abandon in favour of your patch." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083305 (https://phabricator.wikimedia.org/T371592) (owner: 10Majavah) [16:32:34] (03PS1) 10Ahmon Dancy: scap.cfg.erb: Remove unused bin_dir setting [puppet] - 10https://gerrit.wikimedia.org/r/1085431 [16:32:40] (03CR) 10Majavah: [C:03+1] wikitech: Stop loading the i18n for LdapAuthentication, no longer used [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078105 (https://phabricator.wikimedia.org/T371592) (owner: 10Jforrester) [16:32:59] (03PS2) 10Majavah: Drop config for serving labtestwiki [puppet] - 10https://gerrit.wikimedia.org/r/1083585 (https://phabricator.wikimedia.org/T378260) [16:33:11] (03CR) 10Jforrester: "Ha. Oh well." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083305 (https://phabricator.wikimedia.org/T371592) (owner: 10Majavah) [16:34:01] (03CR) 10Vgutierrez: [C:03+1] haproxykafka: use user function due to bug in systemd::sysusers [puppet] - 10https://gerrit.wikimedia.org/r/1085424 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [16:34:31] (03CR) 10Majavah: [C:03+2] Drop config for serving labtestwiki [puppet] - 10https://gerrit.wikimedia.org/r/1083585 (https://phabricator.wikimedia.org/T378260) (owner: 10Majavah) [16:35:07] (03PS1) 10Fabfur: hiera: enable haproxykafka on cp3066 and cp4037 (again) [puppet] - 10https://gerrit.wikimedia.org/r/1085433 (https://phabricator.wikimedia.org/T377614) [16:35:54] (03CR) 10Ssingh: [C:03+1] "[nit] commit message but just merge it." [puppet] - 10https://gerrit.wikimedia.org/r/1085424 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [16:36:05] (03PS3) 10Fabfur: haproxykafka: use user resource due to bug in systemd::sysusers [puppet] - 10https://gerrit.wikimedia.org/r/1085424 (https://phabricator.wikimedia.org/T377614) [16:36:13] (03PS1) 10MVernon: set apus scrape interval to 15s [puppet] - 10https://gerrit.wikimedia.org/r/1085434 (https://phabricator.wikimedia.org/T279621) [16:36:18] (03CR) 10Ssingh: haproxykafka: use user resource due to bug in systemd::sysusers [puppet] - 10https://gerrit.wikimedia.org/r/1085424 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [16:37:00] (03CR) 10Fabfur: [C:03+2] haproxykafka: use user resource due to bug in systemd::sysusers [puppet] - 10https://gerrit.wikimedia.org/r/1085424 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [16:37:09] (03CR) 10Fabfur: [C:03+2] hiera: enable haproxykafka on cp3066 and cp4037 (again) [puppet] - 10https://gerrit.wikimedia.org/r/1085433 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [16:37:50] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
an-presto1016.eqiad.wmnet with OS bullseye [16:42:03] jouncebot: nowandnext [16:42:03] For the next 0 hour(s) and 17 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241031T1600) [16:42:03] In 0 hour(s) and 17 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241031T1700) [16:42:03] In 0 hour(s) and 17 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241031T1700) [16:42:08] (03CR) 10Dzahn: [C:03+2] scap.cfg.erb: Remove unused bin_dir setting [puppet] - 10https://gerrit.wikimedia.org/r/1085431 (owner: 10Ahmon Dancy) [16:42:20] thx mutante! [16:42:36] yw [16:45:09] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-presto1016.eqiad.wmnet with OS bullseye [16:49:33] (03CR) 10Scott French: "Thank you very much for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1072821 (https://phabricator.wikimedia.org/T377042) (owner: 10Scott French) [16:51:17] (03CR) 10Eevans: [C:03+1] set apus scrape interval to 15s [puppet] - 10https://gerrit.wikimedia.org/r/1085434 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [16:51:33] (03CR) 10MVernon: [C:03+2] set apus scrape interval to 15s [puppet] - 10https://gerrit.wikimedia.org/r/1085434 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [16:52:10] (03CR) 10Scott French: "Valentin, as we chatted about the other day, running this by you as well in case you had any thoughts. Thanks in advance!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1072821 (https://phabricator.wikimedia.org/T377042) (owner: 10Scott French) [16:52:28] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on aqs1022.eqiad.wmnet with reason: Bootstrapping — T378725 [16:52:33] T378725: Refresh aqs1013 w/ aqs1022 - https://phabricator.wikimedia.org/T378725 [16:52:36] FIRING: [8x] ProbeDown: Service aqs1022-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:52:43] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1022.eqiad.wmnet with reason: Bootstrapping — T378725 [16:53:48] RECOVERY - Disk space on snapshot1012 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=snapshot1012&var-datasource=eqiad+prometheus/ops [16:53:49] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10281866 (10Papaul) @MatthewVernon if looks like you was reading my mind. I already ask SuperMicro the same question. and I tested it too you have to enab... 
[16:53:49] (03PS1) 10Fabfur: haproxykafka: fixed default permissions for socket [puppet] - 10https://gerrit.wikimedia.org/r/1085436 (https://phabricator.wikimedia.org/T377614) [16:54:05] (03PS1) 10Volans: CHANGELOG: add changelogs for release v8.15.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1085437 [16:54:18] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v8.15.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1085437 (owner: 10Volans) [16:55:11] 07sre-alert-triage, 06SRE Observability: Alert in need of triage: ProbeDown (instance centrallog1002:6514) - https://phabricator.wikimedia.org/T359293#10281872 (10herron) >>! In T359293#10280713, @gerritbot wrote: > Change #1084199 **merged** by Herron: > %%%[operations/puppet@production] profile::syslog::cent... [16:55:22] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1085436 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [16:55:59] !log Bootstrapping Cassandra/aqs1022-a — T378725 [16:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:36] FIRING: [8x] ProbeDown: Service aqs1022-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:57:42] !log set mgr mgr/prometheus/scrape_interval 15.0 in both apus clusters [16:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:05] bd808: Time to do the Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241031T1700). [17:00:05] swfrench-wmf: May I have your attention please! MediaWiki infrastructure (UTC late). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241031T1700) [17:00:26] nothing to do in my window today [17:00:36] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1020.eqiad.wmnet with OS bullseye [17:00:52] here, and will start work shortly [17:01:23] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-presto1020.eqiad.wmnet with OS bullseye [17:02:10] dancy: any concerns about deploying with scap following https://gerrit.wikimedia.org/r/c/operations/puppet/+/1085431 being merged? [17:02:28] No concerns [17:02:37] dancy: ack, thanks! [17:03:32] (03CR) 10Scott French: [C:03+2] mediawiki: parameterize PHP version via chart value [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071957 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [17:04:23] (03PS1) 10Majavah: Drop unused passwords::wikitech [labs/private] - 10https://gerrit.wikimedia.org/r/1085441 [17:04:23] (03PS1) 10Majavah: Drop unused Wikitech settings [labs/private] - 10https://gerrit.wikimedia.org/r/1085442 [17:04:25] (03PS1) 10Majavah: Drop bunch of unused Hiera [labs/private] - 10https://gerrit.wikimedia.org/r/1085443 [17:05:17] (03Merged) 10jenkins-bot: mediawiki: parameterize PHP version via chart value [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071957 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [17:06:51] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v8.15.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1085437 (owner: 10Volans) [17:07:36] FIRING: JobUnavailable: Reduced availability for job haproxykafka in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:07:41] (03PS2) 10Scardenasmolinar: [WIP]Enable AutoModerator on viwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1084883 (https://phabricator.wikimedia.org/T378343) [17:07:48] (03PS3) 10Scardenasmolinar: Enable AutoModerator on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084883 (https://phabricator.wikimedia.org/T378343) [17:08:13] (03PS3) 10FNegri: WMCS: split cloudvirt alerts from generic nodes [alerts] - 10https://gerrit.wikimedia.org/r/1084782 (https://phabricator.wikimedia.org/T375479) [17:09:14] (03PS1) 10Volans: Upstream release v8.15.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1085445 [17:09:26] (03CR) 10Volans: [C:03+2] Upstream release v8.15.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1085445 (owner: 10Volans) [17:09:47] (03CR) 10CI reject: [V:04-1] WMCS: split cloudvirt alerts from generic nodes [alerts] - 10https://gerrit.wikimedia.org/r/1084782 (https://phabricator.wikimedia.org/T375479) (owner: 10FNegri) [17:09:59] (03CR) 10Ssingh: [C:03+1] "PCC failure is because of incorrect hostname for cp4037 but looks OK otherwise." 
[puppet] - 10https://gerrit.wikimedia.org/r/1085436 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [17:11:08] !log swfrench@deploy2002 Started scap sync-world: Deployment to pick up PHP version parameterization - T372604 T377040 [17:11:20] (03PS1) 10Ssingh: P:cache::varnish::frontend: add check for minimum frontend cache size [puppet] - 10https://gerrit.wikimedia.org/r/1085446 (https://phabricator.wikimedia.org/T378724) [17:11:29] T372604: Turn up PHP 8.1-flavored mw-debug k8s deployment - https://phabricator.wikimedia.org/T372604 [17:11:29] T377040: Turn up PHP 8.1-flavored k8s deployments for all MediaWiki services - https://phabricator.wikimedia.org/T377040 [17:12:32] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4437/console" [puppet] - 10https://gerrit.wikimedia.org/r/1085446 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [17:13:00] !log swfrench@deploy2002 Finished scap sync-world: Deployment to pick up PHP version parameterization - T372604 T377040 (duration: 01m 52s) [17:14:38] (03PS2) 10Ssingh: P:cache::varnish::frontend: add check for minimum frontend cache size [puppet] - 10https://gerrit.wikimedia.org/r/1085446 (https://phabricator.wikimedia.org/T378724) [17:16:21] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db1216.eqiad.wmnet with reason: Maintenance [17:16:23] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db1216.eqiad.wmnet with reason: Maintenance [17:17:33] alright, I am done with the infrastructure window [17:18:03] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [17:18:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [17:18:24] !log 
ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T376905)', diff saved to https://phabricator.wikimedia.org/P70774 and previous config saved to /var/cache/conftool/dbconfig/20241031-171824-ladsgroup.json [17:20:33] (03Merged) 10jenkins-bot: Upstream release v8.15.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1085445 (owner: 10Volans) [17:25:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377942#10282095 (10phaultfinder) [17:26:11] !log uploaded spicerack_8.15.2 to apt.wikimedia.org bullseye-wikimedia [17:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T376905)', diff saved to https://phabricator.wikimedia.org/P70775 and previous config saved to /var/cache/conftool/dbconfig/20241031-172637-ladsgroup.json [17:27:58] (03PS1) 10Ottomata: Add airflow connection conf for datahub [puppet] - 10https://gerrit.wikimedia.org/r/1085449 (https://phabricator.wikimedia.org/T306896) [17:28:53] (03CR) 10Ottomata: Add airflow connection conf for datahub (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1085449 (https://phabricator.wikimedia.org/T306896) (owner: 10Ottomata) [17:29:14] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1085450 [17:29:45] (03CR) 10Fabfur: [C:03+2] haproxykafka: fixed default permissions for socket [puppet] - 10https://gerrit.wikimedia.org/r/1085436 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [17:33:56] RESOLVED: JobUnavailable: Reduced availability for job haproxykafka in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:35:10] (03PS2) 10Ebernhardson: WIP: Migrate package to opensearch [software/opensearch/plugins] - 
10https://gerrit.wikimedia.org/r/1080749 (https://phabricator.wikimedia.org/T372769) [17:38:56] (03PS4) 10Scott French: gateway-check: fix invalid config handling [puppet] - 10https://gerrit.wikimedia.org/r/1084247 [17:38:56] (03CR) 10Scott French: "Happened to notice this after doing the same thing in Idbed74342ba7240cd00364674ae739b82d9cd3d9 and then finding it didn't work as expecte" [puppet] - 10https://gerrit.wikimedia.org/r/1084247 (owner: 10Scott French) [17:40:58] (03PS3) 10Herron: rsyslog::receiver: add hostname and fqdn to certificate names [puppet] - 10https://gerrit.wikimedia.org/r/1085450 (https://phabricator.wikimedia.org/T359293) [17:41:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P70776 and previous config saved to /var/cache/conftool/dbconfig/20241031-174144-ladsgroup.json [17:42:36] (03CR) 10Ladsgroup: "Volans: We don't use s11 anywhere in dbctl, right? I can't think of any." [puppet] - 10https://gerrit.wikimedia.org/r/1083586 (https://phabricator.wikimedia.org/T378260) (owner: 10Majavah) [17:51:49] FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [17:53:27] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1085450 (https://phabricator.wikimedia.org/T359293) (owner: 10Herron) [17:55:56] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 294570704 and 62 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:56:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P70777 and previous config saved to /var/cache/conftool/dbconfig/20241031-175651-ladsgroup.json [17:56:56] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 960 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:57:02] (03PS2) 10Anzx: tcywikisource: fix typo of author namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085454 (https://phabricator.wikimedia.org/T378555) [17:57:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 31 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085454 (https://phabricator.wikimedia.org/T378555) (owner: 10Anzx) [17:59:32] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1083586 (https://phabricator.wikimedia.org/T378260) (owner: 10Majavah) [18:00:04] dduvall and dancy: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241031T1800). 
[18:02:50] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers kubernetes2046.codfw.wmnet, wikikube-worker2021.codfw.wmnet, parse2017.codfw.wmnet, wikikube-worker2120.codfw.wmnet, kubernetes2056.codfw.wmnet, wikikube-worker2063.codfw.wmnet, parse2006.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2017.codfw.wmnet, wikikube-worker2026.codfw.wmnet, wikikube-worker2036.cod [18:02:50] , mw2338.codfw.wmnet, mw2447.codfw.wmnet, mw2370.codfw.wmnet, wikikube-worker2084.codfw.wmnet, wikikube-worker2077.codfw.wmnet, wikikube-worker2113.codfw.wmnet, mw2443.codfw.wmnet, wikikube-worker2091.codfw.wmnet, wikikube-worker2076.codfw.wmnet, parse2018.codfw.wmnet, wikikube-worker2083.codfw.wmnet, wikikube-worker2071.codfw.wmnet, wikikube-worker2031.codfw.wmnet, kubernetes2042.codfw.wmnet, mw2351.codfw.wmnet, wikikube-worker2092.codfw [18:02:50] wikikube-worker2022.codfw.wmnet, wikikube-worker2086.codfw.wmnet, wikikube-worker2030.codfw.wmnet, wikikube-worker2052.codfw.wmnet, wikikube-worker2096.codfw.wmnet, wikikube-worker2097. 
https://wikitech.wikimedia.org/wiki/PyBal [18:02:50] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers kubernetes2046.codfw.wmnet, wikikube-worker2021.codfw.wmnet, wikikube-worker2033.codfw.wmnet, wikikube-worker2120.codfw.wmnet, kubernetes2056.codfw.wmnet, parse2006.codfw.wmnet, parse2010.codfw.wmnet, wikikube-worker2017.codfw.wmnet, mw2375.codfw.wmnet, wikikube-worker2026.codfw.wmnet, kubernetes2024.codfw.wmnet, wikikube [18:02:50] 036.codfw.wmnet, parse2009.codfw.wmnet, mw2370.codfw.wmnet, wikikube-worker2084.codfw.wmnet, wikikube-worker2099.codfw.wmnet, kubernetes2014.codfw.wmnet, wikikube-worker2091.codfw.wmnet, wikikube-worker2040.codfw.wmnet, parse2018.codfw.wmnet, wikikube-worker2083.codfw.wmnet, parse2004.codfw.wmnet, wikikube-worker2044.codfw.wmnet, mw2351.codfw.wmnet, wikikube-worker2022.codfw.wmnet, wikikube-worker2086.codfw.wmnet, parse2020.codfw.wmnet, w [18:02:51] worker2027.codfw.wmnet, wikikube-worker2030.codfw.wmnet, kubernetes2042.codfw.wmnet, wikikube-worker2043.codfw.wmnet, kubernetes2006.codfw.wmnet, wikikube-worker2097.codfw.wmnet, wikiku https://wikitech.wikimedia.org/wiki/PyBal [18:02:57] FIRING: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:03:33] o_O [18:03:37] that's a new one [18:03:38] !incidents [18:03:38] 5361 (UNACKED) ProbeDown sre (10.2.1.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 codfw) [18:03:38] 5360 (RESOLVED) Host db2190 (paged) - PING - Packet loss = 100% [18:03:56] FIRING: [8x] ProbeDown: Service aqs1022-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:04:02] um ... 
that might be the result of the recent migration of video encoding for commons [18:04:19] !ack 5361 [18:04:20] 5361 (ACKED) ProbeDown sre (10.2.1.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 codfw) [18:04:30] looking [18:04:41] running out of replicas [18:05:11] gonna do a dirty bump just to get this cleared [18:05:28] thanks, hnowlan! [18:05:53] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [18:06:18] Looks like https://wikitech.wikimedia.org/wiki/Shellbox#Shellboxes doesn't have shellbox-video listed - intentional or needs updating? [18:07:03] needs updating, it's a relatively new one [18:07:09] (03PS3) 10Anzx: tcywikisource: fix typo of author namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085454 (https://phabricator.wikimedia.org/T378555) [18:07:36] FIRING: [8x] ProbeDown: Service aqs1022-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:07:50] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:07:51] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:07:57] RESOLVED: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:08:37] (03PS1) 10Ssingh: geo-maps: switch CN to to eqsin (from ulsfo) [dns] - 10https://gerrit.wikimedia.org/r/1085456 (https://phabricator.wikimedia.org/T378744) [18:08:43] need to bump limits before I can bump pods [18:10:02] (03PS1) 10Hnowlan: Bump shellbox-video limits, pods, reduce webvideotranscode concurrency [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1085457 [18:10:27] swfrench-wmf: could you give that ^ a look please if you have a sec? [18:10:48] webvideotranscodeprioritized is quieter so we don't need to limit concurrency there yet [18:11:35] (03CR) 10Scott French: [C:03+1] Bump shellbox-video limits, pods, reduce webvideotranscode concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085457 (owner: 10Hnowlan) [18:11:50] jouncebot: nowandnext [18:11:50] For the next 1 hour(s) and 48 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241031T1800) [18:11:51] In 1 hour(s) and 48 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241031T2000) [18:11:53] hnowlan: looks good! [18:11:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T376905)', diff saved to https://phabricator.wikimedia.org/P70778 and previous config saved to /var/cache/conftool/dbconfig/20241031-181158-ladsgroup.json [18:12:05] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [18:12:19] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [18:12:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T376905)', diff saved to https://phabricator.wikimedia.org/P70779 and previous config saved to /var/cache/conftool/dbconfig/20241031-181225-ladsgroup.json [18:14:12] (03CR) 10Hnowlan: [C:03+2] Bump shellbox-video limits, pods, reduce webvideotranscode concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085457 (owner: 10Hnowlan) [18:16:39] (03CR) 10Herron: [V:03+1 C:03+2] rsyslog::receiver: add hostname and fqdn to certificate names [puppet] - 10https://gerrit.wikimedia.org/r/1085450 (https://phabricator.wikimedia.org/T359293) (owner: 10Herron) [18:17:40] PROBLEM 
- BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:17:41] (03Merged) 10jenkins-bot: Bump shellbox-video limits, pods, reduce webvideotranscode concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085457 (owner: 10Hnowlan) [18:18:16] hnowlan: ok to run the train? [18:18:32] dduvall: could you wait 5m? [18:18:39] sure no problem [18:18:42] thanks! [18:20:26] (03CR) 10BBlack: [C:03+1] "LGTM - https://puppet-compiler.wmflabs.org/output/1085446/4441/" [puppet] - 10https://gerrit.wikimedia.org/r/1085446 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [18:20:56] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1063906880 and 84 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:21:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T376905)', diff saved to https://phabricator.wikimedia.org/P70780 and previous config saved to /var/cache/conftool/dbconfig/20241031-182101-ladsgroup.json [18:21:17] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [18:22:11] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [18:22:36] FIRING: [7x] ProbeDown: Service aqs1022-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:22:37] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[18:22:47] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [18:23:00] 07sre-alert-triage, 06SRE Observability, 13Patch-For-Review: Alert in need of triage: ProbeDown (instance centrallog1002:6514) - https://phabricator.wikimedia.org/T359293#10282278 (10herron) 05Open→03Resolved a:03herron ` Oct 31 18:19:40 prometheus1005 prometheus-blackbox-exporter[2309163]: ts=2024... [18:23:01] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [18:23:08] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: ProbeDown (instance centrallog2002:6514) - https://phabricator.wikimedia.org/T377703#10282282 (10herron) 05Open→03Resolved a:03herron (Fixed in T377703) [18:23:24] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [18:23:41] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [18:23:56] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 59896 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:24:01] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1085459 [18:24:01] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1085459 (owner: 10TrainBranchBot) [18:24:27] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [18:24:59] (03CR) 10BCornwall: [C:03+1] geo-maps: switch CN to to eqsin (from ulsfo) [dns] - 10https://gerrit.wikimedia.org/r/1085456 (https://phabricator.wikimedia.org/T378744) (owner: 10Ssingh) [18:26:06] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [18:26:58] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [18:27:18] (03PS3) 10Ssingh: P:cache::varnish::frontend: add check for minimum frontend 
cache size [puppet] - 10https://gerrit.wikimedia.org/r/1085446 (https://phabricator.wikimedia.org/T378724) [18:28:15] (03PS2) 10Ottomata: Add airflow connection conf for datahub [puppet] - 10https://gerrit.wikimedia.org/r/1085449 (https://phabricator.wikimedia.org/T306896) [18:28:48] (03CR) 10Ottomata: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4442/console" [puppet] - 10https://gerrit.wikimedia.org/r/1085449 (https://phabricator.wikimedia.org/T306896) (owner: 10Ottomata) [18:32:11] 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Evaluate hw-raid controllers for Supermicro's Config J - https://phabricator.wikimedia.org/T378584#10282304 (10wiki_willy) Met with the Supermicro team today, who believes the RAID kit should be approved either today or tomorr... [18:33:10] (03CR) 10Ssingh: "Do not merge before week of Nov 4." [dns] - 10https://gerrit.wikimedia.org/r/1085456 (https://phabricator.wikimedia.org/T378744) (owner: 10Ssingh) [18:33:22] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:33:26] dduvall: sorry, this is more complicated than we thought [18:33:26] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:34:05] hnowlan: need a hand? 
[18:34:21] cdanis: wouldn't hurt :) I'll redirect to -sre [18:34:57] (03CR) 10Ladsgroup: [C:03+1] Drop support for s11 MariaDB section [puppet] - 10https://gerrit.wikimedia.org/r/1083586 (https://phabricator.wikimedia.org/T378260) (owner: 10Majavah) [18:36:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P70781 and previous config saved to /var/cache/conftool/dbconfig/20241031-183608-ladsgroup.json [18:47:11] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [18:47:20] (03PS1) 10Hnowlan: TimedMediaHandler: revert commonswiki changes due to capacity issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085464 [18:48:38] (03CR) 10Scott French: [C:03+1] TimedMediaHandler: revert commonswiki changes due to capacity issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085464 (owner: 10Hnowlan) [18:48:41] (03CR) 10CDanis: [C:03+1] TimedMediaHandler: revert commonswiki changes due to capacity issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085464 (owner: 10Hnowlan) [18:49:03] (03CR) 10Ladsgroup: [C:04-1] "I suggest waiting for Manuel to come back before merging this. 
I know it's handy but at the same time, it increases attack vector (think o" [puppet] - 10https://gerrit.wikimedia.org/r/1084730 (owner: 10Arnaudb) [18:51:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P70782 and previous config saved to /var/cache/conftool/dbconfig/20241031-185115-ladsgroup.json [18:53:49] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1085459 (owner: 10TrainBranchBot) [18:55:04] jouncebot: now [18:55:04] For the next 1 hour(s) and 4 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241031T1800) [18:55:19] swfrench-wmf: dduvall has been holding [18:55:29] dduvall: we're going to backport a revert of an earlier change, at which point we'll get out of your way [18:55:41] sounds good [18:57:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085464 (owner: 10Hnowlan) [18:57:53] (03Merged) 10jenkins-bot: TimedMediaHandler: revert commonswiki changes due to capacity issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085464 (owner: 10Hnowlan) [18:57:58] (03PS1) 10Fabfur: hiera: fix haproxykafka workers number [puppet] - 10https://gerrit.wikimedia.org/r/1085465 (https://phabricator.wikimedia.org/T377614) [18:58:27] (03CR) 10Brouberol: "Looks good! 
I'll let you update the airflow-analytics-test config block" [puppet] - 10https://gerrit.wikimedia.org/r/1085449 (https://phabricator.wikimedia.org/T306896) (owner: 10Ottomata) [18:58:39] !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1085464|TimedMediaHandler: revert commonswiki changes due to capacity issues]] [18:59:47] FIRING: HelmReleaseBadStatus: Helm release shellbox-video/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=shellbox-video - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [19:00:59] (03CR) 10Ssingh: [C:03+2] P:cache::varnish::frontend: add check for minimum frontend cache size [puppet] - 10https://gerrit.wikimedia.org/r/1085446 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [19:01:07] !log swfrench@deploy2002 swfrench, hnowlan: Backport for [[gerrit:1085464|TimedMediaHandler: revert commonswiki changes due to capacity issues]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:01:36] !log swfrench@deploy2002 swfrench, hnowlan: Continuing with sync [19:02:39] (03CR) 10Ottomata: [V:03+1] Add airflow connection conf for datahub (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1085449 (https://phabricator.wikimedia.org/T306896) (owner: 10Ottomata) [19:06:17] !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1085464|TimedMediaHandler: revert commonswiki changes due to capacity issues]] (duration: 07m 38s) [19:06:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T376905)', diff saved to https://phabricator.wikimedia.org/P70783 and previous config saved to /var/cache/conftool/dbconfig/20241031-190622-ladsgroup.json [19:06:28] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: 
Maintenance [19:06:42] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [19:06:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2161 (T376905)', diff saved to https://phabricator.wikimedia.org/P70784 and previous config saved to /var/cache/conftool/dbconfig/20241031-190648-ladsgroup.json [19:06:56] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:07:12] dduvall: we should be out of your way now [19:07:18] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:07:26] swfrench-wmf: excellent. thank you [19:07:48] (03PS1) 10Aude: Helm chart for the chart-renderer service - WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085467 (https://phabricator.wikimedia.org/T376948) [19:08:09] (03PS1) 10TrainBranchBot: group2 to 1.44.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085468 (https://phabricator.wikimedia.org/T375660) [19:08:10] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.44.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085468 (https://phabricator.wikimedia.org/T375660) (owner: 10TrainBranchBot) [19:08:33] (03CR) 10CI reject: [V:04-1] Helm chart for the chart-renderer service - WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085467 (https://phabricator.wikimedia.org/T376948) (owner: 10Aude) [19:08:52] (03Merged) 10jenkins-bot: group2 to 1.44.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085468 (https://phabricator.wikimedia.org/T375660) (owner: 10TrainBranchBot) [19:10:38] 06SRE, 06Traffic, 13Patch-For-Review: Create 
provisioning and post-provisioning checks for Traffic hosts to confirm validity of varying hardware configurations - https://phabricator.wikimedia.org/T378724#10282438 (10ssingh) a:03CDobbins [19:10:57] here comes the halloween choo choo https://finalfantasy.fandom.com/wiki/Phantom_Train_(Final_Fantasy_VI_boss) [19:12:56] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:13:18] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:15:45] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.1 refs T375660 [19:15:59] T375660: 1.44.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T375660 [19:16:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T376905)', diff saved to https://phabricator.wikimedia.org/P70785 and previous config saved to /var/cache/conftool/dbconfig/20241031-191626-ladsgroup.json [19:29:31] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2190 is not coming back online - https://phabricator.wikimedia.org/T378628#10282487 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm I powered off and drained the flea power. updated the bios and idrac firmware. It's currently up. Please let us know if it g... 
[19:31:30] (03PS1) 10BryanDavis: [WIP] Allow provisioning MediaWiki with PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) [19:31:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P70786 and previous config saved to /var/cache/conftool/dbconfig/20241031-193133-ladsgroup.json [19:32:06] (03CR) 10CI reject: [V:04-1] [WIP] Allow provisioning MediaWiki with PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) (owner: 10BryanDavis) [19:34:37] (03CR) 10BryanDavis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) (owner: 10BryanDavis) [19:46:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P70787 and previous config saved to /var/cache/conftool/dbconfig/20241031-194640-ladsgroup.json [19:58:59] !log dancy@deploy2002 Installing scap version "4.119.2" for 210 hosts [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241031T2000). [20:00:05] JSherman and anzx: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:48] I'm upgrading scap at the moment. It should be done in a few minutes. 
[20:00:57] dancy: ack [20:01:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T376905)', diff saved to https://phabricator.wikimedia.org/P70788 and previous config saved to /var/cache/conftool/dbconfig/20241031-200148-ladsgroup.json [20:01:54] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [20:02:07] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [20:02:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2162 (T376905)', diff saved to https://phabricator.wikimedia.org/P70789 and previous config saved to /var/cache/conftool/dbconfig/20241031-200214-ladsgroup.json [20:03:52] !log dancy@deploy2002 Installation of scap version "4.119.2" completed for 210 hosts [20:04:28] JSherman: Mind if I deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1085454 first? 
[20:04:42] dancy: go for it [20:06:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085454 (https://phabricator.wikimedia.org/T378555) (owner: 10Anzx) [20:07:08] anzx: Checking to see if you're around [20:07:30] (03Merged) 10jenkins-bot: tcywikisource: fix typo of author namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085454 (https://phabricator.wikimedia.org/T378555) (owner: 10Anzx) [20:07:48] !log dancy@deploy2002 Started scap sync-world: Backport for [[gerrit:1085454|tcywikisource: fix typo of author namespace (T378555)]] [20:08:06] T378555: Add logos, namespaces, SITENAME and timezone for Tulu Wikisource - https://phabricator.wikimedia.org/T378555 [20:10:04] !log dancy@deploy2002 dancy, anzx: Backport for [[gerrit:1085454|tcywikisource: fix typo of author namespace (T378555)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:10:37] dancy: are you volunteering to do the backport after the scap upgrade? [20:10:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T376905)', diff saved to https://phabricator.wikimedia.org/P70790 and previous config saved to /var/cache/conftool/dbconfig/20241031-201042-ladsgroup.json [20:10:53] !log dancy@deploy2002 dancy, anzx: Continuing with sync [20:11:36] Actually I need to hand it back to you after this mediawiki-config is done. I just wanted to test to make sure everything was still good. [20:11:39] (which it is) [20:12:49] I can self service my backport if that is useful [20:13:01] Looks like JsnSherman's changes would be good to deploy in a single run. It'll probably result in a slow deployment due to l10n rebuild. 
[20:15:34] !log dancy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1085454|tcywikisource: fix typo of author namespace (T378555)]] (duration: 07m 46s) [20:15:50] T378555: Add logos, namespaces, SITENAME and timezone for Tulu Wikisource - https://phabricator.wikimedia.org/T378555 [20:16:05] JSherman: All yours [20:16:15] dancy: thanks! [20:16:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [extensions/AutoModerator] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084889 (https://phabricator.wikimedia.org/T370795) (owner: 10Jsn.sherman) [20:16:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [extensions/AutoModerator] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084891 (https://phabricator.wikimedia.org/T372476) (owner: 10Jsn.sherman) [20:19:16] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2190 is not coming back online - https://phabricator.wikimedia.org/T378628#10282672 (10Ladsgroup) Thank you! 
[20:22:46] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [20:23:18] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [20:24:47] RESOLVED: HelmReleaseBadStatus: Helm release shellbox-video/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=shellbox-video - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [20:25:32] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [20:25:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P70791 and previous config saved to /var/cache/conftool/dbconfig/20241031-202549-ladsgroup.json [20:25:57] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [20:26:41] (03Merged) 10jenkins-bot: Translations for configuration for same-user-same-page reverts in Automoderator [extensions/AutoModerator] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084889 (https://phabricator.wikimedia.org/T370795) (owner: 10Jsn.sherman) [20:27:57] (03Merged) 10jenkins-bot: Add follow-up message [extensions/AutoModerator] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084891 (https://phabricator.wikimedia.org/T372476) (owner: 10Jsn.sherman) [20:28:16] !log jsn@deploy2002 Started scap sync-world: Backport for [[gerrit:1084889|Translations for configuration for same-user-same-page reverts in Automoderator (T370795)]], [[gerrit:1084891|Add follow-up message (T372476)]] [20:28:35] T370795: Implement a limit and configuration for same-user-same-page reverts in Automoderator - https://phabricator.wikimedia.org/T370795 [20:28:36] T372476: Don't send entire new talk page message when user has been reverted multiple times by Automoderator in 
the same month - https://phabricator.wikimedia.org/T372476 [20:40:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P70792 and previous config saved to /var/cache/conftool/dbconfig/20241031-204057-ladsgroup.json [20:46:10] !log jsn@deploy2002 jsn: Backport for [[gerrit:1084889|Translations for configuration for same-user-same-page reverts in Automoderator (T370795)]], [[gerrit:1084891|Add follow-up message (T372476)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:46:14] !log jsn@deploy2002 jsn: Continuing with sync [20:46:16] T370795: Implement a limit and configuration for same-user-same-page reverts in Automoderator - https://phabricator.wikimedia.org/T370795 [20:46:16] T372476: Don't send entire new talk page message when user has been reverted multiple times by Automoderator in the same month - https://phabricator.wikimedia.org/T372476 [20:47:16] just noting that I checked for the presence of the new messages in the patches on the k8s debug endpoint while waiting on the remaining test hosts to finish up [20:53:56] FIRING: [3x] ProbeDown: Service aqs1022-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:55:27] !log jsn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1084889|Translations for configuration for same-user-same-page reverts in Automoderator (T370795)]], [[gerrit:1084891|Add follow-up message (T372476)]] (duration: 27m 10s) [20:55:33] T370795: Implement a limit and configuration for same-user-same-page reverts in Automoderator - https://phabricator.wikimedia.org/T370795 [20:55:33] T372476: Don't send entire new talk page message when user has been reverted multiple times by Automoderator in the same month - https://phabricator.wikimedia.org/T372476 [20:56:04] 
!log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T376905)', diff saved to https://phabricator.wikimedia.org/P70793 and previous config saved to /var/cache/conftool/dbconfig/20241031-205604-ladsgroup.json [20:56:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [20:56:11] okay, verified that the messages are now available on production endpoints. I'm done! [20:56:24] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [20:56:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T376905)', diff saved to https://phabricator.wikimedia.org/P70794 and previous config saved to /var/cache/conftool/dbconfig/20241031-205631-ladsgroup.json [21:05:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T376905)', diff saved to https://phabricator.wikimedia.org/P70795 and previous config saved to /var/cache/conftool/dbconfig/20241031-210504-ladsgroup.json [21:06:43] (03CR) 10Ebernhardson: [C:03+2] cirrus: update link to prom based elasticsearch-percentiles dashboard [alerts] - 10https://gerrit.wikimedia.org/r/1083275 (https://phabricator.wikimedia.org/T371061) (owner: 10DCausse) [21:08:16] (03Merged) 10jenkins-bot: cirrus: update link to prom based elasticsearch-percentiles dashboard [alerts] - 10https://gerrit.wikimedia.org/r/1083275 (https://phabricator.wikimedia.org/T371061) (owner: 10DCausse) [21:18:49] !log dancy@deploy2002 Installing scap version "4.119.3" for 210 hosts [21:18:53] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host an-presto1019.eqiad.wmnet with OS bullseye [21:19:44] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-presto1019.eqiad.wmnet with OS bullseye [21:19:49] (03PS2) 10BryanDavis: [WIP] Allow provisioning MediaWiki with PHP 8.1 [puppet] - 
10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) [21:20:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P70796 and previous config saved to /var/cache/conftool/dbconfig/20241031-212011-ladsgroup.json [21:21:47] (03CR) 10CI reject: [V:04-1] [WIP] Allow provisioning MediaWiki with PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) (owner: 10BryanDavis) [21:22:18] !log Bootstrapping Cassandra/aqs1022-b — T378725 [21:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:24] T378725: Refresh aqs1013 w/ aqs1022 - https://phabricator.wikimedia.org/T378725 [21:22:52] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-presto1019.eqiad.wmnet'] [21:22:58] (03CR) 10Cwhite: [C:03+2] "PCC NOOP" [puppet] - 10https://gerrit.wikimedia.org/r/1085397 (https://phabricator.wikimedia.org/T342476) (owner: 10Cwhite) [21:23:56] FIRING: [2x] ProbeDown: Service aqs1022-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:28:41] (03CR) 10Cwhite: [C:03+2] beta-logs: set phatality version to 2.7.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/1085409 (https://phabricator.wikimedia.org/T342476) (owner: 10Cwhite) [21:35:08] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-presto1019.eqiad.wmnet'] [21:35:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P70797 and previous config saved to /var/cache/conftool/dbconfig/20241031-213518-ladsgroup.json [21:35:57] !log bking@cumin2002 START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-presto1019.eqiad.wmnet'] [21:37:09] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-presto1019.eqiad.wmnet'] [21:37:16] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-presto1019.eqiad.wmnet'] [21:40:23] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-presto1019.eqiad.wmnet'] [21:40:56] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host an-presto1019.eqiad.wmnet with OS bullseye [21:49:32] (03PS3) 10BryanDavis: [WIP] Allow provisioning MediaWiki with PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) [21:50:05] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host an-presto1019.eqiad.wmnet with OS bullseye [21:50:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T376905)', diff saved to https://phabricator.wikimedia.org/P70798 and previous config saved to /var/cache/conftool/dbconfig/20241031-215025-ladsgroup.json [21:50:32] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [21:50:45] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [21:50:47] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [21:50:49] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [21:50:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T376905)', diff saved to https://phabricator.wikimedia.org/P70799 and previous config saved to 
/var/cache/conftool/dbconfig/20241031-215056-ladsgroup.json [21:51:49] FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [21:59:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T376905)', diff saved to https://phabricator.wikimedia.org/P70800 and previous config saved to /var/cache/conftool/dbconfig/20241031-215925-ladsgroup.json [22:14:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P70801 and previous config saved to /var/cache/conftool/dbconfig/20241031-221432-ladsgroup.json [22:15:06] (03PS4) 10BryanDavis: [WIP] Allow provisioning MediaWiki with PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) [22:21:17] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host an-presto1019.eqiad.wmnet with OS bullseye [22:28:13] (03PS1) 10Cwhite: opensearch_dashboards: package provider must remove before install [puppet] - 10https://gerrit.wikimedia.org/r/1085486 (https://phabricator.wikimedia.org/T342476) [22:29:00] (03CR) 10CI reject: [V:04-1] opensearch_dashboards: package provider must remove before install [puppet] - 10https://gerrit.wikimedia.org/r/1085486 (https://phabricator.wikimedia.org/T342476) (owner: 10Cwhite) [22:29:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P70802 and previous config saved to /var/cache/conftool/dbconfig/20241031-222939-ladsgroup.json [22:29:53] !log dancy@deploy2002 Installing scap version "4.119.4" for 1 hosts [22:30:45] !log dancy@deploy2002 Installation of scap version "4.119.4" completed for 1 hosts [22:37:18] (03PS1) 10Ahmon Dancy: 
Dummy commit for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085487 [22:39:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085487 (owner: 10Ahmon Dancy) [22:40:18] (03Merged) 10jenkins-bot: Dummy commit for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085487 (owner: 10Ahmon Dancy) [22:40:37] !log dancy@deploy2002 Started scap sync-world: Backport for [[gerrit:1085487|Dummy commit for testing]] [22:43:05] !log dancy@deploy2002 dancy: Backport for [[gerrit:1085487|Dummy commit for testing]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:43:22] !log dancy@deploy2002 dancy: Continuing with sync [22:43:53] (03PS5) 10BryanDavis: [WIP] Allow provisioning MediaWiki with PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) [22:44:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T376905)', diff saved to https://phabricator.wikimedia.org/P70803 and previous config saved to /var/cache/conftool/dbconfig/20241031-224446-ladsgroup.json [22:44:52] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [22:45:06] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [22:45:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2166 (T376905)', diff saved to https://phabricator.wikimedia.org/P70804 and previous config saved to /var/cache/conftool/dbconfig/20241031-224513-ladsgroup.json [22:45:59] (03PS6) 10BryanDavis: [WIP] Allow provisioning MediaWiki with PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) [22:46:00] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host 
an-presto1019.eqiad.wmnet with OS bullseye [22:48:06] !log dancy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1085487|Dummy commit for testing]] (duration: 07m 28s) [22:54:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T376905)', diff saved to https://phabricator.wikimedia.org/P70805 and previous config saved to /var/cache/conftool/dbconfig/20241031-225442-ladsgroup.json [23:04:52] (03PS2) 10Aude: Helm chart for the chart-renderer service - WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085467 (https://phabricator.wikimedia.org/T376948) [23:05:37] (03PS3) 10Aude: Helm chart for the chart-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085467 (https://phabricator.wikimedia.org/T376948) [23:07:46] (03PS4) 10Aude: Helm chart for the chart-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085467 (https://phabricator.wikimedia.org/T376948) [23:09:12] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:09:14] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 215, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:09:24] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:09:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P70806 and previous config saved to /var/cache/conftool/dbconfig/20241031-230949-ladsgroup.json [23:10:47] (03PS5) 10Aude: Helm chart for the chart-renderer service - WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085467 (https://phabricator.wikimedia.org/T376948) 
[23:12:59] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-presto1019.eqiad.wmnet'] [23:13:41] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-presto1019.eqiad.wmnet'] [23:15:14] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:15:14] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:15:24] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-presto1019.eqiad.wmnet'] [23:15:24] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:24:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P70807 and previous config saved to /var/cache/conftool/dbconfig/20241031-232456-ladsgroup.json [23:25:49] (03PS2) 10Scott French: mediawiki: ensure default php.version is a string [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085491 (https://phabricator.wikimedia.org/T372604) [23:26:39] (03CR) 10RLazarus: [C:03+1] mediawiki: ensure default php.version is a string [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085491 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [23:27:37] jouncebot: nowandnext [23:27:37] No deployments scheduled for the next 6 hour(s) and 32 minute(s) [23:27:37] In 6 hour(s) and 32 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241101T0600) [23:29:20] (03CR) 10Scott French: [C:03+2] 
mediawiki: ensure default php.version is a string [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085491 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [23:30:09] FYI, I'll be deploying a helm-only change shortly to clear a noop chart diff [23:31:24] (03PS1) 10Jdlrobson: Enable Chart progressive enhancement on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085492 (https://phabricator.wikimedia.org/T378206) [23:31:44] (03Merged) 10jenkins-bot: mediawiki: ensure default php.version is a string [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085491 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [23:35:38] !log swfrench@deploy2002 Started scap sync-world: Deployment to clear noop chart diff from 1085491 - T372604 T377040 [23:35:44] T372604: Turn up PHP 8.1-flavored mw-debug k8s deployment - https://phabricator.wikimedia.org/T372604 [23:35:45] T377040: Turn up PHP 8.1-flavored k8s deployments for all MediaWiki services - https://phabricator.wikimedia.org/T377040 [23:36:20] (03PS7) 10BryanDavis: [WIP] Allow provisioning MediaWiki with PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) [23:37:28] !log swfrench@deploy2002 Finished scap sync-world: Deployment to clear noop chart diff from 1085491 - T372604 T377040 (duration: 01m 49s) [23:38:42] all done [23:40:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T376905)', diff saved to https://phabricator.wikimedia.org/P70808 and previous config saved to /var/cache/conftool/dbconfig/20241031-234003-ladsgroup.json [23:40:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [23:40:24] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [23:40:31] !log ladsgroup@cumin1002 dbctl commit 
(dc=all): 'Depooling db2167 (T376905)', diff saved to https://phabricator.wikimedia.org/P70809 and previous config saved to /var/cache/conftool/dbconfig/20241031-234030-ladsgroup.json [23:41:15] !log Run extensions/Flow/maintenance/FlowMoveBoardsToSubpages.php for several wikis (T376749; wiki list is on task) [23:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:19] T376749: Run Flow migration script at Phase 0 wikis - https://phabricator.wikimedia.org/T376749 [23:50:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T376905)', diff saved to https://phabricator.wikimedia.org/P70827 and previous config saved to /var/cache/conftool/dbconfig/20241031-234959-ladsgroup.json [23:58:39] (03CR) 10BryanDavis: "Cherry-picked to deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud for work-in-progress testing." [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) (owner: 10BryanDavis)