[00:00:49] (03CR) 10Dzahn: [C: 03+2] "parameter 'ip_families' index 0 expects a match for Enum['ip4', 'ip6'], got 'ipv4'" [puppet] - 10https://gerrit.wikimedia.org/r/884396 (https://phabricator.wikimedia.org/T327974) (owner: 10Dzahn) [00:02:06] (03PS1) 10Dzahn: etherpad: fix ip_family name, ip4 not ipv4 [puppet] - 10https://gerrit.wikimedia.org/r/885057 (https://phabricator.wikimedia.org/T327974) [00:02:26] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [00:02:27] (03CR) 10Dzahn: [C: 03+2] etherpad: fix ip_family name, ip4 not ipv4 [puppet] - 10https://gerrit.wikimedia.org/r/885057 (https://phabricator.wikimedia.org/T327974) (owner: 10Dzahn) [00:04:30] (03PS1) 10Zabe: Set 'groupLoadsBySection' for s11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885058 (https://phabricator.wikimedia.org/T326980) [00:06:10] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5027.eqsin.wmnet with reason: host reimage [00:06:53] (03PS2) 10Zabe: Set 'groupLoadsBySection' for s11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885058 (https://phabricator.wikimedia.org/T326980) [00:09:15] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5027.eqsin.wmnet with reason: host reimage [00:11:22] PROBLEM - MariaDB Replica Lag: s3 on db1102 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 676.71 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:14:27] !log etherpad - maintenance downtime for about 5 minutes to test monitoring [00:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:18] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1003 is CRITICAL: connect to address 10.64.32.181 and port 9001: Connection 
refused https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org [00:18:06] well, yea, that's the icinga alert [00:18:14] but I want to know if the alertmanager alert works [00:18:39] and I dont see any of that [00:19:05] and unlike icinga you cant actually see alerts that are not alerting.. so ...dont know how to confirm it works [00:19:13] or doesnt work [00:19:48] I do see an etherpad alert at the top of https://alerts.wikimedia.org/ [00:20:15] yea, but that's the wrong team [00:20:21] not the one I added ..hmm [00:20:35] and where did that actually alert [00:20:56] wait.. maybe it is that one , heh [00:20:57] oh, there's a ProbeDown alert further down the page with team: serviceops-collab [00:21:17] I had to search by "etherpad" to narrow the field enough to see it, but there it is [00:21:42] rzl: yea, that's the one. and I did get email now. thanks [00:21:53] * mutante turns Etherpad back on :) [00:22:26] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [00:22:34] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1003 is OK: HTTP OK: HTTP/1.1 200 OK - 6448 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org [00:23:33] rzl: now I just need to adjust my "team" actions to make it create tickets :) cool [00:23:47] nice [00:25:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:25:59] you can also easily link to just "all alerts for team X" [00:30:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:38:09] (03PS7) 10Urbanecm: Allow AbuseFilter to block IPs and users on itwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884333 (https://phabricator.wikimedia.org/T328194) (owner: 10Superpes15) [00:38:38] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884333 (https://phabricator.wikimedia.org/T328194) (owner: 10Superpes15) [00:40:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:42:37] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5027.eqsin.wmnet with OS bullseye [00:42:42] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp5027.eqsin.wmnet with OS bullseye completed: - cp5027 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... 
[00:45:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:50:04] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp5027.eqsin.wmnet [00:50:32] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [00:53:59] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) For posterity, the versions of the iDRAC and the NIC firmware that we are looking for for the cp hosts bullseye upgrade and that we pass to the firmware cookbook/upload on the HTTP management interface:... [01:18:50] PROBLEM - MariaDB Replica Lag: m1 on db2160 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 872.65 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:22:24] RECOVERY - MariaDB Replica Lag: m1 on db2160 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:31:57] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3053.esams.wmnet'] [01:32:09] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp3053.esams.wmnet'] [01:35:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:37:55] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3053.esams.wmnet with OS bullseye [01:38:01] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - 
https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp3053.esams.wmnet with OS bullseye [01:40:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:59:02] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3053.esams.wmnet with reason: host reimage [02:02:16] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3053.esams.wmnet with reason: host reimage [02:04:01] 10SRE, 10Wikimedia-Mailing-lists: Upgrade lists.wikimedia.org to next Mailman/hyperkitty/postorius versions - https://phabricator.wikimedia.org/T286217 (10Legoktm) Our current Mailman deployment is a bunch of backported and forked debs with random patches thrown on top based on what we managed to fix upstream.... 
[02:05:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:10:45] (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:14:33] RECOVERY - MariaDB Replica Lag: s3 on db1102 is OK: OK slave_sql_lag Replication lag: 0.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:19:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:20:45] (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:25:45] (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:28:55] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3053.esams.wmnet with OS bullseye [02:29:01] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp3053.esams.wmnet with OS bullseye completed: - cp3053 (**WARN**) - Removed from Puppet and PuppetDB if present -... [02:35:45] (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:43:55] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [02:43:59] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3053.esams.wmnet,service=cdn [02:43:59] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3053.esams.wmnet,service=ats-be [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230131T0300) [03:06:15] (03PS1) 10Krinkle: multiversion: Create dblist-manage command for easy add/delete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885064 (https://phabricator.wikimedia.org/T308932) [03:06:16] (03PS1) 10Krinkle: logos: Exclude logos/index.html from Git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885065 [03:06:19] (03PS1) 10Krinkle: multiversion: Remove getCachableMWConfig in favour of getConfigGlobals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885066 (https://phabricator.wikimedia.org/T308932) [03:06:28] (03CR) 10CI reject: [V: 04-1] multiversion: Remove getCachableMWConfig in favour of getConfigGlobals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885066 (https://phabricator.wikimedia.org/T308932) (owner: 10Krinkle) [03:07:36] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.40.0-wmf.21 [core] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/885010 
(https://phabricator.wikimedia.org/T325584) [03:07:39] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.40.0-wmf.21 [core] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/885010 (https://phabricator.wikimedia.org/T325584) (owner: 10TrainBranchBot) [03:08:16] (03PS2) 10Krinkle: multiversion: Create dblist-manage command for easy add/delete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885064 (https://phabricator.wikimedia.org/T308932) [03:08:18] (03PS2) 10Krinkle: logos: Exclude logos/index.html from Git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885065 [03:08:20] (03PS2) 10Krinkle: multiversion: Remove getCachableMWConfig in favour of getConfigGlobals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885066 (https://phabricator.wikimedia.org/T308932) [03:24:27] (03Merged) 10jenkins-bot: Branch commit for wmf/1.40.0-wmf.21 [core] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/885010 (https://phabricator.wikimedia.org/T325584) (owner: 10TrainBranchBot) [03:25:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:35:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:49:29] RECOVERY - dump of matomo in eqiad on backupmon1001 is OK: Last dump for matomo at eqiad (db1108) taken on 2023-01-31 03:47:03 (281 MiB, +4.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230131T0400) [04:01:23] (03PS1) 10TrainBranchBot: testwikis wikis to 1.40.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885068 (https://phabricator.wikimedia.org/T325584) [04:01:29] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.40.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885068 (https://phabricator.wikimedia.org/T325584) (owner: 10TrainBranchBot) [04:02:02] (03Merged) 10jenkins-bot: testwikis wikis to 1.40.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885068 (https://phabricator.wikimedia.org/T325584) (owner: 10TrainBranchBot) [04:02:30] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.21 refs T325584 [04:02:56] T325584: 1.40.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T325584 [04:20:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:22:26] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [04:30:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:35:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:55:26] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.21 refs T325584 (duration: 52m 56s) [04:56:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance [04:56:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance [04:57:43] !log mwpresync@deploy1002 Pruned MediaWiki: 1.40.0-wmf.19 (duration: 02m 15s) [04:58:26] 10SRE-OnFire, 10Sustainability (Incident Followup): 2023-01-10 eqsin network outage - https://phabricator.wikimedia.org/T328354 (10andrea.denisse) [05:01:35] (03CR) 10Ladsgroup: [C: 03+1] "I will deploy it a bit later today" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885058 (https://phabricator.wikimedia.org/T326980) (owner: 10Zabe) [05:05:37] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 117 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:05:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:10:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:20:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:25:45] (JobUnavailable) firing: (2) Reduced availability for job 
jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:41:11] (03CR) 10Ladsgroup: [C: 03+1] "LGTM https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig-docker/1466/console" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885046 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [06:15:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:19:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:20:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:32:00] (03PS1) 10Muehlenhoff: Apply role::installserver to install2004 [puppet] - 10https://gerrit.wikimedia.org/r/885246 (https://phabricator.wikimedia.org/T327867) [06:45:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:50:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:52:47] !log dbmaint Schema change on s8 eqiad T328373 [06:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:00] !log dbmaint Schema change on s4 eqiad T328373 [06:53:56] !log dbmaint Schema change on s6 eqiad T328373 [06:54:53] !log dbmaint Schema change on s2 eqiad T328373 [06:59:24] !log dbmaint Schema change on s7 eqiad T328373 [06:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230131T0700) [07:00:04] kormat, marostegui, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230131T0700). 
[07:00:05] (03PS1) 10Muehlenhoff: New stub keytabs for the new install servers [labs/private] - 10https://gerrit.wikimedia.org/r/885263 (https://phabricator.wikimedia.org/T327867) [07:00:14] nothing for today [07:02:45] (03PS1) 10Marostegui: db1195: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/885264 (https://phabricator.wikimedia.org/T328253) [07:03:05] !log dbmaint Schema change on s5 eqiad T328373 [07:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:11] (03CR) 10Marostegui: [C: 03+2] db1195: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/885264 (https://phabricator.wikimedia.org/T328253) (owner: 10Marostegui) [07:04:13] T328373: Drop default value from cul_actor on wmf wikis - https://phabricator.wikimedia.org/T328373 [07:06:21] (03PS1) 10Marostegui: mariadb: Promote db1195 to m2 master [puppet] - 10https://gerrit.wikimedia.org/r/885265 (https://phabricator.wikimedia.org/T328253) [07:06:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2133,2160].codfw.wmnet,db[1117,1164,1195].eqiad.wmnet with reason: Primary switchover m2 T328253 [07:06:42] T328253: Switchover m2 master db1164 -> db1195 - https://phabricator.wikimedia.org/T328253 [07:06:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2133,2160].codfw.wmnet,db[1117,1164,1195].eqiad.wmnet with reason: Primary switchover m2 T328253 [07:07:46] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1195 to m2 master [puppet] - 10https://gerrit.wikimedia.org/r/885265 (https://phabricator.wikimedia.org/T328253) (owner: 10Marostegui) [07:10:23] !log Failover m2 from db1164 to db1195 - T328253 [07:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:15:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:16:28] (03PS1) 10Marostegui: db1164: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/885268 (https://phabricator.wikimedia.org/T328402) [07:16:52] (03CR) 10Marostegui: [C: 03+2] db1164: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/885268 (https://phabricator.wikimedia.org/T328402) (owner: 10Marostegui) [07:22:39] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] New stub keytabs for the new install servers [labs/private] - 10https://gerrit.wikimedia.org/r/885263 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [07:22:42] !log dbmaint Schema change on s1 eqiad T328373 [07:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:46] T328373: Drop default value from cul_actor on wmf wikis - https://phabricator.wikimedia.org/T328373 [07:22:49] !log dbmaint Schema change on s3 eqiad T328373 [07:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:47] (03PS1) 10Marostegui: mariadb: Move db1164 to m3 [puppet] - 10https://gerrit.wikimedia.org/r/885269 (https://phabricator.wikimedia.org/T328402) [07:31:30] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1164 to m3 [puppet] - 10https://gerrit.wikimedia.org/r/885269 (https://phabricator.wikimedia.org/T328402) (owner: 10Marostegui) [07:32:04] moritzm: there are pending puppet changes from you [07:35:10] ah, right. forgot about merging the labs-private ones, fixing that now [07:35:17] done [07:37:11] thanks! 
[07:40:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:45:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:50:10] (03CR) 10Giuseppe Lavagetto: sre-mediawiki: add mean latency alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/883502 (https://phabricator.wikimedia.org/T326544) (owner: 10Giuseppe Lavagetto) [07:50:29] (03PS4) 10Giuseppe Lavagetto: sre-mediawiki: add mean latency alerts [alerts] - 10https://gerrit.wikimedia.org/r/883502 (https://phabricator.wikimedia.org/T326544) [07:50:31] (03PS3) 10Giuseppe Lavagetto: sre-mediawiki: port the other prometheus-based alerts [alerts] - 10https://gerrit.wikimedia.org/r/883950 [07:50:33] 10SRE, 10Infrastructure-Foundations, 10LDAP, 10Patch-For-Review: Retire ldap-corp cluster - https://phabricator.wikimedia.org/T323820 (10MoritzMuehlenhoff) I've synched up with ITS, they will shut down the ldap1.corp.wikimedia.org server that we synched against next calendar year. 
[07:55:09] (03CR) 10Giuseppe Lavagetto: [C: 03+2] sre-mediawiki: add mean latency alerts [alerts] - 10https://gerrit.wikimedia.org/r/883502 (https://phabricator.wikimedia.org/T326544) (owner: 10Giuseppe Lavagetto) [07:56:19] (03Merged) 10jenkins-bot: sre-mediawiki: add mean latency alerts [alerts] - 10https://gerrit.wikimedia.org/r/883502 (https://phabricator.wikimedia.org/T326544) (owner: 10Giuseppe Lavagetto) [07:56:38] !log installing bash bugfix updates from Bullseye point release [07:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:44] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm) [08:00:05] Amir1 and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230131T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:00:16] awesome [08:04:47] (03PS11) 10Slyngshede: P:IDM Configure OIDC and LDAP. 
[puppet] - 10https://gerrit.wikimedia.org/r/884881 [08:05:01] (03CR) 10Giuseppe Lavagetto: [C: 03+2] sre-mediawiki: port the other prometheus-based alerts [alerts] - 10https://gerrit.wikimedia.org/r/883950 (owner: 10Giuseppe Lavagetto) [08:05:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:06:08] (03Merged) 10jenkins-bot: sre-mediawiki: port the other prometheus-based alerts [alerts] - 10https://gerrit.wikimedia.org/r/883950 (owner: 10Giuseppe Lavagetto) [08:06:51] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39331/console" [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [08:10:06] (03PS4) 10Phedenskog: Remove unused eventlogging_RUMSpeedIndex stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726854 (https://phabricator.wikimedia.org/T286700) [08:10:45] (JobUnavailable) resolved: Reduced availability for job jmx_puppetdb in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:11:19] (03PS12) 10Slyngshede: P:IDM Configure OIDC and LDAP. 
[puppet] - 10https://gerrit.wikimedia.org/r/884881 [08:13:22] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39332/console" [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [08:15:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:22:17] 10SRE, 10Infrastructure Security, 10observability: Grafana: CVE-2022-39324 CVE-2022-23552 - https://phabricator.wikimedia.org/T328405 (10MoritzMuehlenhoff) [08:22:26] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [08:36:56] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [08:39:09] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[08:41:35] (03CR) 10Muehlenhoff: [C: 03+2] exim: Remove leftovers of ldap-corp setup [puppet] - 10https://gerrit.wikimedia.org/r/884282 (https://phabricator.wikimedia.org/T323820) (owner: 10Muehlenhoff) [08:42:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:43:39] (03PS1) 10Ilias Sarantopoulos: ci: add pre-commit hooks [software/httpbb] - 10https://gerrit.wikimedia.org/r/885273 [08:43:52] 10SRE, 10Infrastructure Security, 10observability: Grafana: CVE-2022-39324 CVE-2022-23552 - https://phabricator.wikimedia.org/T328405 (10fgiunchedi) Upgrading SGTM, I don't see 8.5.16 on apt.grafana.org yet though: https://apt.grafana.com/dists/stable/main/binary-amd64/Packages.gz [08:45:29] !log restore previously removed password for keystore to kafka-logging clusters [08:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:48:49] (03PS1) 10Zabe: Stop writing to cuc_user and cuc_user_text in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885274 (https://phabricator.wikimedia.org/T233004) [08:49:19] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[08:50:13] (03CR) 10Zabe: [C: 03+2] Stop writing to cuc_user and cuc_user_text in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885274 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [08:50:21] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885274 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [08:51:06] (03Merged) 10jenkins-bot: Stop writing to cuc_user and cuc_user_text in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885274 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [08:51:52] !log zabe@deploy1002 Started scap: Backport for [[gerrit:885274|Stop writing to cuc_user and cuc_user_text in testwiki (T233004)]] [08:51:57] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [08:53:55] !log zabe@deploy1002 zabe: Backport for [[gerrit:885274|Stop writing to cuc_user and cuc_user_text in testwiki (T233004)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [08:54:19] !log roll restart kafka on kafka-logging1001 to pick up new pki certs [08:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:32] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [09:00:01] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
[09:00:03] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:885274|Stop writing to cuc_user and cuc_user_text in testwiki (T233004)]] (duration: 08m 11s) [09:00:08] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [09:00:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:01:56] 10SRE, 10Infrastructure Security, 10observability: Grafana: CVE-2022-39324 CVE-2022-23552 - https://phabricator.wikimedia.org/T328405 (10fgiunchedi) Opened an issue with upstream re: apt repo update https://github.com/grafana/grafana/issues/62544 [09:05:11] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [09:06:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:07:19] (03PS2) 10Muehlenhoff: Apply role::installserver to install2004 [puppet] - 10https://gerrit.wikimedia.org/r/885246 (https://phabricator.wikimedia.org/T327867) [09:09:46] 10SRE, 10Traffic, 10Patch-For-Review: Add DP cookie for pageview filtering - https://phabricator.wikimedia.org/T315676 (10Vgutierrez) >>! In T315676#8572237, @Jcross wrote: > Hi @BBlack and @Vgutierrez - could you please provide an update or some guidance around your expected timeline for this? Please let us... 
[09:10:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:10:47] 10SRE, 10serviceops: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Lucas_Werkmeister_WMDE) Good to know, thanks! [09:11:47] 10SRE, 10Wikidata, 10serviceops, 10wdwb-tech: Migrate wikibase/termbox to newer Node.js version - https://phabricator.wikimedia.org/T328295 (10Lucas_Werkmeister_WMDE) [09:11:54] 10SRE, 10Wikidata, 10serviceops, 10wdwb-tech: Migrate wikibase/termbox to newer Node.js version - https://phabricator.wikimedia.org/T328295 (10Lucas_Werkmeister_WMDE) [09:11:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:20:10] (03PS1) 10Marostegui: db2093: Install MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/885278 (https://phabricator.wikimedia.org/T328408) [09:20:43] !log dbmaint Install MariaDB 10.6 on db2093 (db_inventory) T328408 [09:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:48] T328408: Migrate db_inventory section to MariaDB 10.6 - https://phabricator.wikimedia.org/T328408 [09:20:50] (03CR) 10Marostegui: [C: 03+2] db2093: Install MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/885278 (https://phabricator.wikimedia.org/T328408) (owner: 10Marostegui) [09:25:12] (03PS14) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) [09:28:56] (03CR) 10Muehlenhoff: [C: 03+2] Apply 
role::installserver to install2004 [puppet] - 10https://gerrit.wikimedia.org/r/885246 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [09:45:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:50:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:53:42] (03PS2) 10JMeybohm: Switch staging.svc.eqiad.wmnet to point to codfw k8s [dns] - 10https://gerrit.wikimedia.org/r/884900 (https://phabricator.wikimedia.org/T327664) [10:00:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:03:23] (03PS1) 10Muehlenhoff: Move webproxy.codfw.wmnet to install2004 [dns] - 10https://gerrit.wikimedia.org/r/885285 (https://phabricator.wikimedia.org/T327867) [10:05:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:12:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "we could have the same in jobs-api" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/868791 (owner: 10Majavah) [10:13:45] (03Merged) 10jenkins-bot: add unit tests for parse_quantity [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/868791 
(owner: 10Majavah) [10:18:14] !log switching active kubernetes staging cluster from eqiad to codfw - T327664 [10:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:19] T327664: Update staging-eqiad to k8s 1.23 - https://phabricator.wikimedia.org/T327664 [10:19:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:21:01] (03CR) 10JMeybohm: [C: 03+2] Switch staging.svc.eqiad.wmnet to point to codfw k8s [dns] - 10https://gerrit.wikimedia.org/r/884900 (https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm) [10:21:59] (03CR) 10EoghanGaffney: [C: 03+2] Send rsyslog output for vrts apache logs to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/884909 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [10:23:22] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Switch the active staging cluster to codfw [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm) [10:23:28] (03CR) 10JMeybohm: [C: 03+2] Drop profile::ci::kubernetes_config [puppet] - 10https://gerrit.wikimedia.org/r/884915 (owner: 10JMeybohm) [10:27:37] (03CR) 10Arturo Borrero Gonzalez: "Thanks. change LGTM. Minor stuff inline." 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/868790 (https://phabricator.wikimedia.org/T277495) (owner: 10Majavah) [10:37:48] (03PS1) 10Giuseppe Lavagetto: icinga: remove mediawiki alerts [puppet] - 10https://gerrit.wikimedia.org/r/885288 [10:38:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:40:29] (03PS4) 10Giuseppe Lavagetto: Fix PHP string interpolation [puppet] - 10https://gerrit.wikimedia.org/r/868528 (https://phabricator.wikimedia.org/T314096) (owner: 10Reedy) [10:40:31] (03PS1) 10Giuseppe Lavagetto: nagios: remove obsolete command check_all_memcached.php [puppet] - 10https://gerrit.wikimedia.org/r/885289 [10:43:27] (03PS1) 10Jelto: sre.gitlab.upgrade: fix location of gitlab version-manifest.json [cookbooks] - 10https://gerrit.wikimedia.org/r/885291 (https://phabricator.wikimedia.org/T323569) [10:45:55] (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [10:46:38] 10SRE-OnFire, 10Maps (Kartotherian), 10Sustainability (Incident Followup), 10Technical-Debt: Kartotherian configuration should be deployable to all production envs at once - https://phabricator.wikimedia.org/T328406 (10awight) [10:50:32] (03CR) 10Filippo Giunchedi: [C: 03+1] "Ship it!" 
[puppet] - 10https://gerrit.wikimedia.org/r/885288 (owner: 10Giuseppe Lavagetto) [10:50:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [10:52:08] (03PS1) 10Filippo Giunchedi: sre: cosmetic-only changes for mw alerts [alerts] - 10https://gerrit.wikimedia.org/r/885293 [10:53:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:55:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:56:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:57:45] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=k8s-ingress-staging [10:57:46] !log jayme@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=k8s-ingress-staging [10:57:50] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39334/console" [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [10:59:28] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: cosmetic-only changes for mw alerts [alerts] - 
10https://gerrit.wikimedia.org/r/885293 (owner: 10Filippo Giunchedi) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230131T1100) [11:00:43] (03PS2) 10Muehlenhoff: Move webproxy.codfw.wmnet to install2004 [dns] - 10https://gerrit.wikimedia.org/r/885285 (https://phabricator.wikimedia.org/T327867) [11:00:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:00:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [11:01:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:06:22] (03CR) 10Muehlenhoff: [C: 03+2] Move webproxy.codfw.wmnet to install2004 [dns] - 10https://gerrit.wikimedia.org/r/885285 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [11:07:36] (03PS31) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [11:07:40] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T328420 (10phaultfinder) [11:12:37] (03PS3) 10Ilias Sarantopoulos: feat: add json payload capability [software/httpbb] - 10https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280) [11:14:27] (03CR) 10Jelto: "This change is ready for review." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/885291 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [11:21:27] !log installing bind9 security updates (client-side tools/libs only) [11:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:58] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/885291 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [11:25:28] (03PS4) 10Ilias Sarantopoulos: feat: add json payload capability [software/httpbb] - 10https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280) [11:25:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:26:19] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/884998 (owner: 10Jbond) [11:30:05] (03PS1) 10EoghanGaffney: Add /var/log/mail.{log,info,err,warn} to rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/885294 (https://phabricator.wikimedia.org/T321760) [11:30:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:32:16] (03CR) 10Volans: "did just a super quick pass, I'll redo a full pass once CI passes too" [cookbooks] - 10https://gerrit.wikimedia.org/r/884996 (owner: 10Jbond) [11:32:50] (03CR) 10Jbond: [C: 03+2] reposync: switch from copy_tree to copytree [software/spicerack] - 10https://gerrit.wikimedia.org/r/884998 (owner: 10Jbond) [11:35:15] (03PS5) 10Ilias Sarantopoulos: feat: add json payload capability [software/httpbb] - 10https://gerrit.wikimedia.org/r/884920 
(https://phabricator.wikimedia.org/T328280) [11:36:19] (03PS1) 10JMeybohm: Update staging-codfw to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/885297 (https://phabricator.wikimedia.org/T327664) [11:36:21] (03PS6) 10Ilias Sarantopoulos: feat: add json payload capability [software/httpbb] - 10https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280) [11:36:36] (03Merged) 10jenkins-bot: reposync: switch from copy_tree to copytree [software/spicerack] - 10https://gerrit.wikimedia.org/r/884998 (owner: 10Jbond) [11:37:34] (03CR) 10Volans: "The change makes sense to me, but it would be nice to know that it makes sense also based on other redfish implementations." [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 (owner: 10Jbond) [11:38:53] (03CR) 10Jelto: [C: 03+2] sre.gitlab.upgrade: get gitlab version from API [cookbooks] - 10https://gerrit.wikimedia.org/r/885291 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [11:39:02] (03PS5) 10Majavah: kubernetes: Apply resource changes on restart [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/868790 (https://phabricator.wikimedia.org/T277495) [11:39:15] (03CR) 10Majavah: kubernetes: Apply resource changes on restart (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/868790 (https://phabricator.wikimedia.org/T277495) (owner: 10Majavah) [11:39:18] (03CR) 10Ilias Sarantopoulos: feat: add json payload capability (033 comments) [software/httpbb] - 10https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280) (owner: 10Ilias Sarantopoulos) [11:39:47] (03CR) 10CI reject: [V: 04-1] kubernetes: Apply resource changes on restart [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/868790 (https://phabricator.wikimedia.org/T277495) (owner: 10Majavah) [11:40:31] (03CR) 10Volans: "LGTM, question and nit inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 (owner: 10Jbond) [11:40:41] 
(03Merged) 10jenkins-bot: sre.gitlab.upgrade: get gitlab version from API [cookbooks] - 10https://gerrit.wikimedia.org/r/885291 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [11:40:43] (03PS15) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) [11:41:13] (03PS6) 10Majavah: kubernetes: Apply resource changes on restart [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/868790 (https://phabricator.wikimedia.org/T277495) [11:41:15] (03PS13) 10Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881 [11:41:27] (03CR) 10Volans: redfish: Move dell specific functionality to dell class (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 (owner: 10Jbond) [11:41:36] (03CR) 10CI reject: [V: 04-1] P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [11:42:37] (03PS32) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [11:42:54] (03CR) 10Volans: "LGTM, some lines are reported as untested" [software/spicerack] - 10https://gerrit.wikimedia.org/r/884978 (owner: 10Jbond) [11:44:08] (03PS33) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [11:44:15] (03CR) 10Vgutierrez: Varnish analytics: support differential privacy (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [11:45:20] (03PS1) 10Jbond: add profile::idm::server::oidc_secret [labs/private] - 10https://gerrit.wikimedia.org/r/885300 [11:45:34] (03PS16) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 
(https://phabricator.wikimedia.org/T315676) [11:45:54] (03CR) 10Jbond: [V: 03+2 C: 03+2] add profile::idm::server::oidc_secret [labs/private] - 10https://gerrit.wikimedia.org/r/885300 (owner: 10Jbond) [11:48:11] (03PS17) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) [11:50:44] (03PS2) 10Jbond: rotate-snmp: convert to cookbook classes and use secrets for passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/884996 [11:50:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:50:55] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@42a07d3] (eqiad): Disable traffic mirroring from codfw to eqiad [11:51:31] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@42a07d3] (eqiad): Disable traffic mirroring from codfw to eqiad (duration: 00m 35s) [11:54:55] PROBLEM - kartotherian endpoints health on maps1010 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting [11:54:55] /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): 
/geoline?getgeojso [11:54:55] {ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [11:55:07] PROBLEM - kartotherian endpoints health on maps1008 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting [11:55:07] /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geoline?getgeojso [11:55:07] {ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [11:55:11] PROBLEM - kartotherian endpoints health on maps1006 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting [11:55:11] /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test 
returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geoline?getgeojso [11:55:11] {ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [11:55:39] PROBLEM - kartotherian endpoints health on maps1005 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting [11:55:39] /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geoline?getgeojso [11:55:39] {ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [11:55:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:55:48] (03PS1) 10Slyngshede: Rename profile::idm::server::oidc_secret variable [labs/private] - 10https://gerrit.wikimedia.org/r/885301 [11:56:09] PROBLEM - kartotherian endpoints health on maps1007 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting [11:56:09] /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geoline?getgeojso [11:56:09] {ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [11:56:09] PROBLEM - Kartotherian LVS eqiad on kartotherian.svc.eqiad.wmnet is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 4 [11:56:10] cting: 200): /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned 
the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geol [11:56:10] eojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geopoint?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [11:57:57] kartotherian unhappy again? :/ [11:58:25] (03CR) 10Slyngshede: [V: 03+1] Rename profile::idm::server::oidc_secret variable [labs/private] - 10https://gerrit.wikimedia.org/r/885301 (owner: 10Slyngshede) [11:59:02] (03PS18) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) [11:59:41] (03PS1) 10JMeybohm: k8s: Update staging-eqiad to kubernetes 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/885302 (https://phabricator.wikimedia.org/T327664) [12:00:32] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] Rename profile::idm::server::oidc_secret variable [labs/private] - 10https://gerrit.wikimedia.org/r/885301 (owner: 10Slyngshede) [12:00:35] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Rename profile::idm::server::oidc_secret variable [labs/private] - 10https://gerrit.wikimedia.org/r/885301 (owner: 10Slyngshede) [12:01:42] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39335/console" [puppet] - 10https://gerrit.wikimedia.org/r/885302 (https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm) [12:02:15] (03PS2) 
10JMeybohm: k8s: Update staging-eqiad to kubernetes 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/885302 (https://phabricator.wikimedia.org/T327664) [12:02:17] (03PS1) 10JMeybohm: install_server: Update kubestagetcd1* to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/885303 (https://phabricator.wikimedia.org/T327664) [12:04:20] Lucas_WMDE: yeah we tried to deploy what we reverted yesterday but its only on eqiad so no production traffic is affected [12:04:40] still doesn't look happy [12:07:27] ok, I see [12:08:56] (03Abandoned) 10Nikerabbit: Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/883911 (owner: 10L10n-bot) [12:11:45] (03PS1) 10Muehlenhoff: Update cloudbastion rules for install2004 [puppet] - 10https://gerrit.wikimedia.org/r/885304 (https://phabricator.wikimedia.org/T327867) [12:13:46] (03PS14) 10Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881 [12:15:26] (03PS1) 10Muehlenhoff: Stop DHCP on install2004 for now [puppet] - 10https://gerrit.wikimedia.org/r/885305 [12:15:29] (03PS1) 10Muehlenhoff: Point to install2004 for DHCP in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/885326 (https://phabricator.wikimedia.org/T327867) [12:16:15] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Elitre) [12:20:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:22:26] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - 
https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [12:23:04] ACKNOWLEDGEMENT - Kartotherian LVS eqiad on kartotherian.svc.eqiad.wmnet is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected [12:23:04] 04 (expecting: 200): /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200 [12:23:04] ine?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geopoint?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200) Effie Mouzeli devs are working on it https://wi [12:23:04] ikimedia.org/wiki/Maps%23Kartotherian [12:25:44] (03PS2) 10Muehlenhoff: Stop DHCP on install2004 for now [puppet] - 10https://gerrit.wikimedia.org/r/885305 [12:25:51] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:28:17] (03CR) 10Muehlenhoff: [C: 03+2] Stop DHCP on 
install2004 for now [puppet] - 10https://gerrit.wikimedia.org/r/885305 (owner: 10Muehlenhoff) [12:34:16] (03PS15) 10Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881 [12:36:04] (03PS1) 10Muehlenhoff: Fix name [puppet] - 10https://gerrit.wikimedia.org/r/885331 [12:36:10] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@5c58f8f] (eqiad): Disable traffic mirroring from codfw to eqiad [12:36:45] RECOVERY - kartotherian endpoints health on maps1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [12:36:46] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@5c58f8f] (eqiad): Disable traffic mirroring from codfw to eqiad (duration: 00m 35s) [12:37:06] (03PS1) 10EoghanGaffney: Send exim mail.{log,info,warn,err} to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/885332 (https://phabricator.wikimedia.org/T321759) [12:37:15] RECOVERY - Kartotherian LVS eqiad on kartotherian.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [12:37:15] RECOVERY - kartotherian endpoints health on maps1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [12:37:20] (03PS16) 10Slyngshede: P:IDM Configure OIDC and LDAP. 
[puppet] - 10https://gerrit.wikimedia.org/r/884881 [12:37:27] ^ FYI we reverted to previous healthy state since we figure out the problem (csp issues, x-amples on swagger not working as expected) cc effie [12:37:51] RECOVERY - kartotherian endpoints health on maps1010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [12:38:01] RECOVERY - kartotherian endpoints health on maps1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [12:38:07] RECOVERY - kartotherian endpoints health on maps1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [12:40:00] (03PS17) 10Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881 [12:40:59] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39340/console" [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [12:43:58] (03CR) 10Muehlenhoff: [C: 03+2] Fix name [puppet] - 10https://gerrit.wikimedia.org/r/885331 (owner: 10Muehlenhoff) [12:45:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:46:34] (03CR) 10Jelto: "looks mostly good, one comment on commit message" [puppet] - 10https://gerrit.wikimedia.org/r/885294 (https://phabricator.wikimedia.org/T321760) (owner: 10EoghanGaffney) [12:47:37] (03PS2) 10EoghanGaffney: Add /var/log/mail.{log,info,err,warn} to rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/885294 (https://phabricator.wikimedia.org/T321759) [12:47:58] (03PS18) 10Jaime Nuche: jenkins: add hieradata config for Scap3-based deployments [puppet] - 10https://gerrit.wikimedia.org/r/883913 
(https://phabricator.wikimedia.org/T323909) [12:48:00] (03PS6) 10Jaime Nuche: jenkins: use Scap3 deployment for releases instances [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) [12:48:03] (03PS4) 10Jaime Nuche: jenkins: enable Scap3 deployment for active releases instance [puppet] - 10https://gerrit.wikimedia.org/r/884891 (https://phabricator.wikimedia.org/T323909) [12:48:05] (03PS1) 10Jaime Nuche: jenkins: remove redundant class parameter [puppet] - 10https://gerrit.wikimedia.org/r/885333 [12:49:24] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/885294 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [12:50:45] (JobUnavailable) resolved: Reduced availability for job jmx_puppetdb in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:51:57] (03CR) 10Superpes15: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/884934 (https://phabricator.wikimedia.org/T328357) (owner: 10Superpes15) [12:54:12] (03PS1) 10Muehlenhoff: Move next-server settings from install2003->2004 [puppet] - 10https://gerrit.wikimedia.org/r/885336 (https://phabricator.wikimedia.org/T327867) [12:54:39] (03PS5) 10Jaime Nuche: jenkins: enable Scap3 deployment for active releases instance [puppet] - 10https://gerrit.wikimedia.org/r/884891 (https://phabricator.wikimedia.org/T323909) [12:55:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:57:28] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (10MoritzMuehlenhoff) install2004 has had the installserver role assigned and it's now acting the web proxy for codfw. The DHCP server is currrently stopped, tomorrow m... 
[12:59:03] (03CR) 10EoghanGaffney: Add /var/log/mail.{log,info,err,warn} to rsyslog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/885294 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [12:59:36] (03CR) 10Jaime Nuche: jenkins: use Scap3 deployment for releases instances (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [13:00:17] (03CR) 10Jaime Nuche: "PCC: https://puppet-compiler.wmflabs.org/output/884887/39341/" [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [13:00:43] (03CR) 10Jaime Nuche: "PCC: https://puppet-compiler.wmflabs.org/output/884891/39342/" [puppet] - 10https://gerrit.wikimedia.org/r/884891 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [13:04:01] (03PS10) 10Jbond: redfish: Move dell specific functionality to dell class [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 [13:04:03] (03PS10) 10Jbond: redfish: store all OOB info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 [13:06:03] (03PS1) 10Daniel Kinzler: Bump parsoid parser cache writes to 25%. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/885337 (https://phabricator.wikimedia.org/T320534) [13:06:05] (03CR) 10Jbond: redfish: store all OOB info for later use (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 (owner: 10Jbond) [13:06:11] (03CR) 10Jbond: redfish: Move dell specific functionality to dell class (035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 (owner: 10Jbond) [13:09:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] kubernetes: Apply resource changes on restart [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/868790 (https://phabricator.wikimedia.org/T277495) (owner: 10Majavah) [13:10:44] (03CR) 10Arturo Borrero Gonzalez: "this needs manual rebase :-(" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/883261 (https://phabricator.wikimedia.org/T311918) (owner: 10Majavah) [13:11:23] (03PS3) 10Majavah: kubernetes: Use the shared image-config configmap [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/883261 (https://phabricator.wikimedia.org/T311918) [13:11:37] (03CR) 10Majavah: kubernetes: Use the shared image-config configmap (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/883261 (https://phabricator.wikimedia.org/T311918) (owner: 10Majavah) [13:15:55] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39343/console" [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [13:23:01] (03CR) 10Jbond: redfish: add system_manager info (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/884978 (owner: 10Jbond) [13:37:26] (03PS18) 10Slyngshede: P:IDM Configure OIDC and LDAP. 
[puppet] - 10https://gerrit.wikimedia.org/r/884881 [13:38:38] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39344/console" [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [13:39:23] (03PS2) 10Jbond: redfish: add system_manager info [software/spicerack] - 10https://gerrit.wikimedia.org/r/884978 [13:40:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:42:01] (03PS1) 10MSantos: mobileapps: bump to 2023-01-31-130212-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/885353 [13:42:40] (03CR) 10Volans: [C: 03+1] "I've mostly checked the varnish_dp_key_generator.py file and LGTM, I've left a minor nit inline." [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez) [13:42:53] 10SRE, 10Commons, 10MediaWiki-File-management, 10StructuredDataOnCommons, and 3 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10PatchDemoBot) Test wiki on [[ https://patchdemo.wmflabs.org | Patch demo ]] by TheDJ us... 
[13:45:45] (JobUnavailable) resolved: Reduced availability for job jmx_puppetdb in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:45:52] (03CR) 10Volans: "I've manually performed the steps at https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Renaming/Deleting_a_cookbook to remove the ol" [cookbooks] - 10https://gerrit.wikimedia.org/r/883228 (https://phabricator.wikimedia.org/T327783) (owner: 10Muehlenhoff) [13:47:26] (03PS2) 10Jbond: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [13:48:13] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/885304 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [13:50:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:51:38] (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 2023-01-31-130212-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/885353 (owner: 10MSantos) [13:56:54] (03Merged) 10jenkins-bot: mobileapps: bump to 2023-01-31-130212-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/885353 (owner: 10MSantos) [13:58:42] (03CR) 10Slyngshede: [V: 03+1] "Allow IDM to authenticate with OIDC." 
[puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [13:58:46] 10SRE-tools, 10Infrastructure-Foundations, 10Machine-Learning-Team, 10Patch-For-Review: httpbb with HTTP POSTs and json payload - https://phabricator.wikimedia.org/T328280 (10isarantopoulos) a:03isarantopoulos [13:59:13] 10SRE-tools, 10Infrastructure-Foundations, 10Machine-Learning-Team: httpbb doesn't support integers in the POST's body - https://phabricator.wikimedia.org/T328120 (10isarantopoulos) a:03isarantopoulos [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230131T1400). nyaa~ [14:00:05] Dreamy_Jazz and duesen: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230131T1400) [14:00:23] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/885332 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [14:00:29] I have a meeting and probably can’t deploy, sorry [14:00:31] i can deploy today [14:00:38] Dreamy_Jazz: duesen: hi, around? [14:00:50] urbanecm: hi [14:00:51] \0 [14:00:57] urbanecm: that’s good, since one of the changes needs CU rights to verify too ^^ [14:01:03] :) [14:01:40] (03PS2) 10Urbanecm: Disable write old for CheckUserLog reason field for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885041 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [14:01:46] urbanecm: my config change is the same as last week. This time, we go from 10% to 25%. Nothing to test. 
[14:01:47] (03PS2) 10Urbanecm: Remove redundant definition of wgCheckUserEnableSpecialInvestigate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885051 (owner: 10Dreamy Jazz) [14:01:50] (03CR) 10Jbond: rotate-snmp: convert to cookbook classes and use secrets for passwords (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/884996 (owner: 10Jbond) [14:01:53] duesen: ack [14:01:55] !log urbanecm@deploy1002 Backport cancelled. [14:02:07] (03PS2) 10Urbanecm: Bump parsoid parser cache writes to 25%. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885337 (https://phabricator.wikimedia.org/T320534) (owner: 10Daniel Kinzler) [14:02:16] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885041 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [14:02:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885051 (owner: 10Dreamy Jazz) [14:02:21] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885337 (https://phabricator.wikimedia.org/T320534) (owner: 10Daniel Kinzler) [14:02:21] let's do them all in one go then [14:03:02] (03Merged) 10jenkins-bot: Disable write old for CheckUserLog reason field for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885041 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [14:03:06] (03Merged) 10jenkins-bot: Remove redundant definition of wgCheckUserEnableSpecialInvestigate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885051 (owner: 10Dreamy Jazz) [14:03:09] (03Merged) 10jenkins-bot: Bump parsoid parser cache writes to 25%. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/885337 (https://phabricator.wikimedia.org/T320534) (owner: 10Daniel Kinzler) [14:03:27] 10SRE-tools, 10Infrastructure-Foundations, 10Machine-Learning-Team, 10Patch-For-Review: httpbb with HTTP POSTs and json payload - https://phabricator.wikimedia.org/T328280 (10isarantopoulos) After discussing during the review with @RLazarus we went with the second approach. In the aforementioned patch the... [14:03:36] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:885041|Disable write old for CheckUserLog reason field for testwiki (T233004)]], [[gerrit:885051|Remove redundant definition of wgCheckUserEnableSpecialInvestigate]], [[gerrit:885337|Bump parsoid parser cache writes to 25%. (T320534)]] [14:03:37] (03CR) 10Jbond: [C: 03+1] "lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/885326 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [14:03:42] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [14:03:43] T320534: Put Parsoid output into the ParserCache on every edit - https://phabricator.wikimedia.org/T320534 [14:04:14] I will be able to test the investigate one, but for the other one of mine the test steps are: [14:04:14] * Make an entry into the CheckUserLog using any non-empty reason [14:04:15] * Inspect that row in the database to ensure cul_reason is the empty string [14:04:19] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: Reinitialize staging-eqiad with k8s 1.23 [14:04:36] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: Reinitialize staging-eqiad with k8s 1.23 [14:05:08] !log mbsantos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:05:26] !log urbanecm@deploy1002 urbanecm and dreamyjazz and daniel: Backport for [[gerrit:885041|Disable write old for CheckUserLog reason field for 
testwiki (T233004)]], [[gerrit:885051|Remove redundant definition of wgCheckUserEnableSpecialInvestigate]], [[gerrit:885337|Bump parsoid parser cache writes to 25%. (T320534)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [14:05:32] !log mbsantos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:05:36] Dreamy_Jazz: pulled to mwdebug for testing [14:05:44] !log mbsantos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [14:06:04] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39345/console" [puppet] - 10https://gerrit.wikimedia.org/r/885333 (owner: 10Jaime Nuche) [14:06:18] (03CR) 10Jbond: [C: 03+2] "LGTM and will merge as its a noop" [puppet] - 10https://gerrit.wikimedia.org/r/885333 (owner: 10Jaime Nuche) [14:06:31] !log mbsantos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [14:06:58] Special:Investigate shows and I can run a check using it on mwdebug1001 [14:07:04] So that change is good [14:07:51] (03CR) 10Jbond: "After merging i noticed that this will cause a change on the cloud instances" [puppet] - 10https://gerrit.wikimedia.org/r/885333 (owner: 10Jaime Nuche) [14:07:56] (03CR) 10JMeybohm: [C: 03+2] install_server: Update kubestagetcd1* to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/885303 (https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm) [14:07:57] !log mbsantos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [14:08:44] !log mbsantos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [14:09:03] The other is on testwiki so will need someone else than me to test [14:09:12] jbond: feel free to merge my change [14:09:35] Dreamy_Jazz: i assume i need to check if cul_reason stops being populated, and cu log works? [14:09:46] Yes. 
[14:09:57] doing [14:10:29] It should be the default value of a string if all things go right [14:11:23] *a empty string [14:11:38] * duesen is waiting for the metric to jump [14:12:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [14:12:49] it's indeed an empty string, and cul_reason_id: 6, so...looks good to me [14:13:26] (kind of surprised for the reason ID to be that low, but i guess "testing" is not uncommon at testwiki :D) [14:13:38] * urbanecm is proceeding [14:13:44] Yeah. I was going to say it had to be a fairly common reason [14:13:44] ^^ [14:14:30] uhh. scap's full of red text. [14:14:38] (03CR) 10Jaime Nuche: jenkins: remove redundant class parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/885333 (owner: 10Jaime Nuche) [14:15:13] this is the text https://www.irccloud.com/pastebin/mLQSHgH0/ [14:16:16] ...and it proceeds with rolling things back [14:16:21] (03PS1) 10Jbond: apereo_cas: move merge strategy to lookup_options [puppet] - 10https://gerrit.wikimedia.org/r/885356 [14:16:25] wonderful. [14:17:32] can i please get some SRE help with getting past this scap sync error? ^^ seems to be about k8s config being group-readable [14:18:26] jayme: sukhe: akosiaris: ^ please :) [14:19:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:19:53] urbanecm: uhm...looking [14:20:09] urbanecm: what host are you on? 
[14:20:09] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:885041|Disable write old for CheckUserLog reason field for testwiki (T233004)]], [[gerrit:885051|Remove redundant definition of wgCheckUserEnableSpecialInvestigate]], [[gerrit:885337|Bump parsoid parser cache writes to 25%. (T320534)]] (duration: 16m 33s) [14:20:13] jayme: deploy1002 [14:20:16] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [14:20:16] T320534: Put Parsoid output into the ParserCache on every edit - https://phabricator.wikimedia.org/T320534 [14:20:22] trying to do a MW deployment with scap backport [14:21:07] urbanecm: confirmed, thank you! [14:22:04] the deployment finished appservers-wise. looks like scap's rollback only affects the k8s part. [14:22:34] (03CR) 10Jbond: [C: 03+2] apereo_cas: move merge strategy to lookup_options [puppet] - 10https://gerrit.wikimedia.org/r/885356 (owner: 10Jbond) [14:23:25] Amir1: parsoid cache writs are now at 25% [14:23:44] yup, thanks. Do you feel like reviewing this patch for the mobile clean up? [14:25:11] (03PS19) 10Jbond: P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [14:25:15] urbanecm: I'd assume a temporary error, can you try again [14:25:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:26:08] maybe a race and the repo cache got corrupted during your scap run...fetching the charts works for me now [14:26:30] ack, trying again. 
[14:26:56] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:885041|Disable write old for CheckUserLog reason field for testwiki (T233004)]], [[gerrit:885051|Remove redundant definition of wgCheckUserEnableSpecialInvestigate]], [[gerrit:885337|Bump parsoid parser cache writes to 25%. (T320534)]] [14:27:01] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [14:27:02] T320534: Put Parsoid output into the ParserCache on every edit - https://phabricator.wikimedia.org/T320534 [14:28:10] Do you need me to test my changes again? [14:28:32] Dreamy_Jazz: nope, I'm just re-running the sync to ensure it's actually synced out everywhere (incl. k8s) [14:28:42] !log urbanecm@deploy1002 dreamyjazz and urbanecm and daniel: Backport for [[gerrit:885041|Disable write old for CheckUserLog reason field for testwiki (T233004)]], [[gerrit:885051|Remove redundant definition of wgCheckUserEnableSpecialInvestigate]], [[gerrit:885337|Bump parsoid parser cache writes to 25%. (T320534)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [14:28:46] Thanks. 
[14:28:49] proceeding [14:30:12] (03CR) 10David Caro: [C: 03+1] P:wmcs::metricsinfra: add haproxy config for grafana [puppet] - 10https://gerrit.wikimedia.org/r/869211 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [14:30:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:30:46] (03CR) 10David Caro: [C: 03+1] P:wmcs::metricsinfra: add internal name for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/871291 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [14:31:07] (03CR) 10David Caro: [C: 03+1] P:wmcs::metricsinfra::grafana: configure data sources [puppet] - 10https://gerrit.wikimedia.org/r/871292 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [14:31:09] urbanecm: the group-readable config is a red herring btw. It's just a "security warning" that helm spits out (for every command you run via helmfile) [14:31:49] makes sense, thanks for the explanation/help. [14:31:53] (03CR) 10Jbond: [C: 03+1] "lgtm some minor comments" [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [14:32:05] it got past the k8s steps w/o any errors on the second try [14:32:35] (03CR) 10Jbond: jenkins: remove redundant class parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/885333 (owner: 10Jaime Nuche) [14:32:37] !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubestagetcd1004.eqiad.wmnet with OS bullseye [14:33:08] !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubestagetcd1005.eqiad.wmnet with OS bullseye [14:33:28] !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubestagetcd1006.eqiad.wmnet with OS bullseye [14:33:55] nice, thanks! 
[14:34:19] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:885041|Disable write old for CheckUserLog reason field for testwiki (T233004)]], [[gerrit:885051|Remove redundant definition of wgCheckUserEnableSpecialInvestigate]], [[gerrit:885337|Bump parsoid parser cache writes to 25%. (T320534)]] (duration: 07m 23s) [14:34:27] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [14:34:27] T320534: Put Parsoid output into the ParserCache on every edit - https://phabricator.wikimedia.org/T320534 [14:34:30] and it's all done now [14:35:20] (03PS1) 10Ottomata: mw-page-content-change-enrichment - v1.0.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/885357 (https://phabricator.wikimedia.org/T325305) [14:35:40] Thanks! [14:35:47] np [14:38:25] (03CR) 10Gmodena: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/885357 (https://phabricator.wikimedia.org/T325305) (owner: 10Ottomata) [14:38:44] (03PS1) 10Dreamy Jazz: Disable write old for CheckUserLog reason on group 0 and group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885358 (https://phabricator.wikimedia.org/T233004) [14:39:58] (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST csinodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:40:45] (JobUnavailable) firing: (3) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:41:18] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagetcd1006.eqiad.wmnet with reason: host reimage [14:41:22] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 
2:00:00 on kubestagetcd1004.eqiad.wmnet with reason: host reimage [14:41:28] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagetcd1005.eqiad.wmnet with reason: host reimage [14:42:08] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2035.codfw.wmnet with OS bullseye [14:42:14] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2035.codfw.wmnet with OS bullseye [14:42:53] (03CR) 10Ottomata: [C: 03+2] mw-page-content-change-enrichment - v1.0.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/885357 (https://phabricator.wikimedia.org/T325305) (owner: 10Ottomata) [14:43:38] (03PS1) 10Dreamy Jazz: Disable write old for CheckUserLog reason everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885359 (https://phabricator.wikimedia.org/T233004) [14:43:43] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagetcd1006.eqiad.wmnet with reason: host reimage [14:44:22] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:44:25] (03CR) 10Bking: [C: 03+1] miscweb / query_service: remove ability to list directories [puppet] - 10https://gerrit.wikimedia.org/r/883272 (https://phabricator.wikimedia.org/T324667) (owner: 10Gehel) [14:44:26] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:44:30] (03PS1) 10Stevemunene: Add authzIdentity to jaas config [deployment-charts] - 10https://gerrit.wikimedia.org/r/885360 (https://phabricator.wikimedia.org/T327884) [14:45:29] (03PS4) 10Superpes15: Add mobile wordmark to cswiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884934 (https://phabricator.wikimedia.org/T328357) [14:45:31] (03PS1) 10Ottomata: 
mw-page-content-change-enrichment - use correct image verison v1.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/885361 (https://phabricator.wikimedia.org/T327494) [14:45:45] (JobUnavailable) firing: (3) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:45:48] (03CR) 10Ottomata: [V: 03+2 C: 03+2] mw-page-content-change-enrichment - use correct image verison v1.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/885361 (https://phabricator.wikimedia.org/T327494) (owner: 10Ottomata) [14:46:18] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagetcd1004.eqiad.wmnet with reason: host reimage [14:46:21] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:46:25] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:48:41] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagetcd1005.eqiad.wmnet with reason: host reimage [14:50:35] (03CR) 10David Caro: P:metricsinfra: add profile and role for a Grafana server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/869210 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [14:55:45] (JobUnavailable) firing: (3) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:56:26] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubestagetcd1006.eqiad.wmnet with OS bullseye [14:56:56] !log jayme@cumin1001 END 
(PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubestagetcd1004.eqiad.wmnet with OS bullseye [14:57:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [15:00:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [15:01:08] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2035.codfw.wmnet with reason: host reimage [15:01:16] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubestagetcd1005.eqiad.wmnet with OS bullseye [15:02:28] (03CR) 10Majavah: P:metricsinfra: add profile and role for a Grafana server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/869210 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [15:02:56] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:04:28] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2035.codfw.wmnet with reason: host reimage [15:09:50] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage1003.eqiad.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [15:15:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [15:15:25] this is me [15:15:32] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: imagecatalog_record.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:34] (pybal backend) [15:15:34] thanks [15:15:59] sorry for telling so late - had to jump in a meeting [15:16:23] (03PS1) 10Jbond: redfish: Add simple supermicro class [software/spicerack] - 10https://gerrit.wikimedia.org/r/885363 [15:16:24] np, I figured it was from your earlier work and also that you would have noticed it [15:18:22] (03CR) 10David Caro: P:metricsinfra: add profile and role for a Grafana server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/869210 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [15:19:41] (03CR) 10JMeybohm: [C: 03+2] k8s: Update staging-eqiad to kubernetes 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/885302 (https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm) [15:19:54] (03CR) 10CI reject: [V: 04-1] redfish: Add simple supermicro class [software/spicerack] - 10https://gerrit.wikimedia.org/r/885363 (owner: 10Jbond) [15:20:58] !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubestagemaster1001.eqiad.wmnet with OS bullseye [15:21:30] (03PS2) 10Jbond: redfish: Add simple supermicro class [software/spicerack] - 10https://gerrit.wikimedia.org/r/885363 [15:23:42] !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host kubestage1003.eqiad.wmnet with OS bullseye [15:24:16] !log sukhe@cumin2002 END 
(PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2035.codfw.wmnet with OS bullseye [15:24:22] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2035.codfw.wmnet with OS bullseye completed: - cp2035 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [15:25:03] (03CR) 10CI reject: [V: 04-1] redfish: Add simple supermicro class [software/spicerack] - 10https://gerrit.wikimedia.org/r/885363 (owner: 10Jbond) [15:26:38] (03PS2) 10Southparkfan: rsyslog: allow subject name validation [puppet] - 10https://gerrit.wikimedia.org/r/876248 (https://phabricator.wikimedia.org/T127717) [15:26:59] (03CR) 10CI reject: [V: 04-1] rsyslog: allow subject name validation [puppet] - 10https://gerrit.wikimedia.org/r/876248 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [15:27:08] jouncebot: nowandnext [15:27:08] No deployments scheduled for the next 1 hour(s) and 32 minute(s) [15:27:08] In 1 hour(s) and 32 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230131T1700) [15:27:14] (03CR) 10Ladsgroup: [C: 03+2] Set 'groupLoadsBySection' for s11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885058 (https://phabricator.wikimedia.org/T326980) (owner: 10Zabe) [15:27:41] (03CR) 10Southparkfan: rsyslog: allow subject name validation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/876248 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [15:28:03] (03CR) 10Reedy: [C: 03+1] Document the '+' pattern for specifying wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885048 (owner: 10Gergő Tisza) [15:28:07] (03CR) 10Majavah: P:metricsinfra: add profile and role for a Grafana server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/869210 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) 
[15:28:11] (03Merged) 10jenkins-bot: Set 'groupLoadsBySection' for s11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885058 (https://phabricator.wikimedia.org/T326980) (owner: 10Zabe) [15:29:10] (03PS1) 10Bking: flink-rdf-streaming-updater: use S3 instead of swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/885365 (https://phabricator.wikimedia.org/T304914) [15:30:08] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:885058|Set 'groupLoadsBySection' for s11 (T326980)]] [15:30:14] T326980: PHP Notice: Undefined index: s11 - https://phabricator.wikimedia.org/T326980 [15:31:25] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster1001.eqiad.wmnet with reason: host reimage [15:32:00] !log ladsgroup@deploy1002 ladsgroup and zabe: Backport for [[gerrit:885058|Set 'groupLoadsBySection' for s11 (T326980)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [15:32:33] (03PS3) 10Southparkfan: rsyslog: allow subject name validation [puppet] - 10https://gerrit.wikimedia.org/r/876248 (https://phabricator.wikimedia.org/T127717) [15:33:00] 10SRE-tools, 10Infrastructure-Foundations, 10Machine-Learning-Team: httpbb doesn't support integers in the POST's body - https://phabricator.wikimedia.org/T328120 (10isarantopoulos) @elukey I closed this task since your change has already been merged and deployed. 
[15:33:55] (03CR) 10David Caro: [C: 03+2] P:metricsinfra: add profile and role for a Grafana server (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/869210 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [15:34:06] (03CR) 10David Caro: [C: 03+2] P:wmcs::metricsinfra: add haproxy config for grafana [puppet] - 10https://gerrit.wikimedia.org/r/869211 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [15:34:14] (03CR) 10David Caro: [C: 03+2] P:wmcs::metricsinfra: add internal name for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/871291 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [15:34:18] (03CR) 10David Caro: [C: 03+2] P:wmcs::metricsinfra::grafana: configure data sources [puppet] - 10https://gerrit.wikimedia.org/r/871292 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [15:34:31] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster1001.eqiad.wmnet with reason: host reimage [15:34:34] (03CR) 10DCausse: [C: 03+1] flink-rdf-streaming-updater: use S3 instead of swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/885365 (https://phabricator.wikimedia.org/T304914) (owner: 10Bking) [15:35:43] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage1003.eqiad.wmnet with reason: host reimage [15:35:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:17] (03CR) 10Muehlenhoff: [C: 03+2] Split Swift cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/883228 (https://phabricator.wikimedia.org/T327783) (owner: 10Muehlenhoff) [15:37:07] (03CR) 10Muehlenhoff: [C: 03+2] Update cloudbastion rules for install2004 [puppet] - 10https://gerrit.wikimedia.org/r/885304 
(https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [15:38:46] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage1003.eqiad.wmnet with reason: host reimage [15:39:58] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:885058|Set 'groupLoadsBySection' for s11 (T326980)]] (duration: 09m 49s) [15:40:03] T326980: PHP Notice: Undefined index: s11 - https://phabricator.wikimedia.org/T326980 [15:40:17] (03PS1) 10Ottomata: Define dse_kubepod_networks in network constants and in ferm defs [puppet] - 10https://gerrit.wikimedia.org/r/885366 (https://phabricator.wikimedia.org/T328447) [15:40:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:42:30] (03PS1) 10Ottomata: Allow access to kafka jumbo and test from DSE k8s [puppet] - 10https://gerrit.wikimedia.org/r/885367 (https://phabricator.wikimedia.org/T325305) [15:43:47] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39347/console" [puppet] - 10https://gerrit.wikimedia.org/r/885367 (https://phabricator.wikimedia.org/T325305) (owner: 10Ottomata) [15:46:35] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [15:49:46] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubestagemaster1001.eqiad.wmnet with OS bullseye [15:50:20] (03CR) 10Btullis: Define dse_kubepod_networks in network constants and in ferm defs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/885366 (https://phabricator.wikimedia.org/T328447) (owner: 10Ottomata) [15:51:41] (03PS2) 10Ottomata: Define dse_kubepod_networks in network constants and in ferm defs [puppet] - 
10https://gerrit.wikimedia.org/r/885366 (https://phabricator.wikimedia.org/T328447) [15:52:19] (03CR) 10Btullis: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/885366 (https://phabricator.wikimedia.org/T328447) (owner: 10Ottomata) [15:52:23] (03CR) 10Ottomata: Define dse_kubepod_networks in network constants and in ferm defs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/885366 (https://phabricator.wikimedia.org/T328447) (owner: 10Ottomata) [15:54:22] !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host kubestage1004.eqiad.wmnet with OS bullseye [15:54:56] (03CR) 10Bking: [C: 03+2] flink-rdf-streaming-updater: use S3 instead of swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/885365 (https://phabricator.wikimedia.org/T304914) (owner: 10Bking) [15:55:16] (03CR) 10Bking: [V: 03+2 C: 03+2] flink-rdf-streaming-updater: use S3 instead of swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/885365 (https://phabricator.wikimedia.org/T304914) (owner: 10Bking) [15:55:18] (03CR) 10Ottomata: [C: 03+2] Define dse_kubepod_networks in network constants and in ferm defs [puppet] - 10https://gerrit.wikimedia.org/r/885366 (https://phabricator.wikimedia.org/T328447) (owner: 10Ottomata) [15:55:28] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Allow access to kafka jumbo and test from DSE k8s [puppet] - 10https://gerrit.wikimedia.org/r/885367 (https://phabricator.wikimedia.org/T325305) (owner: 10Ottomata) [15:55:32] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2035.codfw.wmnet,service=cdn [15:55:33] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2035.codfw.wmnet,service=ats-be [15:55:40] (03PS2) 10Ottomata: Allow access to kafka jumbo and test from DSE k8s [puppet] - 10https://gerrit.wikimedia.org/r/885367 (https://phabricator.wikimedia.org/T325305) [15:55:44] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 
(owner: 10Jbond) [15:56:31] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5018.eqsin.wmnet with OS bullseye [15:56:37] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5018.eqsin.wmnet with OS bullseye [15:56:53] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5028.eqsin.wmnet with OS bullseye [15:57:02] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5028.eqsin.wmnet with OS bullseye [15:57:07] (03CR) 10Btullis: [C: 03+1] Allow access to kafka jumbo and test from DSE k8s [puppet] - 10https://gerrit.wikimedia.org/r/885367 (https://phabricator.wikimedia.org/T325305) (owner: 10Ottomata) [15:57:15] moritzm: am puppet-merging your 'Update cloudbastion rules for install2004' change. 
[15:57:32] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 (owner: 10Jbond) [15:58:05] (03CR) 10JMeybohm: [C: 03+2] Update staging-codfw to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/885297 (https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm) [15:58:11] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/884978 (owner: 10Jbond) [15:58:49] (03PS1) 10Filippo Giunchedi: scap: shorten timeout on target bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/885369 [15:58:51] (03PS1) 10Filippo Giunchedi: pontoon: default to not block_abuse_nets [puppet] - 10https://gerrit.wikimedia.org/r/885370 [15:58:52] a little gerrit spam incoming, sorry [15:58:53] (03PS1) 10Filippo Giunchedi: pontoon: update o11y with opensearch roles and settings [puppet] - 10https://gerrit.wikimedia.org/r/885371 [15:58:55] (03PS1) 10Filippo Giunchedi: opensearch: move to /run/ [puppet] - 10https://gerrit.wikimedia.org/r/885372 [15:58:57] (03PS1) 10Filippo Giunchedi: opensearch: service depends on tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/885373 [16:00:05] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: sync [16:00:07] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: sync [16:00:26] (03PS3) 10Giuseppe Lavagetto: mediawiki: adapt rsyslog parsing of slowlog to ecs 1.11 [deployment-charts] - 10https://gerrit.wikimedia.org/r/884360 [16:00:37] ottomata: ack, thx [16:00:43] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [16:00:44] (03CR) 10Giuseppe Lavagetto: mediawiki: adapt rsyslog parsing of slowlog to ecs 1.11 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/884360 (owner: 10Giuseppe Lavagetto) [16:01:12] !log otto@deploy1002 
helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [16:01:42] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp2032.codfw.wmnet with OS bullseye [16:01:50] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp2032.codfw.wmnet with OS bullseye [16:03:29] (03CR) 10Filippo Giunchedi: "If this looks good I think we can port the same changes to elasticsearch too" [puppet] - 10https://gerrit.wikimedia.org/r/885373 (owner: 10Filippo Giunchedi) [16:04:09] (03CR) 10Filippo Giunchedi: "Ditto as I56f976f65d I think this could/should be ported to elasticsearch too" [puppet] - 10https://gerrit.wikimedia.org/r/885372 (owner: 10Filippo Giunchedi) [16:05:03] (03Merged) 10jenkins-bot: Update staging-codfw to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/885297 (https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm) [16:05:29] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [16:06:03] (03PS2) 10Herron: logstash: remove rate of ingestion percent change compared to yesterday alert [alerts] - 10https://gerrit.wikimedia.org/r/884349 (https://phabricator.wikimedia.org/T202307) [16:06:37] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage1004.eqiad.wmnet with reason: host reimage [16:07:37] (03CR) 10Herron: [C: 03+2] logstash: remove rate of ingestion percent change compared to yesterday alert [alerts] - 10https://gerrit.wikimedia.org/r/884349 (https://phabricator.wikimedia.org/T202307) (owner: 10Herron) [16:08:48] 10SRE, 10API Platform, 10GrowthExperiments-ImpactModule, 10Growth-Team (Current Sprint), 10MW-1.40-notes (1.40.0-wmf.21; 2023-01-30): UserImpact: Fetch information for more articles when calculating most-viewed-articles data ponit - 
https://phabricator.wikimedia.org/T324675 (10EChetty) [16:08:51] 10SRE, 10API Platform, 10GrowthExperiments-ImpactModule, 10Growth-Team (Current Sprint), 10MW-1.40-notes (1.40.0-wmf.21; 2023-01-30): UserImpact: Fetch information for more articles when calculating most-viewed-articles data ponit - https://phabricator.wikimedia.org/T324675 (10EChetty) Maintainers of AQS... [16:08:53] (03Merged) 10jenkins-bot: logstash: remove rate of ingestion percent change compared to yesterday alert [alerts] - 10https://gerrit.wikimedia.org/r/884349 (https://phabricator.wikimedia.org/T202307) (owner: 10Herron) [16:09:48] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage1004.eqiad.wmnet with reason: host reimage [16:10:51] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10EChetty) [16:11:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10EChetty) [16:12:18] (03CR) 10Dzahn: [C: 03+2] releases: add blackbox::check::http monitor [puppet] - 10https://gerrit.wikimedia.org/r/884392 (https://phabricator.wikimedia.org/T327975) (owner: 10Dzahn) [16:13:46] (03CR) 10Jbond: [C: 03+1] setup.py: force a newer sphinx_rtd_theme [software/spicerack] - 10https://gerrit.wikimedia.org/r/883538 (owner: 10Volans) [16:14:10] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10EChetty) [16:14:11] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage1003.eqiad.wmnet with OS bullseye [16:14:22] (03CR) 10EoghanGaffney: [C: 03+2] Send exim mail.{log,info,warn,err} to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/885332 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [16:14:29] (03PS2) 
10EoghanGaffney: Send exim mail.{log,info,warn,err} to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/885332 (https://phabricator.wikimedia.org/T321759) [16:14:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10EChetty) [16:15:08] (03CR) 10Filippo Giunchedi: "Also when investigating this I have come to doubt whether ExecStartPre is actually effective/needed: by default it is executed as the unit" [puppet] - 10https://gerrit.wikimedia.org/r/885373 (owner: 10Filippo Giunchedi) [16:16:19] (03PS1) 10Dzahn: releases: fix IP family parameter name in blackbox http check [puppet] - 10https://gerrit.wikimedia.org/r/885376 [16:16:47] (03PS2) 10Volans: setup.py: force a newer sphinx_rtd_theme [software/spicerack] - 10https://gerrit.wikimedia.org/r/883538 [16:16:53] (03CR) 10Volans: [C: 03+2] setup.py: force a newer sphinx_rtd_theme [software/spicerack] - 10https://gerrit.wikimedia.org/r/883538 (owner: 10Volans) [16:17:20] (03CR) 10Volans: [C: 03+2] setup.py: force a newer sphinx_rtd_theme [software/cumin] - 10https://gerrit.wikimedia.org/r/883540 (owner: 10Volans) [16:17:55] (03CR) 10Dzahn: [C: 03+2] releases: fix IP family parameter name in blackbox http check [puppet] - 10https://gerrit.wikimedia.org/r/885376 (owner: 10Dzahn) [16:18:12] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5018.eqsin.wmnet with OS bullseye [16:18:14] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5028.eqsin.wmnet with OS bullseye [16:18:18] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5018.eqsin.wmnet with OS bullseye executed with errors: - cp5018 (**FAIL**) - Downtimed on Icinga/Alertmanager -... 
[16:18:21] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5028.eqsin.wmnet with OS bullseye executed with errors: - cp5028 (**FAIL**) - Downtimed on Icinga/Alertmanager -... [16:18:43] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5028.eqsin.wmnet with OS bullseye [16:18:46] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5018.eqsin.wmnet with OS bullseye [16:18:50] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5028.eqsin.wmnet with OS bullseye [16:18:52] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5018.eqsin.wmnet with OS bullseye [16:19:48] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [16:20:19] (03PS1) 10DCausse: flink-app: do not set "taskmanager.numberOfTaskSlots" [deployment-charts] - 10https://gerrit.wikimedia.org/r/885377 [16:20:28] (03Merged) 10jenkins-bot: setup.py: force a newer sphinx_rtd_theme [software/spicerack] - 10https://gerrit.wikimedia.org/r/883538 (owner: 10Volans) [16:20:35] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[16:21:17] (03CR) 10Jaime Nuche: [C: 03+1] scap: shorten timeout on target bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/885369 (owner: 10Filippo Giunchedi) [16:22:26] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [16:23:52] (03Merged) 10jenkins-bot: setup.py: force a newer sphinx_rtd_theme [software/cumin] - 10https://gerrit.wikimedia.org/r/883540 (owner: 10Volans) [16:27:05] (03CR) 10Filippo Giunchedi: [C: 03+2] scap: shorten timeout on target bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/885369 (owner: 10Filippo Giunchedi) [16:28:14] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage1004.eqiad.wmnet with OS bullseye [16:29:13] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5019.eqsin.wmnet,service=cdn [16:29:14] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5019.eqsin.wmnet,service=ats-be [16:29:33] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Grants:Programs/Wikimedia Community Fund" "Grants:Programs/Wikimedia Community Fund/General Support Fund" "Zabe" --reason "per request [[:phab:T328456|T328456]]" --skip-subpages # T328456 [16:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:37] T328456: Move translatable page Grants:Programs/Wikimedia Community Fund - https://phabricator.wikimedia.org/T328456 [16:35:24] PROBLEM - Host cp5019 is DOWN: PING CRITICAL - Packet loss = 100% [16:35:32] er ok, downtiming it too [16:35:52] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 5:00:00 on cp5019.eqsin.wmnet with reason: testing reimaging cookbook stalling failure [16:36:07] !log sukhe@cumin2002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 5:00:00 on cp5019.eqsin.wmnet with reason: testing reimaging cookbook stalling failure [16:37:36] !log bking@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [16:37:55] !log bking@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [16:38:41] !log bking@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [16:38:49] (03PS4) 10Giuseppe Lavagetto: sre: add alerting for mediawiki on k8s [alerts] - 10https://gerrit.wikimedia.org/r/797315 [16:39:57] (03CR) 10CI reject: [V: 04-1] sre: add alerting for mediawiki on k8s [alerts] - 10https://gerrit.wikimedia.org/r/797315 (owner: 10Giuseppe Lavagetto) [16:40:00] !log bking@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [16:41:45] (03PS1) 10Ottomata: mw--page-content-change-enrichment - increase memory in dse k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/885382 (https://phabricator.wikimedia.org/T325305) [16:41:50] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [16:42:12] (03CR) 10Ottomata: [C: 03+2] mw--page-content-change-enrichment - increase memory in dse k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/885382 (https://phabricator.wikimedia.org/T325305) (owner: 10Ottomata) [16:42:37] (03CR) 10Cwhite: [C: 03+2] conftool-data: add logstash[12]032 to kibana7 backend [puppet] - 10https://gerrit.wikimedia.org/r/881813 (owner: 10Cwhite) [16:43:26] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [16:43:48] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:44:12] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:44:42] !log cwhite@cumin2002 conftool action : set/weight=10; selector: 
name=logstash1032.eqiad.wmnet [16:45:03] !log cwhite@cumin2002 conftool action : set/weight=10; selector: name=logstash2032.codfw.wmnet [16:46:02] (03CR) 10Ottomata: [V: 03+2 C: 03+2] mw--page-content-change-enrichment - increase memory in dse k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/885382 (https://phabricator.wikimedia.org/T325305) (owner: 10Ottomata) [16:46:27] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [16:46:32] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [16:48:38] PROBLEM - puppet last run on mw2271 is CRITICAL: CRITICAL: Puppet has been disabled for 604958 seconds, message: test - dzahn, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:48:43] (03PS1) 10Dzahn: etherpad: use correct port number for blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/885383 (https://phabricator.wikimedia.org/T327974) [16:49:13] oh, did I disable puppet on a random mw server and forget? fixing that [16:49:19] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2032.codfw.wmnet with OS bullseye [16:49:23] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp2032.codfw.wmnet with OS bullseye executed with errors: - cp2032 (**FAIL**) - Downtimed on Icinga/Alertmanager -... 
[16:49:40] !log mw2271 - renabling disabled puppet [16:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:59] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp2032.codfw.wmnet with OS bullseye [16:50:02] (03PS1) 10Ottomata: mw-page-content-change-enrichment - lower mem usage to match k8s limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/885384 (https://phabricator.wikimedia.org/T325305) [16:50:32] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp2032.codfw.wmnet with OS bullseye [16:51:22] (03CR) 10Ottomata: "Otherwise:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/885384 (https://phabricator.wikimedia.org/T325305) (owner: 10Ottomata) [16:51:29] (03CR) 10Ottomata: [V: 03+2 C: 03+2] mw-page-content-change-enrichment - lower mem usage to match k8s limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/885384 (https://phabricator.wikimedia.org/T325305) (owner: 10Ottomata) [16:52:08] !log cwhite@deploy1002 Started deploy [releng/phatality@e0bb573]: (no justification provided) [16:52:13] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [16:52:18] !log cwhite@deploy1002 Finished deploy [releng/phatality@e0bb573]: (no justification provided) (duration: 00m 10s) [16:52:21] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [16:52:30] !log cwhite@deploy1002 Started deploy [releng/phatality@e0bb573]: (no justification provided) [16:52:41] !log cwhite@deploy1002 Finished deploy [releng/phatality@e0bb573]: (no justification provided) (duration: 00m 11s) [16:52:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-staging@codfw - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:54:18] RECOVERY - puppet last run on mw2271 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:54:39] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5018.eqsin.wmnet with reason: host reimage [16:54:57] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5028.eqsin.wmnet with reason: host reimage [16:55:34] (03PS3) 10Cwhite: logstash: clean up curator actions todo items [puppet] - 10https://gerrit.wikimedia.org/r/869251 (https://phabricator.wikimedia.org/T301760) [16:56:13] (03CR) 10Cwhite: [C: 03+2] logstash: clean up curator actions todo items [puppet] - 10https://gerrit.wikimedia.org/r/869251 (https://phabricator.wikimedia.org/T301760) (owner: 10Cwhite) [16:57:45] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5018.eqsin.wmnet with reason: host reimage [16:58:12] (03CR) 10Cwhite: [C: 03+2] logstash: change ecs-default clean up policy to prefix [puppet] - 10https://gerrit.wikimedia.org/r/869252 (owner: 10Cwhite) [16:58:19] (03PS3) 10Cwhite: logstash: change ecs-default clean up policy to prefix [puppet] - 10https://gerrit.wikimedia.org/r/869252 [16:59:06] (03CR) 10DCausse: "we want to customize this value to use allow more tasks to run per pods. 
Did not put in values.yaml as I'm not sure if it's something the " [deployment-charts] - 10https://gerrit.wikimedia.org/r/885377 (owner: 10DCausse) [16:59:31] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5028.eqsin.wmnet with reason: host reimage [16:59:32] (03PS3) 10Cwhite: logstash: change ecs-test clean up policy to prefix [puppet] - 10https://gerrit.wikimedia.org/r/869253 [17:00:04] jbond and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230131T1700) [17:00:04] No Gerrit patches in the queue for this window AFAICS. [17:01:02] (03PS1) 10Jelto: sre.gitlab.upgrade: remove Debian revision suffix from version check [cookbooks] - 10https://gerrit.wikimedia.org/r/885385 (https://phabricator.wikimedia.org/T323569) [17:02:15] (03CR) 10Cwhite: [C: 03+2] logstash: change ecs-test clean up policy to prefix [puppet] - 10https://gerrit.wikimedia.org/r/869253 (owner: 10Cwhite) [17:02:35] (03PS3) 10Cwhite: logstash: change w3creportingapi clean up policy to prefix [puppet] - 10https://gerrit.wikimedia.org/r/869254 [17:02:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:03:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host cp5019.eqsin.wmnet [17:03:58] (03CR) 10Cwhite: [C: 03+2] logstash: change w3creportingapi clean up policy to prefix [puppet] - 10https://gerrit.wikimedia.org/r/869254 (owner: 10Cwhite) [17:04:27] (03CR) 10Jelto: [C: 03+1] "lgtm together with I97f5f27991d5cecda3fe5a2b927cade329ebeded" [puppet] - 10https://gerrit.wikimedia.org/r/885332 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [17:05:11] (03CR) 10Dzahn: [C: 03+1] Send exim 
mail.{log,info,warn,err} to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/885332 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [17:05:29] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2032.codfw.wmnet with reason: host reimage [17:08:36] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2032.codfw.wmnet with reason: host reimage [17:09:25] RECOVERY - Host cp5019 is UP: PING WARNING - Packet loss = 90%, RTA = 225.37 ms [17:12:07] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:22] (03CR) 10Jbond: "i think ill rework this to use the multihttpush uri instead" [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [17:13:49] (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/885360 (https://phabricator.wikimedia.org/T327884) (owner: 10Stevemunene) [17:14:10] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cp5019.eqsin.wmnet [17:16:28] 10SRE-tools, 10Infrastructure-Foundations, 10Machine-Learning-Team: httpbb doesn't support integers in the POST's body - https://phabricator.wikimedia.org/T328120 (10Aklapper) @isarantopoulos: Hi, this task is still open. If this task is resolved, please set the task status to `resolved`. Thanks a lot! [17:28:41] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2032.codfw.wmnet with OS bullseye [17:28:46] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp2032.codfw.wmnet with OS bullseye completed: - cp2032 (**PASS**) - Removed from Puppet and PuppetDB if present -... 
[17:29:26] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5029.eqsin.wmnet,service=cdn [17:29:26] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5029.eqsin.wmnet,service=ats-be [17:29:33] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5019.eqsin.wmnet,service=cdn [17:29:34] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5019.eqsin.wmnet,service=ats-be [17:29:53] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on cp5029.eqsin.wmnet with reason: testing reimaging cookbook stalling failure [17:30:03] (03CR) 10Volans: [C: 03+1] "Makes sense without overcomplicating it using the debian versioning scheme" [cookbooks] - 10https://gerrit.wikimedia.org/r/885385 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [17:30:08] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cp5029.eqsin.wmnet with reason: testing reimaging cookbook stalling failure [17:30:18] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5028.eqsin.wmnet with OS bullseye [17:30:31] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5028.eqsin.wmnet with OS bullseye completed: - cp5028 (**PASS**) - Removed from Puppet and PuppetDB if present -... 
[17:31:27] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [17:31:44] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5028.eqsin.wmnet,service=cdn [17:31:44] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5028.eqsin.wmnet,service=ats-be [17:33:10] 10SRE-tools, 10Infrastructure-Foundations, 10Machine-Learning-Team: httpbb doesn't support integers in the POST's body - https://phabricator.wikimedia.org/T328120 (10RLazarus) 05Open→03Resolved [17:33:14] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp2032.codfw.wmnet [17:34:09] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5018.eqsin.wmnet with OS bullseye [17:34:11] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [17:34:15] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5018.eqsin.wmnet with OS bullseye completed: - cp5018 (**PASS**) - Removed from Puppet and PuppetDB if present -... 
[17:34:52] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [17:34:58] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10dcaro) [17:35:23] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5018.eqsin.wmnet,service=cdn [17:35:24] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5018.eqsin.wmnet,service=ats-be [17:36:45] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [17:37:48] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp1076.eqiad.wmnet [17:38:04] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp1076.eqiad.wmnet [17:38:21] (03PS1) 10Jdrewniak: Add cswiki to desktop-improvements group. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885391 (https://phabricator.wikimedia.org/T328154) [17:38:44] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10Andrew) We have a ton of rebalancing to do for each of these switches. The C8 deadline we can meet but can we ge... 
[17:38:55] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp1090.eqiad.wmnet [17:39:02] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp1090.eqiad.wmnet [17:41:49] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [17:42:16] (03PS3) 10Jbond: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [17:45:56] (03CR) 10CI reject: [V: 04-1] redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [17:46:48] (03CR) 10Ahmon Dancy: "joe, this chart is still referenced from https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/image-suggestion-api/+/refs/hea" [deployment-charts] - 10https://gerrit.wikimedia.org/r/859541 (owner: 10Giuseppe Lavagetto) [17:46:58] (03PS1) 10Bking: flink-rdf-streaming-updater: use S3 instead of swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/885392 (https://phabricator.wikimedia.org/T304914) [17:47:05] (03CR) 10CI reject: [V: 04-1] flink-rdf-streaming-updater: use S3 instead of swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/885392 (https://phabricator.wikimedia.org/T304914) (owner: 10Bking) [17:47:18] !log sukhe@cumin2002 START - Cookbook sre.hosts.remove-downtime for cp5019.eqsin.wmnet [17:47:19] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp5019.eqsin.wmnet [17:50:43] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp2034.codfw.wmnet with OS bullseye [17:50:49] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp2034.codfw.wmnet with OS bullseye [17:52:46] (03PS11) 10Jbond: redfish: Move dell specific 
functionality to dell class [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 [17:52:47] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1075.eqiad.wmnet,service=cdn [17:52:47] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1075.eqiad.wmnet,service=ats-be [17:52:48] (03PS11) 10Jbond: redfish: store all OOB info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 [17:52:50] (03PS3) 10Jbond: redfish: add system_manager info [software/spicerack] - 10https://gerrit.wikimedia.org/r/884978 [17:52:52] (03PS4) 10Jbond: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [17:52:54] (03PS3) 10Jbond: redfish: Add simple supermicro class [software/spicerack] - 10https://gerrit.wikimedia.org/r/885363 [17:52:56] (03PS2) 10DCausse: flink-app: do not set "taskmanager.numberOfTaskSlots" [deployment-charts] - 10https://gerrit.wikimedia.org/r/885377 [17:53:01] !log depool cp1075.eqiad.wmnet for iDRAC firmware testing: T321309 [17:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:05] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [17:55:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cp5029.eqsin.wmnet with OS bullseye [17:55:54] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cp5029.eqsin.wmnet with OS bullseye [17:56:16] (03CR) 10CI reject: [V: 04-1] redfish: store all OOB info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 (owner: 10Jbond) [17:56:19] (03CR) 10CI reject: [V: 04-1] redfish: Move dell specific functionality to dell class [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 (owner: 10Jbond) [17:56:24] (03CR) 10CI reject: [V: 04-1] redfish: 
Add simple supermicro class [software/spicerack] - 10https://gerrit.wikimedia.org/r/885363 (owner: 10Jbond) [17:56:28] (03CR) 10CI reject: [V: 04-1] redfish: add system_manager info [software/spicerack] - 10https://gerrit.wikimedia.org/r/884978 (owner: 10Jbond) [17:56:34] (03CR) 10CI reject: [V: 04-1] redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [17:57:27] (03PS1) 10Bking: flink-rdf-streaming-updater: use S3 instead of swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/885394 (https://phabricator.wikimedia.org/T304914) [17:57:53] (03Abandoned) 10Bking: flink-rdf-streaming-updater: use S3 instead of swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/885392 (https://phabricator.wikimedia.org/T304914) (owner: 10Bking) [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230131T1800) [18:00:57] (03PS12) 10Jbond: redfish: Move dell specific functionality to dell class [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 [18:01:00] (03PS12) 10Jbond: redfish: store all OOB info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 [18:01:01] (03PS4) 10Jbond: redfish: add system_manager info [software/spicerack] - 10https://gerrit.wikimedia.org/r/884978 [18:01:03] (03PS5) 10Jbond: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [18:01:05] (03PS4) 10Jbond: redfish: Add simple supermicro class [software/spicerack] - 10https://gerrit.wikimedia.org/r/885363 [18:01:17] (03PS1) 10Nray: Enable ClientPreferences for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885395 (https://phabricator.wikimedia.org/T327979) [18:04:15] (03CR) 10CI reject: [V: 04-1] redfish: Move dell specific functionality to dell class [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 (owner: 10Jbond) [18:04:36] (03CR) 10CI reject: [V: 04-1] 
redfish: store all OOB info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 (owner: 10Jbond) [18:04:38] (03CR) 10CI reject: [V: 04-1] redfish: add system_manager info [software/spicerack] - 10https://gerrit.wikimedia.org/r/884978 (owner: 10Jbond) [18:04:40] (03CR) 10CI reject: [V: 04-1] redfish: Add simple supermicro class [software/spicerack] - 10https://gerrit.wikimedia.org/r/885363 (owner: 10Jbond) [18:04:42] (03CR) 10CI reject: [V: 04-1] redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [18:05:33] (03CR) 10DCausse: [C: 03+1] flink-rdf-streaming-updater: use S3 instead of swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/885394 (https://phabricator.wikimedia.org/T304914) (owner: 10Bking) [18:05:48] (03PS13) 10Jbond: redfish: Move dell specific functionality to dell class [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 [18:06:01] (03PS13) 10Jbond: redfish: store all OOB info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 [18:06:19] (03PS5) 10Jbond: redfish: add system_manager info [software/spicerack] - 10https://gerrit.wikimedia.org/r/884978 [18:06:27] (03PS6) 10Jbond: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [18:07:36] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5020.eqsin.wmnet with OS bullseye [18:07:42] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5020.eqsin.wmnet with OS bullseye [18:09:47] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2034.codfw.wmnet with reason: host reimage [18:09:56] (03CR) 10CI reject: [V: 04-1] redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [18:10:35] (03CR) 
10Bking: [C: 03+2] flink-rdf-streaming-updater: use S3 instead of swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/885394 (https://phabricator.wikimedia.org/T304914) (owner: 10Bking) [18:10:44] (03CR) 10Jbond: [C: 03+2] redfish: Move dell specific functionality to dell class [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 (owner: 10Jbond) [18:10:49] (03CR) 10Jbond: [C: 03+2] redfish: store all OOB info for later use (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 (owner: 10Jbond) [18:10:54] (03CR) 10Jbond: [C: 03+2] redfish: add system_manager info [software/spicerack] - 10https://gerrit.wikimedia.org/r/884978 (owner: 10Jbond) [18:12:56] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2034.codfw.wmnet with reason: host reimage [18:14:06] (03CR) 10CI reject: [V: 04-1] redfish: store all OOB info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 (owner: 10Jbond) [18:14:08] (03CR) 10CI reject: [V: 04-1] redfish: add system_manager info [software/spicerack] - 10https://gerrit.wikimedia.org/r/884978 (owner: 10Jbond) [18:19:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:19:12] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5020.eqsin.wmnet with OS bullseye [18:19:17] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5020.eqsin.wmnet with OS bullseye executed with errors: - cp5020 (**FAIL**) - Downtimed on Icinga/Alertmanager -... 
[18:19:39] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1075.eqiad.wmnet,service=cdn [18:19:39] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1075.eqiad.wmnet,service=ats-be [18:20:03] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5020.eqsin.wmnet with OS bullseye [18:20:09] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5020.eqsin.wmnet with OS bullseye [18:21:19] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp1075.eqiad.wmnet [18:21:24] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp1075.eqiad.wmnet [18:22:28] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1075.eqiad.wmnet'] [18:22:33] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1075.eqiad.wmnet'] [18:23:20] (03PS2) 10Sbailey: Enable Linter write namespace, tag and template for group0 and group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885046 (https://phabricator.wikimedia.org/T299612) [18:24:09] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1075'] [18:24:15] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1075'] [18:25:09] !log bking@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [18:25:59] !log bking@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [18:26:37] !log gitlab-prod-1001.devtools (cloud) - ip addr del 172.16.7.146/21 dev eth0 - T318521 [18:26:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:41] T318521: Migrate gitlab-test instance to bullseye - https://phabricator.wikimedia.org/T318521 [18:32:30] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2034.codfw.wmnet with OS bullseye [18:32:34] (03CR) 10Jdlrobson: [C: 03+1] Enable ClientPreferences for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885395 (https://phabricator.wikimedia.org/T327979) (owner: 10Nray) [18:32:35] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp2034.codfw.wmnet with OS bullseye completed: - cp2034 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [18:34:45] 10SRE, 10ops-drmrs, 10Infrastructure-Foundations, 10netops: cr2-drmrs:xe-0/1/1 stuck optic - https://phabricator.wikimedia.org/T324555 (10RobH) CS0907837: > Support, > > We have three items for remote hands to accomplish for us on this request: > > 1) Please pickup DEL0117661, unpackage it into our ra... [18:42:40] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5020.eqsin.wmnet with OS bullseye [18:42:45] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5020.eqsin.wmnet with OS bullseye executed with errors: - cp5020 (**FAIL**) - Removed from Puppet and PuppetDB if p... 
[18:42:51] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5020.eqsin.wmnet with OS bullseye [18:42:57] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5020.eqsin.wmnet with OS bullseye [18:44:36] !log gitlab-prod-1001.devtools (cloud) - rebooted VM ; ip addr del 172.16.7.146/32 dev eth0 - T318521 [18:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:41] T318521: Migrate gitlab-test instance to bullseye - https://phabricator.wikimedia.org/T318521 [18:44:54] (03PS7) 10Jbond: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [18:46:05] (03PS14) 10Jbond: redfish: Move dell specific functionality to dell class [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 [18:46:43] (03PS14) 10Jbond: redfish: store all OOB info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 [18:46:53] (03PS6) 10Jbond: redfish: add system_manager info [software/spicerack] - 10https://gerrit.wikimedia.org/r/884978 [18:47:03] (03PS8) 10Jbond: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [18:50:36] (03CR) 10CI reject: [V: 04-1] redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [18:50:45] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:53:50] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2034.codfw.wmnet,service=cdn [18:53:51] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: 
name=cp2034.codfw.wmnet,service=ats-be [18:53:58] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [18:55:45] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:58:04] (03CR) 10Ottomata: [C: 03+2] "Thank you!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/885377 (owner: 10DCausse) [19:00:05] dancy and brennen: How many deployers does it take to do MediaWiki train - Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230131T1900). [19:01:32] o/ [19:01:41] Pressing the buttons [19:01:58] o/ [19:02:45] (03PS1) 10TrainBranchBot: group0 wikis to 1.40.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885406 (https://phabricator.wikimedia.org/T325584) [19:02:46] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.40.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885406 (https://phabricator.wikimedia.org/T325584) (owner: 10TrainBranchBot) [19:03:34] (03Merged) 10jenkins-bot: group0 wikis to 1.40.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885406 (https://phabricator.wikimedia.org/T325584) (owner: 10TrainBranchBot) [19:04:46] (03Merged) 10jenkins-bot: flink-app: do not set "taskmanager.numberOfTaskSlots" [deployment-charts] - 10https://gerrit.wikimedia.org/r/885377 (owner: 10DCausse) [19:12:05] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.21 refs T325584 [19:12:10] T325584: 1.40.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T325584 [19:15:15] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10colewhite) [19:15:45] 
(JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:16:08] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5020.eqsin.wmnet with OS bullseye [19:16:14] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5020.eqsin.wmnet with OS bullseye executed with errors: - cp5020 (**FAIL**) - Removed from Puppet and PuppetDB if p... [19:16:41] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5020.eqsin.wmnet with OS bullseye [19:17:15] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5020.eqsin.wmnet with OS bullseye [19:20:45] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:21:31] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp2037.codfw.wmnet with OS bullseye [19:21:37] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp2037.codfw.wmnet with OS bullseye [19:26:31] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (10Papaul) on cp5029 reimage steps - start reimage cookbook on one terminal - start console on 
another terminal on the console terminal the server pxe... [19:30:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5029.eqsin.wmnet with reason: host reimage [19:32:20] (03PS9) 10Jbond: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [19:33:01] 10SRE, 10ops-drmrs, 10Infrastructure-Foundations, 10netops: cr2-drmrs:xe-0/1/1 stuck optic - https://phabricator.wikimedia.org/T324555 (10RobH) p:05Low→03Medium [19:33:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5029.eqsin.wmnet with reason: host reimage [19:35:06] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [19:35:40] (03CR) 10CI reject: [V: 04-1] redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [19:36:06] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [19:36:27] 10SRE, 10ops-drmrs, 10Infrastructure-Foundations, 10netops: cr2-drmrs:xe-0/1/1 stuck optic - https://phabricator.wikimedia.org/T324555 (10RobH) [19:40:16] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2037.codfw.wmnet with reason: host reimage [19:43:21] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2037.codfw.wmnet with reason: host reimage [19:58:02] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5020.eqsin.wmnet with OS bullseye [19:58:08] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5020.eqsin.wmnet with OS bullseye [19:58:10] 10SRE, 10Traffic-Icebox: Disable TLSv1/TLSv1.1 on sites without caching layer - 
https://phabricator.wikimedia.org/T238518 (10BCornwall) [19:58:14] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp5020.eqsin.wmnet with OS bullseye [19:58:19] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5020.eqsin.wmnet with OS bullseye executed with errors: - cp5020 (**FAIL**) - Removed from Puppet and PuppetDB if p... [19:58:24] 10SRE, 10Traffic-Icebox: Disable TLSv1/TLSv1.1 on sites without caching layer - https://phabricator.wikimedia.org/T238518 (10BCornwall) [19:59:05] !log sudo rm /etc/dhcp/automation/ttyS1-115200/cp5020.conf [19:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:33] (03CR) 10RLazarus: "I don't have any strong feelings about this, but I do want httpbb to be consistent with SRE's other Python repos." [software/httpbb] - 10https://gerrit.wikimedia.org/r/885273 (owner: 10Ilias Sarantopoulos) [20:00:32] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5020.eqsin.wmnet with OS bullseye [20:00:38] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5020.eqsin.wmnet with OS bullseye [20:03:17] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2037.codfw.wmnet with OS bullseye [20:03:23] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp2037.codfw.wmnet with OS bullseye completed: - cp2037 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... 
[20:04:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5029.eqsin.wmnet with OS bullseye [20:04:21] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cp5029.eqsin.wmnet with OS bullseye completed: - cp5029 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled P... [20:05:35] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp2037.codfw.wmnet [20:05:56] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [20:06:19] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp2039.codfw.wmnet with OS bullseye [20:06:27] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp2039.codfw.wmnet with OS bullseye [20:07:21] (03PS1) 10Slyngshede: C:apereo_cas fix memberOf to group mapping in OIDC. 
[puppet] - 10https://gerrit.wikimedia.org/r/885415 [20:09:06] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5029.eqsin.wmnet,service=cdn [20:09:06] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5029.eqsin.wmnet,service=ats-be [20:09:24] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [20:11:32] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp2036.codfw.wmnet with OS bullseye [20:11:38] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp2036.codfw.wmnet with OS bullseye [20:12:07] (03PS1) 10Zabe: Stop writing to cuc_user and cuc_user_text in group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885416 (https://phabricator.wikimedia.org/T233004) [20:19:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:22:26] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [20:25:19] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2039.codfw.wmnet with reason: host reimage [20:28:03] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2039.codfw.wmnet with reason: host reimage [20:30:04] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2036.codfw.wmnet with reason: host reimage [20:33:16] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2036.codfw.wmnet with reason: host 
reimage [20:37:01] (03PS2) 10Brian Wolff: Restrict flow-edit-title to autoconfirmed on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884142 (https://phabricator.wikimedia.org/T328097) [20:44:46] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:45:52] !log start running "foreachwikiindblist s5.dblist migrateRevisionCommentTemp.php --sleep 2" in screen # T275246 [20:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:58] T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246 [20:47:49] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2039.codfw.wmnet with OS bullseye [20:47:54] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp2039.codfw.wmnet with OS bullseye completed: - cp2039 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [20:50:32] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor Migration: Pooling thumbor-k8s causes spikes in swift 500 errors - https://phabricator.wikimedia.org/T328033 (10VirginiaPoundstone) @KOfori sre and data persistence tagged. Thank you for your guidance. [20:52:27] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2036.codfw.wmnet with OS bullseye [20:52:32] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp2036.codfw.wmnet with OS bullseye completed: - cp2036 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [20:54:16] (03CR) 10Ollie Shotton: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/885422 (https://phabricator.wikimedia.org/T326313) (owner: 10Ollie Shotton) [20:55:21] 10SRE, 10API Platform, 10GrowthExperiments-ImpactModule, 10Growth-Team (Current Sprint), 10MW-1.40-notes (1.40.0-wmf.21; 2023-01-30): UserImpact: Fetch information for more articles when calculating most-viewed-articles data ponit - https://phabricator.wikimedia.org/T324675 (10kostajh) >>! In T324675#857... [20:57:15] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp5020.eqsin.wmnet with OS bullseye [20:57:21] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5020.eqsin.wmnet with OS bullseye executed with errors: - cp5020 (**FAIL**) - Removed from Puppet and PuppetDB if p... [20:58:44] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5020.eqsin.wmnet with OS bullseye [20:58:50] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5020.eqsin.wmnet with OS bullseye [21:00:01] 10SRE, 10DBA, 10Data-Persistence, 10Data-Persistence-Backup, and 2 others: Data check es2020 after replication broke - https://phabricator.wikimedia.org/T327770 (10jcrespo) 05In progress→03Resolved All tables resulted ok from the check, comparing eqiad, its codfw primary and itself on the last 4million... [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230131T2100). [21:00:04] sbailey, nray, and bawolff: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:16] Woo [21:00:27] o/ [21:00:34] I am here :-) I think [21:00:58] I can start the deploy window, but I'll need to hand it off if it goes long. [21:02:27] sbailey, I'll do yours first. [21:02:35] ok, ready [21:03:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885046 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [21:03:39] !log start UTC late backport window [21:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:04] (03Merged) 10jenkins-bot: Enable Linter write namespace, tag and template for group0 and group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885046 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [21:04:27] !log kindrobot@deploy1002 Started scap: Backport for [[gerrit:885046|Enable Linter write namespace, tag and template for group0 and group1 (T299612)]] [21:04:31] T299612: Add namespace column and index to table - https://phabricator.wikimedia.org/T299612 [21:06:17] !log kindrobot@deploy1002 sbailey and kindrobot: Backport for [[gerrit:885046|Enable Linter write namespace, tag and template for group0 and group1 (T299612)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:06:40] sbailey: can you confirm? [21:07:50] waiting for sync, blocked creating page on test2wiki, going to another site. This is run from a job. Should be safe as group 0 passed fine. [21:08:29] trying meta [21:09:15] yes can do it here, give me 1 minute [21:09:50] ack [21:12:00] We are good to go [21:12:06] working on meta [21:12:17] Great, thanks! Syncing... [21:15:25] 10SRE, 10Traffic, 10Patch-For-Review: Add DP cookie for pageview filtering - https://phabricator.wikimedia.org/T315676 (10Jcross) Thank you so much for the quick reply. Exciting!! 
[21:15:53] (03CR) 10Dzahn: [C: 03+2] "..for now.." [puppet] - 10https://gerrit.wikimedia.org/r/885383 (https://phabricator.wikimedia.org/T327974) (owner: 10Dzahn) [21:17:48] !log kindrobot@deploy1002 Finished scap: Backport for [[gerrit:885046|Enable Linter write namespace, tag and template for group0 and group1 (T299612)]] (duration: 13m 20s) [21:17:53] T299612: Add namespace column and index to table - https://phabricator.wikimedia.org/T299612 [21:18:36] Next up is nray if you're ready. [21:18:46] thank you, I'm ready! [21:19:08] Oh, actually it looks like there's a merge conflict. Could you resolve it? [21:19:40] kindrobot: let me take a look [21:19:42] bawolff: yours also has a merge conflict [21:19:54] Oh, I just rebased it, just a second, I'll do it again [21:20:11] (03PS2) 10Nray: Enable ClientPreferences for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885395 (https://phabricator.wikimedia.org/T327979) [21:20:23] (03PS3) 10Brian Wolff: Restrict flow-edit-title to autoconfirmed on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884142 (https://phabricator.wikimedia.org/T328097) [21:20:38] @kindrobot should be good now [21:21:23] Great, merging... 
[21:22:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885395 (https://phabricator.wikimedia.org/T327979) (owner: 10Nray) [21:23:00] (03Merged) 10jenkins-bot: Enable ClientPreferences for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885395 (https://phabricator.wikimedia.org/T327979) (owner: 10Nray) [21:23:23] !log kindrobot@deploy1002 Started scap: Backport for [[gerrit:885395|Enable ClientPreferences for group0 (T327979)]] [21:23:28] T327979: Enable persistent fixed width setting for anonymous users - https://phabricator.wikimedia.org/T327979 [21:24:04] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp2036.codfw.wmnet [21:24:15] RoanKattouw, urbanecm, cjming, or TheresNoTime, could I hand bawolff's patch 884142 off to one of you after this one? [21:24:16] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp2039.codfw.wmnet [21:24:56] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [21:25:03] !log kindrobot@deploy1002 kindrobot and nray: Backport for [[gerrit:885395|Enable ClientPreferences for group0 (T327979)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [21:25:10] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp2038.codfw.wmnet with OS bullseye [21:25:16] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp2038.codfw.wmnet with OS bullseye [21:25:29] nray: could you confirm? [21:25:37] yes, checking now [21:27:30] @kindrobot things look good, you can proceed! [21:27:51] Thank you! Syncing... [21:29:35] Sorry bawolff, I won't be able to deploy your patch. 
I've got a commitment coming up, and I can't risk the deployment window running into it. [21:29:41] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching cassandra-dev2002.codfw.wmnet: Trying to induce errors - eevans@cumin1001 [21:29:59] kindrobot: no worries, it happens. It's not a particularly urgent patch [21:30:23] Great, thank you for understanding. :) [21:30:24] If someone else shows up to do more in the window, please ping me :) [21:31:00] bawolff: how do you not have prod access? [21:31:05] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5020.eqsin.wmnet with reason: host reimage [21:31:26] RhinosF1: I used to, once upon a time [21:31:44] I can take a look, but only in like 30min [21:32:08] Umm, around the time I quit my job at WMF, my laptop was stolen (Prague hackathon, it was an interesting time for me), so my access got revoked, and since I was kind of quitting anyways, I never asked for it back [21:32:52] zabe: That'd be awesome if that works out, but if not, no stress, I'll just do some other window [21:33:41] !log kindrobot@deploy1002 Finished scap: Backport for [[gerrit:885395|Enable ClientPreferences for group0 (T327979)]] (duration: 10m 17s) [21:33:46] T327979: Enable persistent fixed width setting for anonymous users - https://phabricator.wikimedia.org/T327979 [21:34:05] Not to mention, it's very rare I do stuff that involves deploying things. Last time I participated in this process it was still called SWAT [21:34:15] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5020.eqsin.wmnet with reason: host reimage [21:34:51] I literally just downloaded the wikimedia debug toolbar ten minutes ago because I haven't needed it since I got my new laptop [21:35:18] !log close UTC late backport window. Did not deploy bawolff 884142 as I ran out of time. 
zabe may reopen the window in around 30 minutes to finish it out [21:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:28] thanks for your help @kindrobot ! [21:35:40] No problem, thank you everyone. :) [21:36:15] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching cassandra-dev2002.codfw.wmnet: Trying to induce errors - eevans@cumin1001 [21:39:02] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host cassandra-dev2002.codfw.wmnet [21:44:05] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2038.codfw.wmnet with reason: host reimage [21:44:59] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cassandra-dev2002.codfw.wmnet [21:47:17] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2038.codfw.wmnet with reason: host reimage [22:05:17] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5020.eqsin.wmnet with OS bullseye [22:05:23] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5020.eqsin.wmnet with OS bullseye completed: - cp5020 (**PASS**) - Removed from Puppet and PuppetDB if present -... 
[22:07:07] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5020.eqsin.wmnet,service=cdn [22:07:08] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5020.eqsin.wmnet,service=ats-be [22:07:29] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [22:07:50] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2038.codfw.wmnet with OS bullseye [22:07:55] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp2038.codfw.wmnet with OS bullseye completed: - cp2038 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [22:10:02] (03PS1) 10Jcrespo: Add unit tests [software/mediabackups] - 10https://gerrit.wikimedia.org/r/885428 [22:13:14] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp2038.codfw.wmnet [22:13:38] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp2040.codfw.wmnet with OS bullseye [22:13:44] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp2040.codfw.wmnet with OS bullseye [22:13:48] (03PS4) 10Zabe: Restrict flow-edit-title to autoconfirmed on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884142 (https://phabricator.wikimedia.org/T328097) (owner: 10Brian Wolff) [22:13:56] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [22:14:04] bawolff, we can do this now [22:14:11] Woo. 
Thanks :) [22:14:31] (03CR) 10Zabe: [C: 03+2] Restrict flow-edit-title to autoconfirmed on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884142 (https://phabricator.wikimedia.org/T328097) (owner: 10Brian Wolff) [22:15:16] (03Merged) 10jenkins-bot: Restrict flow-edit-title to autoconfirmed on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884142 (https://phabricator.wikimedia.org/T328097) (owner: 10Brian Wolff) [22:17:53] !log zabe@deploy1002 Started scap: Backport for [[gerrit:884142|Restrict flow-edit-title to autoconfirmed on mediawikiwiki (T328097)]] [22:17:58] T328097: make flow-edit-title be autoconfirm only on mediawikiwiki - https://phabricator.wikimedia.org/T328097 [22:19:04] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:19:45] !log zabe@deploy1002 zabe and bawolff: Backport for [[gerrit:884142|Restrict flow-edit-title to autoconfirmed on mediawikiwiki (T328097)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [22:20:52] zabe: I tested and confirmed it worked [22:21:01] cool, syncing [22:21:45] Although i did notice that flow is not purging varnish cache properly, which is :S [22:23:11] Oh nevermind, it is just sorted differently for me when logged out [22:23:58] (03PS2) 10Zabe: Stop writing to cuc_user and cuc_user_text in group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885416 (https://phabricator.wikimedia.org/T233004) [22:24:01] (03CR) 10Zabe: [C: 03+2] Stop writing to cuc_user and cuc_user_text in group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885416 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) 
[22:24:58] (03Merged) 10jenkins-bot: Stop writing to cuc_user and cuc_user_text in group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885416 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [22:26:37] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:884142|Restrict flow-edit-title to autoconfirmed on mediawikiwiki (T328097)]] (duration: 08m 43s) [22:26:42] T328097: make flow-edit-title be autoconfirm only on mediawikiwiki - https://phabricator.wikimedia.org/T328097 [22:26:46] bawolff, should be live :) [22:26:52] (03PS1) 10Zabe: Stop writing to cuc_comment in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885431 (https://phabricator.wikimedia.org/T233004) [22:26:55] Awesome. Thank you :) [22:27:09] (03CR) 10Zabe: [C: 03+2] Stop writing to cuc_comment in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885431 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [22:28:01] (03Merged) 10jenkins-bot: Stop writing to cuc_comment in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885431 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [22:28:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885431 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [22:28:24] !log zabe@deploy1002 Started scap: Backport for [[gerrit:885416|Stop writing to cuc_user and cuc_user_text in group0 wikis (T233004)]], [[gerrit:885431|Stop writing to cuc_comment in testwiki (T233004)]] [22:28:28] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [22:30:09] !log zabe@deploy1002 zabe: Backport for [[gerrit:885416|Stop writing to cuc_user and cuc_user_text in group0 wikis (T233004)]], [[gerrit:885431|Stop writing to cuc_comment in testwiki (T233004)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, 
mwdebug2002.codfw.wmnet [22:32:35] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2040.codfw.wmnet with reason: host reimage [22:35:41] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2040.codfw.wmnet with reason: host reimage [22:35:58] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:885416|Stop writing to cuc_user and cuc_user_text in group0 wikis (T233004)]], [[gerrit:885431|Stop writing to cuc_comment in testwiki (T233004)]] (duration: 07m 34s) [22:36:02] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [22:53:28] (03PS1) 10Bking: elastic: add udp_json_logback_compat_profile [puppet] - 10https://gerrit.wikimedia.org/r/885438 (https://phabricator.wikimedia.org/T324335) [22:53:29] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2040.codfw.wmnet with OS bullseye [22:53:36] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp2040.codfw.wmnet with OS bullseye completed: - cp2040 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... 
[22:54:10] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp2040.codfw.wmnet [22:54:39] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [22:55:50] (03PS1) 10Bking: elastic: add ESJsonLayout log config [puppet] - 10https://gerrit.wikimedia.org/r/885439 (https://phabricator.wikimedia.org/T324335) [22:56:06] (03CR) 10Ryan Kemper: [C: 03+1] "Looks good, ready to test on relforge" [puppet] - 10https://gerrit.wikimedia.org/r/885438 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking) [22:56:19] (03CR) 10Bking: [C: 03+2] elastic: add udp_json_logback_compat_profile [puppet] - 10https://gerrit.wikimedia.org/r/885438 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking) [22:57:17] mutante gonna merge your etherpad patch if that's cool [23:01:06] inflatador: yes, it is. sorry. got distracted [23:01:14] mutante np, it's merged [23:03:10] (03PS1) 10Bking: Revert "elastic: add udp_json_logback_compat_profile" [puppet] - 10https://gerrit.wikimedia.org/r/885320 [23:04:01] (03CR) 10Ryan Kemper: [C: 03+1] Revert "elastic: add udp_json_logback_compat_profile" [puppet] - 10https://gerrit.wikimedia.org/r/885320 (owner: 10Bking) [23:06:07] (03CR) 10Bking: [C: 03+2] Revert "elastic: add udp_json_logback_compat_profile" [puppet] - 10https://gerrit.wikimedia.org/r/885320 (owner: 10Bking) [23:06:23] (03PS1) 10JHathaway: Add jaeger-{builder,query,collector} [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/885441 (https://phabricator.wikimedia.org/T320553) [23:08:14] (03PS2) 10Ryan Kemper: elastic: add ESJsonLayout log config [puppet] - 10https://gerrit.wikimedia.org/r/885439 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking) [23:08:59] (03PS3) 10Ryan Kemper: elastic: add ESJsonLayout log config [puppet] - 10https://gerrit.wikimedia.org/r/885439 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking) [23:09:19] (03CR) 10JHathaway: "kindly review" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/885441 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway) [23:12:59] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3054.esams.wmnet with OS bullseye [23:13:05] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp3054.esams.wmnet with OS bullseye [23:34:36] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3054.esams.wmnet with reason: host reimage [23:35:27] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3055.esams.wmnet with OS bullseye [23:35:34] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp3055.esams.wmnet with OS bullseye [23:37:43] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3054.esams.wmnet with reason: host reimage [23:38:39] (03CR) 10RLazarus: "Please also add a test in test_main.py, where you pass a json_body through and assert that it's encoded correctly -- you can use test_form" [software/httpbb] - 10https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280) (owner: 10Ilias Sarantopoulos) [23:45:48] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp3055.esams.wmnet with OS bullseye [23:45:53] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp3055.esams.wmnet with OS bullseye executed with errors: - cp3055 (**FAIL**) - Downtimed on Icinga/Alertmanager -... 
[23:51:32] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3055.esams.wmnet with OS bullseye [23:51:38] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp3055.esams.wmnet with OS bullseye