[00:00:05] RoanKattouw and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220127T0000). [00:00:05] nn1l2: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:09] hi [00:01:48] Hello! I'm out on a walk so I can't do the deployment right now. I hope someone else is around to do it; if not, I'll be home in 30 mins [00:02:18] Thanks! [00:02:24] We're back-porting a train blocker right now so deployments probably should be on hold anyway. [00:03:07] no rush, I can wait for the whole window (60 mins) [00:03:24] just ping me when ready, please [00:08:36] CI is in a bad state; this window may or may not happen. [00:08:56] ^ nn1l2 [00:09:27] thanks for the notification! [00:09:39] Anyway, I will be still around [00:09:49] sure thing; apologies for the interruption in service. [00:21:15] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:24:41] !log restarting jenkins [00:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:25] 10SRE, 10observability: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10colewhite) This is a good step forward. Thank you! I realize deployment-prep may not be in scope for this project, but we have a vested interest in keeping [[ https://beta-logs.wikimedia.org... [00:37:28] I'm back now [00:37:39] brennen: Is the CI failure specific to wmf.19? [00:38:13] RoanKattouw: it is not, but turning it off and back on seems to have resolved CI issues for the moment [00:39:33] OK. 
Should I use the config change for this window as a guinea pig for that, or would you prefer that we delay that change to tomorrow? [00:40:40] RoanKattouw: config change for this window seems fine, go ahead. i'm also (cc: James_F) good with landing the wmf.19 patch, but will hold rolling train forward until tomorrow. [00:40:54] Great, thanks [00:41:06] i'd prefer to have whatever else is going to break during wmf.19 on group1 do so during someone's actual workday. :) [00:41:09] nn1l2: Alright, we're in business, I'll have your config patch ready for testing in a few minutes [00:41:26] brennen: Ack. [00:41:29] (03CR) 10Catrope: [C: 03+2] commonswiki: Add leg.journals.isu.ac.ir to the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757549 (https://phabricator.wikimedia.org/T300217) (owner: 104nn1l2) [00:41:29] thanks! [00:42:08] (03CR) 10Cwhite: [V: 03+2 C: 03+2] prepare for logstash 7.16.3 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/755041 (https://phabricator.wikimedia.org/T299168) (owner: 10Cwhite) [00:42:25] (03Merged) 10jenkins-bot: commonswiki: Add leg.journals.isu.ac.ir to the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757549 (https://phabricator.wikimedia.org/T300217) (owner: 104nn1l2) [00:47:15] nn1l2: Alright, test away on mwdebug1002 [00:48:43] test failed :( [00:48:45] HTTP request timed out. 
[00:48:46] There was a problem during the HTTP request: 0 Error [00:49:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:50:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:50:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:50:51] probably similar to https://phabricator.wikimedia.org/T299247#7628032 [00:51:22] second test failed too [00:51:24] I think we can revert [00:51:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:52:51] before reverting, maybe we can test it on mwdebug1001? [00:53:27] RoanKattouw, you still there? [00:53:51] Yes, I'm back. Sorry, I got distracted with something else [00:54:00] Let's try 1001 [00:54:32] nn1l2: OK, 1001 is ready for testing [00:55:43] test failed again :( [00:55:55] Can we postpone reverting to tomorrow? [00:56:11] I want to test it again some hours later [00:56:45] problems with Iranian urls are common :-( [00:57:05] Hah I can imagine [00:57:23] Thanks! [00:57:29] So the only broken thing is that if you try to upload from this URL you get an error, but otherwise the site works, right? 
[00:57:39] So, I think we are done here [00:57:41] If so, I can deploy this, and we can tweak/revert another day [00:57:43] yes exactly [00:57:47] OK [00:58:58] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:757549|commonswiki: Add leg.journals.isu.ac.ir to the wgCopyUploadsDomains allowlist (T300217)]] (duration: 00m 55s) [00:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:59:03] T300217: Add leg.journals.isu.ac.ir to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T300217 [01:00:04] twentyafterfour: Your horoscope predicts another unfortunate Phabricator update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220127T0100). [01:05:08] (03CR) 10Brennen Bearnes: "recheck" [core] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/757473 (https://phabricator.wikimedia.org/T300194) (owner: 10Ladsgroup) [01:11:31] RoanKattouw: just to be sure, the patch got deployed. Yeah? [01:11:40] Yes it did [01:11:44] Thanks! [01:14:38] hrm, evidently recheck comment doesn't do what i expect here. [01:17:50] it seems something is making files disappear from that jenkins agent independently and outside of the job execution [01:18:00] half-way through a random CLI command fails with MEssageEn.php missing [01:18:15] and after that the workspace clean up command fails with shell not finding any files anywhere [01:18:32] or maybe docker losing track of the harddrive [01:18:36] Krinkle: see #-releng backscroll; there were multiple agents running due to a misconfiguration. 
[01:18:51] ack [01:19:10] it's running now fwiw https://integration.wikimedia.org/zuul/#q=757473 [01:19:23] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 59.23 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [01:21:45] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 103.2 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [01:22:29] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:26:06] 10SRE, 10SRE-Access-Requests: Requesting update to SSH key and Kerberos for Joseph Seddon - https://phabricator.wikimedia.org/T299988 (10Seddon) >>! In T299988#7654203, @jhathaway wrote: > @MarkTraceur & @Ottomata please approve Just to note that @MarkTraveur approved above >>! In T299988#7649407, @MarkTr... [01:33:40] (03CR) 10Brennen Bearnes: [C: 04-2] "Will deploy this in local morning before rolling wmf.19 to group1." [core] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/757473 (https://phabricator.wikimedia.org/T300194) (owner: 10Ladsgroup) [01:34:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [01:35:45] * brennen calls it a day. 
[01:39:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [02:05:29] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:14:01] 10SRE, 10ops-codfw: Dell switches testing - https://phabricator.wikimedia.org/T290133 (10Papaul) [04:01:44] !log grafana: Temporarily silence resourceloader alert for INM satisfaction ratio, pending T298520. [04:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:01:50] T298520: Investigate INM Satisfaction alert as of 2021-12-17 - https://phabricator.wikimedia.org/T298520 [04:14:35] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:09:27] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:31:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [05:36:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [06:07:39] RECOVERY - dump of es4 in eqiad on alert1001 is OK: Last dump for es4 at eqiad (es1022.eqiad.wmnet) taken on 2022-01-26 11:06:33 (2674 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [06:22:35] RECOVERY - Maps - OSM synchronization lag - eqiad on alert1001 is OK: (C)2.592e+05 ge (W)1.764e+05 ge 1.742e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 
[06:54:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [06:54:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [06:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T298559)', diff saved to https://phabricator.wikimedia.org/P19376 and previous config saved to /var/cache/conftool/dbconfig/20220127-065406-marostegui.json [06:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:11] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [06:55:10] (03CR) 10Elukey: "Thanks a lot Cole!" [puppet] - 10https://gerrit.wikimedia.org/r/757535 (owner: 10Cwhite) [06:55:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298559)', diff saved to https://phabricator.wikimedia.org/P19377 and previous config saved to /var/cache/conftool/dbconfig/20220127-065519-marostegui.json [06:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [06:55:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [06:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [07:04:24] !log 
marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [07:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T285149)', diff saved to https://phabricator.wikimedia.org/P19378 and previous config saved to /var/cache/conftool/dbconfig/20220127-070428-marostegui.json [07:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:34] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [07:04:42] (03PS1) 10Marostegui: Revert "es2021: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/757475 [07:05:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove weight from es1021', diff saved to https://phabricator.wikimedia.org/P19379 and previous config saved to /var/cache/conftool/dbconfig/20220127-070532-marostegui.json [07:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove watchlist from s8 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P19380 and previous config saved to /var/cache/conftool/dbconfig/20220127-070557-marostegui.json [07:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:01] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [07:06:05] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={LIST,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [07:06:42] (03CR) 10Marostegui: [C: 03+2] Revert "es2021: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/757475 (owner: 
10Marostegui) [07:07:35] 10SRE, 10observability: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10elukey) When I worked with John on T296089 we wanted to give a way to deploy bundles across realms, so it is in scope to migrate deployment-prep as well if it is a critical piece of your testin... [07:08:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1131 T299479', diff saved to https://phabricator.wikimedia.org/P19381 and previous config saved to /var/cache/conftool/dbconfig/20220127-070821-marostegui.json [07:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:26] T299479: Upgrade s6 to Bullseye - https://phabricator.wikimedia.org/T299479 [07:08:27] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [07:08:57] (03PS1) 10Marostegui: db1131: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/757558 (https://phabricator.wikimedia.org/T299479) [07:10:06] (03CR) 10Marostegui: [C: 03+2] db1131: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/757558 (https://phabricator.wikimedia.org/T299479) (owner: 10Marostegui) [07:10:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P19382 and previous config saved to /var/cache/conftool/dbconfig/20220127-071023-marostegui.json [07:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1131.eqiad.wmnet with OS bullseye [07:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T285149)', diff saved to 
https://phabricator.wikimedia.org/P19383 and previous config saved to /var/cache/conftool/dbconfig/20220127-071355-marostegui.json [07:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:00] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [07:17:03] !log marostegui@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1131.eqiad.wmnet with OS bullseye [07:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1131.eqiad.wmnet with OS bullseye [07:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P19384 and previous config saved to /var/cache/conftool/dbconfig/20220127-072528-marostegui.json [07:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P19385 and previous config saved to /var/cache/conftool/dbconfig/20220127-072900-marostegui.json [07:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298559)', diff saved to https://phabricator.wikimedia.org/P19386 and previous config saved to /var/cache/conftool/dbconfig/20220127-074033-marostegui.json [07:40:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [07:40:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [07:40:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:38] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [07:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [07:40:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [07:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [07:40:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [07:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T298559)', diff saved to https://phabricator.wikimedia.org/P19387 and previous config saved to /var/cache/conftool/dbconfig/20220127-074101-marostegui.json [07:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298559)', diff saved to https://phabricator.wikimedia.org/P19388 and previous config saved to /var/cache/conftool/dbconfig/20220127-074214-marostegui.json [07:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:05] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P19389 and previous config saved to /var/cache/conftool/dbconfig/20220127-074404-marostegui.json [07:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1131.eqiad.wmnet with OS bullseye [07:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:03] (03PS1) 10Ladsgroup: Don't consider lock waits to be write queries [core] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/757476 (https://phabricator.wikimedia.org/T300194) [07:52:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 1%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19390 and previous config saved to /var/cache/conftool/dbconfig/20220127-075229-root.json [07:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:37] (03PS1) 10Marostegui: Revert "db1131: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/757477 [07:53:15] (03CR) 10Ladsgroup: [C: 03+2] Don't consider lock waits to be write queries [core] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/757476 (https://phabricator.wikimedia.org/T300194) (owner: 10Ladsgroup) [07:55:20] (03CR) 10Marostegui: [C: 03+2] Revert "db1131: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/757477 (owner: 10Marostegui) [07:57:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P19391 and previous config saved to /var/cache/conftool/dbconfig/20220127-075718-marostegui.json [07:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T285149)', diff saved to 
https://phabricator.wikimedia.org/P19392 and previous config saved to /var/cache/conftool/dbconfig/20220127-075909-marostegui.json [07:59:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [07:59:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [07:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:14] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [07:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:13] (03Abandoned) 10Ladsgroup: Revert "rdbms: cleanup the use of QUERY_ flags to query() in Database" [core] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/757473 (https://phabricator.wikimedia.org/T300194) (owner: 10Ladsgroup) [08:07:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 5%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19393 and previous config saved to /var/cache/conftool/dbconfig/20220127-080733-root.json [08:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [08:07:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [08:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:04] (03Merged) 10jenkins-bot: Don't consider lock waits to be write queries [core] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/757476 
(https://phabricator.wikimedia.org/T300194) (owner: 10Ladsgroup) [08:12:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P19394 and previous config saved to /var/cache/conftool/dbconfig/20220127-081223-marostegui.json [08:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:43] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.19/includes/libs/rdbms/database/Database.php: Backport: [[gerrit:757476|Don't consider lock waits to be write queries (T300194)]] (duration: 00m 52s) [08:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:47] T300194: Wikimedia\Rdbms\DBTransactionSizeError: Transaction spent 3.6s in writes, exceeding the 3s limit - https://phabricator.wikimedia.org/T300194 [08:16:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [08:16:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [08:16:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T285149)', diff saved to https://phabricator.wikimedia.org/P19395 and 
previous config saved to /var/cache/conftool/dbconfig/20220127-081622-marostegui.json [08:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:27] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [08:16:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [08:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [08:17:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [08:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [08:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:07] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [08:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:53] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[08:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19396 and previous config saved to /var/cache/conftool/dbconfig/20220127-082236-root.json [08:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298559)', diff saved to https://phabricator.wikimedia.org/P19397 and previous config saved to /var/cache/conftool/dbconfig/20220127-082728-marostegui.json [08:27:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [08:27:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [08:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:33] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [08:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T298559)', diff saved to https://phabricator.wikimedia.org/P19398 and previous config saved to /var/cache/conftool/dbconfig/20220127-082735-marostegui.json [08:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:44] (03CR) 10Muehlenhoff: Create a separate puppetboard-idptest.wikimedia.org vhost in idp-staging (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/757450 (owner: 10Muehlenhoff) [08:27:48] (03PS3) 10Muehlenhoff: Create a separate puppetboard-idptest.wikimedia.org vhost in idp-staging [puppet] - 
10https://gerrit.wikimedia.org/r/757450 [08:28:40] (03CR) 10jerkins-bot: [V: 04-1] Create a separate puppetboard-idptest.wikimedia.org vhost in idp-staging [puppet] - 10https://gerrit.wikimedia.org/r/757450 (owner: 10Muehlenhoff) [08:28:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298559)', diff saved to https://phabricator.wikimedia.org/P19399 and previous config saved to /var/cache/conftool/dbconfig/20220127-082847-marostegui.json [08:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:53] (03CR) 10JMeybohm: [C: 04-1] "I think you might be rewriting this over and over again because you have this commit in different local branches or something. "git review" [puppet] - 10https://gerrit.wikimedia.org/r/754960 (https://phabricator.wikimedia.org/T288345) (owner: 10AOkoth) [08:33:32] !log uploaded scap 4.2.1 to apt.wikimedia.org - T300058 [08:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:36] T300058: Deploy Scap version 4.2.1 - https://phabricator.wikimedia.org/T300058 [08:34:22] (03PS4) 10Muehlenhoff: Create a separate puppetboard-idptest.wikimedia.org vhost in idp-staging [puppet] - 10https://gerrit.wikimedia.org/r/757450 [08:37:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 20%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19400 and previous config saved to /var/cache/conftool/dbconfig/20220127-083740-root.json [08:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:37] !log updated scap to 4.2.1 on A:mw-canary, A:parsoid-canary, A:mw-jobrunner-canary, A:restbase-canary - T300058 [08:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:43] T300058: Deploy Scap version 4.2.1 - https://phabricator.wikimedia.org/T300058 [08:40:58] !log jayme@deploy1002 Started deploy [restbase/deploy@0848b15]: scap testing [08:41:01] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:03] !log jayme@deploy1002 Finished deploy [restbase/deploy@0848b15]: scap testing (duration: 00m 05s) [08:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:27] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/757450 (owner: 10Muehlenhoff) [08:43:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P19401 and previous config saved to /var/cache/conftool/dbconfig/20220127-084352-marostegui.json [08:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19402 and previous config saved to /var/cache/conftool/dbconfig/20220127-085244-root.json [08:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:40] (03PS3) 10Filippo Giunchedi: site: add Prometheus role to eqiad hardware [puppet] - 10https://gerrit.wikimedia.org/r/756604 (https://phabricator.wikimedia.org/T296199) [08:58:42] (03PS1) 10Filippo Giunchedi: conftool: add prometheus200[56] [puppet] - 10https://gerrit.wikimedia.org/r/757612 (https://phabricator.wikimedia.org/T296199) [08:58:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P19403 and previous config saved to /var/cache/conftool/dbconfig/20220127-085857-marostegui.json [08:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:54] (03PS9) 10Thiemo Kreuz (WMDE): Make use of the ?? operator in some more situations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740305 [09:00:07] (03PS3) 10Thiemo Kreuz (WMDE): Make use of the ?? 
operator in more trivial situations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740304 [09:01:08] (03PS3) 10JMeybohm: Upgrade codfw kubernetes masters to tainted full nodes [puppet] - 10https://gerrit.wikimedia.org/r/757434 (https://phabricator.wikimedia.org/T290967) [09:01:57] (03PS2) 10Filippo Giunchedi: conftool: add prometheus200[56] [puppet] - 10https://gerrit.wikimedia.org/r/757612 (https://phabricator.wikimedia.org/T296199) [09:02:57] (03CR) 10Filippo Giunchedi: [C: 03+2] conftool: add prometheus200[56] [puppet] - 10https://gerrit.wikimedia.org/r/757612 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [09:04:13] (03PS1) 10Hashar: ci: ensure rsync is on all WMCS CI agents [puppet] - 10https://gerrit.wikimedia.org/r/757613 (https://phabricator.wikimedia.org/T300236) [09:06:58] (03CR) 10Muehlenhoff: [C: 03+2] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/757613 (https://phabricator.wikimedia.org/T300236) (owner: 10Hashar) [09:07:45] (03PS10) 10Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 [09:07:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19404 and previous config saved to /var/cache/conftool/dbconfig/20220127-090747-root.json [09:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:14] (03PS2) 10DCausse: aptrepo: add an elastic68 component [puppet] - 10https://gerrit.wikimedia.org/r/757046 (https://phabricator.wikimedia.org/T295666) [09:14:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298559)', diff saved to https://phabricator.wikimedia.org/P19405 and previous config saved to /var/cache/conftool/dbconfig/20220127-091401-marostegui.json [09:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:07] T298559: Fix 
mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [09:14:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [09:14:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [09:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance [09:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [09:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:14:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:14:37] (03PS4) 10JMeybohm: Upgrade codfw kubernetes masters to tainted full nodes [puppet] - 10https://gerrit.wikimedia.org/r/757434 (https://phabricator.wikimedia.org/T290967) [09:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:39] (03PS1) 10JMeybohm: Upgrade eqiad kubernetes masters to tainted full nodes [puppet] - 10https://gerrit.wikimedia.org/r/757615 (https://phabricator.wikimedia.org/T290967) [09:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T298559)', diff saved to https://phabricator.wikimedia.org/P19406 and previous config saved 
to /var/cache/conftool/dbconfig/20220127-091440-marostegui.json [09:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298559)', diff saved to https://phabricator.wikimedia.org/P19407 and previous config saved to /var/cache/conftool/dbconfig/20220127-091453-marostegui.json [09:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:07] (03PS1) 10Muehlenhoff: Remove now obsolete template [puppet] - 10https://gerrit.wikimedia.org/r/757616 [09:16:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T285149)', diff saved to https://phabricator.wikimedia.org/P19408 and previous config saved to /var/cache/conftool/dbconfig/20220127-091641-marostegui.json [09:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:46] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [09:17:26] (03CR) 10Muehlenhoff: [C: 03+2] "Looks good, merging." [puppet] - 10https://gerrit.wikimedia.org/r/757046 (https://phabricator.wikimedia.org/T295666) (owner: 10DCausse) [09:17:28] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [09:18:53] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti1007.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [09:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1007.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [09:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:47] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) One more server is ready and downtimed; ganeti1007 [09:20:02] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) [09:22:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [09:22:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19409 and previous config saved to /var/cache/conftool/dbconfig/20220127-092251-root.json [09:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:38] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2078,2132].codfw.wmnet,db[1117,1128,1159].eqiad.wmnet with reason: Primary switchover m1 T299624 [09:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:43] T299624: Switchover m1 master (db1159 -> db1128) - https://phabricator.wikimedia.org/T299624 [09:23:43] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2078,2132].codfw.wmnet,db[1117,1128,1159].eqiad.wmnet with reason: Primary switchover m1 T299624 [09:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:32] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): spicerack: introduce GridEngine controller - https://phabricator.wikimedia.org/T300032 (10Volans) p:05Triage→03Medium [09:27:15] !log filippo@puppetmaster1001 conftool action : set/weight=10; selector: name=prometheus2006.codfw.wmnet [09:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:26] !log filippo@puppetmaster1001 conftool action : set/weight=10; selector: name=prometheus2005.codfw.wmnet [09:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:11] (03PS3) 10Marostegui: mariadb: Promote db1128 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/757389 (https://phabricator.wikimedia.org/T299624) [09:29:58] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P19410 and previous config saved to /var/cache/conftool/dbconfig/20220127-092957-marostegui.json [09:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P19411 and previous config saved to /var/cache/conftool/dbconfig/20220127-093146-marostegui.json [09:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:36] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/757450 (owner: 10Muehlenhoff) [09:35:36] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1128 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/757389 (https://phabricator.wikimedia.org/T299624) (owner: 10Marostegui) [09:36:24] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/757616 (owner: 10Muehlenhoff) [09:37:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19412 and previous config saved to /var/cache/conftool/dbconfig/20220127-093755-root.json [09:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1027.eqiad.wmnet [09:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:53] (03CR) 10Ayounsi: [C: 03+1] Add k8s masters in codfw eBGP config (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/757437 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [09:42:54] (03PS1) 10Marostegui: switchover-tmpl.sh: Tendril is no more [software] - 10https://gerrit.wikimedia.org/r/757617 (https://phabricator.wikimedia.org/T297605) [09:44:08] (03CR) 10Marostegui: [C: 03+2] 
switchover-tmpl.sh: Tendril is no more [software] - 10https://gerrit.wikimedia.org/r/757617 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [09:44:38] (03Merged) 10jenkins-bot: switchover-tmpl.sh: Tendril is no more [software] - 10https://gerrit.wikimedia.org/r/757617 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [09:45:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P19413 and previous config saved to /var/cache/conftool/dbconfig/20220127-094502-marostegui.json [09:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P19414 and previous config saved to /var/cache/conftool/dbconfig/20220127-094651-marostegui.json [09:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1027.eqiad.wmnet [09:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:42] !log hnowlan@deploy1002 Started deploy [restbase/deploy@0848b15] (dev-cluster): (no justification provided) [09:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:56] !log hnowlan@deploy1002 Finished deploy [restbase/deploy@0848b15] (dev-cluster): (no justification provided) (duration: 00m 14s) [09:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1027.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [09:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 75%: repooling after reimage', diff saved to 
https://phabricator.wikimedia.org/P19415 and previous config saved to /var/cache/conftool/dbconfig/20220127-095258-root.json [09:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:11] !log added ganeti1027 to Ganeti eqiad cluster T293909 [09:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:14] T293909: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 [09:53:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1027.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [09:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:38] !log Stopped Bacula Director Daemon service at backup1001 T299624 [09:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:43] T299624: Switchover m1 master (db1159 -> db1128) - https://phabricator.wikimedia.org/T299624 [09:58:49] (03PS13) 10Jbond: RepoSync: add new class to mana syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 [09:59:24] (03CR) 10Elukey: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/757434 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [09:59:51] jouncebot: nowandnext [09:59:51] No deployments scheduled for the next 1 hour(s) and 0 minute(s) [09:59:51] In 1 hour(s) and 0 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220127T1100) [10:00:02] !log Failover m1 from db1159 to db1128 - T299624 [10:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298559)', diff saved to https://phabricator.wikimedia.org/P19416 and previous config saved to /var/cache/conftool/dbconfig/20220127-100007-marostegui.json [10:00:08] !log marostegui@cumin1001 START - Cookbook 
sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [10:00:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [10:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:11] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [10:00:11] o/ [10:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:14] o/ [10:00:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T298559)', diff saved to https://phabricator.wikimedia.org/P19417 and previous config saved to /var/cache/conftool/dbconfig/20220127-100014-marostegui.json [10:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:19] going for it [10:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:51] all done [10:00:53] checking services [10:01:05] prometheus monitoring may complain while bacula is down, that is expected [10:01:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298559)', diff saved to https://phabricator.wikimedia.org/P19418 and previous config saved to /var/cache/conftool/dbconfig/20220127-100127-marostegui.json [10:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:32] etherpad might need a restart [10:01:35] on my way! 
[10:01:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T285149)', diff saved to https://phabricator.wikimedia.org/P19419 and previous config saved to /var/cache/conftool/dbconfig/20220127-100155-marostegui.json [10:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [10:02:00] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [10:02:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [10:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance [10:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance [10:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:28] etherpad is back [10:02:58] librenms looks good [10:03:03] welcome back etherpad [10:03:09] glorious victory [10:03:17] any service whose "expert" is not around I can check? [10:03:52] orchestrator is clean [10:03:59] jynus: I am thinking about cas/pki [10:04:03] but not sure how we can test them [10:04:13] jbond: moritzm can you confirm those are ok? ^ [10:04:51] jynus: want me to merge https://gerrit.wikimedia.org/r/755960 ? 
[10:04:52] (03CR) 10jerkins-bot: [V: 04-1] RepoSync: add new class to mana syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (owner: 10Jbond) [10:05:00] marostegui: sure [10:05:12] (03CR) 10Marostegui: [C: 03+2] dbbackups: Manually switchover primary stats db db1159 -> db1128 [puppet] - 10https://gerrit.wikimedia.org/r/755960 (https://phabricator.wikimedia.org/T299624) (owner: 10Jcrespo) [10:05:34] jynus: done [10:06:51] ok, let me see if the checks are working on the right db [10:06:55] (03PS1) 10Filippo Giunchedi: snmp_exporter: allow polling from all prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/757618 (https://phabricator.wikimedia.org/T207292) [10:06:57] (03PS1) 10Marostegui: db1159: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/757619 (https://phabricator.wikimedia.org/T299624) [10:07:10] 10SRE, 10Cloud-Services, 10Datasets-General-or-Unknown, 10affects-Kiwix-and-openZIM: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (10Kelson) @ArielGlenn Thank you for putting WMCS in the loop. In which timeline this refresh should happen? I guess nothing will be done a... 
[10:07:56] I guess puppet must run first [10:08:00] doing a manual run [10:08:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19420 and previous config saved to /var/cache/conftool/dbconfig/20220127-100802-root.json [10:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:41] bacula can be down for now until everything else is checked as working [10:08:57] (03PS1) 10Ladsgroup: maintain-views: Add linktarget [puppet] - 10https://gerrit.wikimedia.org/r/757622 (https://phabricator.wikimedia.org/T299416) [10:09:00] jynus: from what I can see everything is fine apart from pki and cas that I don't know how to test [10:09:03] jbond: moritzm ^ [10:09:17] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33471/console" [puppet] - 10https://gerrit.wikimedia.org/r/757618 (https://phabricator.wikimedia.org/T207292) (owner: 10Filippo Giunchedi) [10:10:10] (03CR) 10Marostegui: [C: 03+2] db1159: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/757619 (https://phabricator.wikimedia.org/T299624) (owner: 10Marostegui) [10:11:16] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] snmp_exporter: allow polling from all prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/757618 (https://phabricator.wikimedia.org/T207292) (owner: 10Filippo Giunchedi) [10:11:23] as a side note, I believe db1159 stopped paging and db1128 started FYI marostegui [10:11:40] as in, the future, not now [10:13:03] all db backups checks on the active icinga instance point now to db1128 too [10:14:34] jynus: yep, and that's good [10:15:40] dbprovs updated too after puppet run [10:15:45] will restart bacula now [10:16:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P19421 and 
previous config saved to /var/cache/conftool/dbconfig/20220127-101631-marostegui.json [10:16:32] 10SRE, 10Cloud-Services, 10Datasets-General-or-Unknown, 10affects-Kiwix-and-openZIM: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (10ArielGlenn) I don't know details myself but the relevant task is T286588 [10:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:46] 10SRE, 10Observability-Metrics, 10Patch-For-Review, 10User-fgiunchedi: Review prometheus_nodes params - https://phabricator.wikimedia.org/T207292 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is done (i.e. access from `prometheus_nodes` is implicit and no longer needed in individual profiles)... [10:16:55] (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [10:17:48] !log Started Bacula Director Daemon service at backup1001 T299624 [10:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:53] T299624: Switchover m1 master (db1159 -> db1128) - https://phabricator.wikimedia.org/T299624 [10:20:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [10:20:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [10:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T285149)', diff saved to https://phabricator.wikimedia.org/P19422 and previous config saved to /var/cache/conftool/dbconfig/20220127-102049-marostegui.json [10:20:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:54] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [10:21:55] (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [10:26:06] (03CR) 10DCausse: sre.wdqs.data-reload: few fixes and cleanups (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/753426 (owner: 10DCausse) [10:26:08] (03PS4) 10DCausse: sre.wdqs.data-reload: few fixes and cleanups [cookbooks] - 10https://gerrit.wikimedia.org/r/753426 [10:31:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P19423 and previous config saved to /var/cache/conftool/dbconfig/20220127-103136-marostegui.json [10:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:01] (03CR) 10Vgutierrez: [C: 03+2] cache: Provide a text_envoy role [puppet] - 10https://gerrit.wikimedia.org/r/757415 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [10:32:35] (03PS4) 10Filippo Giunchedi: site: add Prometheus role to eqiad hardware [puppet] - 10https://gerrit.wikimedia.org/r/756604 (https://phabricator.wikimedia.org/T296199) [10:32:37] (03PS1) 10Filippo Giunchedi: hieradata: swap prometheus2003 with prometheus2005 [puppet] - 10https://gerrit.wikimedia.org/r/757623 (https://phabricator.wikimedia.org/T296199) [10:35:50] !log creating linktarget table everywhere (T299416) [10:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:55] T299416: Normalize link tables: Create linktarget table - https://phabricator.wikimedia.org/T299416 [10:36:25] (03PS1) 10Marostegui: db1159: Move it to m2 [puppet] - 10https://gerrit.wikimedia.org/r/757626 
(https://phabricator.wikimedia.org/T300243) [10:37:52] (03CR) 10Muehlenhoff: [C: 03+2] Create a separate puppetboard-idptest.wikimedia.org vhost in idp-staging [puppet] - 10https://gerrit.wikimedia.org/r/757450 (owner: 10Muehlenhoff) [10:38:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1159.eqiad.wmnet with OS bullseye [10:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:16] (03Abandoned) 10Ladsgroup: maintain-views: Add linktarget [puppet] - 10https://gerrit.wikimedia.org/r/757622 (https://phabricator.wikimedia.org/T299416) (owner: 10Ladsgroup) [10:39:24] (03CR) 10Marostegui: [C: 03+2] db1159: Move it to m2 [puppet] - 10https://gerrit.wikimedia.org/r/757626 (https://phabricator.wikimedia.org/T300243) (owner: 10Marostegui) [10:43:19] (03PS1) 10Vgutierrez: site: Reimage cp4031 as cache::text_envoy [puppet] - 10https://gerrit.wikimedia.org/r/757627 (https://phabricator.wikimedia.org/T271421) [10:46:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T285149)', diff saved to https://phabricator.wikimedia.org/P19424 and previous config saved to /var/cache/conftool/dbconfig/20220127-104618-marostegui.json [10:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:23] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [10:46:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298559)', diff saved to https://phabricator.wikimedia.org/P19425 and previous config saved to /var/cache/conftool/dbconfig/20220127-104641-marostegui.json [10:46:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [10:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db1158.eqiad.wmnet with reason: Maintenance [10:46:46] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [10:46:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T298559)', diff saved to https://phabricator.wikimedia.org/P19426 and previous config saved to /var/cache/conftool/dbconfig/20220127-104654-marostegui.json [10:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es[2024-2025].codfw.wmnet with reason: Reimage of the master T300006 [10:47:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es[2024-2025].codfw.wmnet with reason: Reimage of the master T300006 [10:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:41] T300006: Upgrade es5 to Bullseye - https://phabricator.wikimedia.org/T300006 [10:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2023.codfw.wmnet with reason: Maintenance [10:48:50] !log ladsgroup@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2023.codfw.wmnet with reason: Maintenance [10:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:17] (03PS1) 10Ssingh: acme_chief: authorize doh600* hosts for Wikidough [puppet] - 10https://gerrit.wikimedia.org/r/757628 (https://phabricator.wikimedia.org/T300156) [10:50:02] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33472/console" [puppet] - 10https://gerrit.wikimedia.org/r/757628 (https://phabricator.wikimedia.org/T300156) (owner: 10Ssingh) [10:50:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host es2023.codfw.wmnet with OS bullseye [10:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1165 T299479', diff saved to https://phabricator.wikimedia.org/P19427 and previous config saved to /var/cache/conftool/dbconfig/20220127-105223-marostegui.json [10:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:28] T299479: Upgrade s6 to Bullseye - https://phabricator.wikimedia.org/T299479 [10:52:45] jouncebot: now [10:52:46] No deployments scheduled for the next 0 hour(s) and 7 minute(s) [10:52:49] jouncebot: next [10:52:49] In 0 hour(s) and 7 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220127T1100) [10:53:02] I’ll test something on mwdebug1001 for a bit [10:54:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298559)', diff saved to https://phabricator.wikimedia.org/P19428 and previous config saved to /var/cache/conftool/dbconfig/20220127-105408-marostegui.json [10:54:10] (03PS1) 10Marostegui: db1165: Disable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/757629 (https://phabricator.wikimedia.org/T299479) [10:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:13] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [10:54:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1165.eqiad.wmnet with OS bullseye [10:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:05] (03CR) 10Marostegui: [C: 03+2] db1165: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/757629 (https://phabricator.wikimedia.org/T299479) (owner: 10Marostegui) [10:55:18] (03CR) 10Filippo Giunchedi: "LGTM overall! Thank you for kickstarting this" [alerts] - 10https://gerrit.wikimedia.org/r/757489 (https://phabricator.wikimedia.org/T294564) (owner: 10Volans) [10:56:26] !log sukhe@cumin1001 START - Cookbook sre.ganeti.makevm for new host doh6001.wikimedia.org [10:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:33] (03PS1) 10Muehlenhoff: Update service entry in idp-test for Puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/757630 [10:57:40] (03CR) 10Filippo Giunchedi: [C: 04-1] "What Cole said, and find won't recursively delete directories, so deleting files first is required and then empty directories" [puppet] - 10https://gerrit.wikimedia.org/r/757498 (https://phabricator.wikimedia.org/T300056) (owner: 10Herron) [10:58:55] 10SRE, 10Traffic: Remove old and unused libvarnishapi - https://phabricator.wikimedia.org/T300247 (10MMandere) [11:00:05] mvolz: That opportune time is upon us again. Time for a Services – Citoid / Zotero deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220127T1100). 
[11:00:44] (I’m done on mwdebug1001 again) [11:01:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P19429 and previous config saved to /var/cache/conftool/dbconfig/20220127-110123-marostegui.json [11:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:58] (03CR) 10Muehlenhoff: Update service entry in idp-test for Puppetboard (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757630 (owner: 10Muehlenhoff) [11:07:07] !log sukhe@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh6001.wikimedia.org [11:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:58] (03CR) 10Vgutierrez: [C: 03+1] acme_chief: authorize doh600* hosts for Wikidough [puppet] - 10https://gerrit.wikimedia.org/r/757628 (https://phabricator.wikimedia.org/T300156) (owner: 10Ssingh) [11:09:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P19431 and previous config saved to /var/cache/conftool/dbconfig/20220127-110913-marostegui.json [11:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1159.eqiad.wmnet with OS bullseye [11:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:56] (03PS1) 10JMeybohm: kubernetes::master: Remove expose_puppet_certs parameter [puppet] - 10https://gerrit.wikimedia.org/r/757631 (https://phabricator.wikimedia.org/T290967) [11:10:12] (03PS1) 10Muehlenhoff: Update hook to point to 6.8.23 packages [puppet] - 10https://gerrit.wikimedia.org/r/757632 [11:10:50] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:10:55] ^ expected 
[11:11:04] 10SRE, 10Math, 10Wikimedia-Mailing-lists: New mailing list for Wikimedia community group math - https://phabricator.wikimedia.org/T300239 (10Physikerwelt) >>! In T300239#7655671, @Ladsgroup wrote: > - Do you want a public mailing list with archives? Just to confirm Yes! This is most crucial. Similar to the... [11:11:59] !log depool cp4031 to be reimaged as cache::text_envoy - T271421 [11:12:00] (03PS1) 10Ssingh: install_server: add MAC address of doh6001 [puppet] - 10https://gerrit.wikimedia.org/r/757633 (https://phabricator.wikimedia.org/T283192) [11:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:04] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 [11:12:13] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:13:21] (03PS2) 10Ssingh: install_server: add MAC address of doh6001 [puppet] - 10https://gerrit.wikimedia.org/r/757633 (https://phabricator.wikimedia.org/T300156) [11:13:39] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:14:19] 10SRE, 10Traffic: Remove old and unused libvarnishapi - https://phabricator.wikimedia.org/T300247 (10elukey) Thanks a lot for working on this. Can we also remove the old lib from apt and puppet? ` root@apt1001:/srv/wikimedia# reprepro lsbycomponent libvarnishapi1 libvarnishapi1 | 5.1.3-1wm11 | stretch-wikimed... 
[11:14:57] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:14:57] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:15:05] (03CR) 10Ssingh: [C: 03+2] install_server: add MAC address of doh6001 [puppet] - 10https://gerrit.wikimedia.org/r/757633 (https://phabricator.wikimedia.org/T300156) (owner: 10Ssingh) [11:15:17] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:16:03] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:16:03] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:16:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P19432 and previous config saved to /var/cache/conftool/dbconfig/20220127-111628-marostegui.json [11:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:33] (03PS1) 10Muehlenhoff: Make ganeti1028 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/757634 [11:18:59] (03PS1) 10Marostegui: Revert "db1165: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/757478 [11:19:09] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:19:31] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:19:31] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:20:57] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 1%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19433 and previous config saved to /var/cache/conftool/dbconfig/20220127-112057-root.json [11:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P19434 and previous config saved to /var/cache/conftool/dbconfig/20220127-112418-marostegui.json [11:24:20] PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:24:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1165.eqiad.wmnet with OS bullseye [11:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:34] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp4031 as cache::text_envoy [puppet] - 10https://gerrit.wikimedia.org/r/757627 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:24:37] (03PS2) 10Muehlenhoff: Make ganeti1028 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/757634 [11:24:47] (03PS2) 10Vgutierrez: site: Reimage cp4031 as cache::text_envoy [puppet] - 10https://gerrit.wikimedia.org/r/757627 (https://phabricator.wikimedia.org/T271421) [11:25:22] (03PS1) 10Ssingh: Add Wikidough's /24 to bgp_out in drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/757635 [11:28:55] (03PS1) 10Ssingh: site: add role for doh6001 (Wikidough drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/757636 (https://phabricator.wikimedia.org/T300158) [11:29:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2023.codfw.wmnet with OS bullseye [11:29:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:24] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp4031.ulsfo.wmnet with OS buster [11:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:33] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp4031.ulsfo.wmnet with OS buster [11:29:54] (03PS2) 10Ssingh: site: add role for doh6001 (Wikidough drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/757636 (https://phabricator.wikimedia.org/T300156) [11:31:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T285149)', diff saved to https://phabricator.wikimedia.org/P19435 and previous config saved to /var/cache/conftool/dbconfig/20220127-113132-marostegui.json [11:31:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [11:31:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [11:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:38] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [11:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T285149)', diff saved to https://phabricator.wikimedia.org/P19436 and previous config saved to /var/cache/conftool/dbconfig/20220127-113140-marostegui.json [11:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] 
toolforge: automated-tests: introduce check to verify default grid release [puppet] - 10https://gerrit.wikimedia.org/r/757499 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [11:34:17] (03PS3) 10Muehlenhoff: Make ganeti1028 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/757634 [11:34:40] 10SRE, 10Traffic: Remove old and unused libvarnishapi - https://phabricator.wikimedia.org/T300247 (10MMandere) @elukey, thanks for pointing that out. Yes we can do that, considering `libvarnishapi1` was only required when we were using `varnish 5.1.x` of which we no longer use in production, it is safe to hav... [11:36:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 5%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19437 and previous config saved to /var/cache/conftool/dbconfig/20220127-113600-root.json [11:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298559)', diff saved to https://phabricator.wikimedia.org/P19438 and previous config saved to /var/cache/conftool/dbconfig/20220127-113924-marostegui.json [11:39:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [11:39:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [11:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:29] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [11:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T298559)', diff saved to https://phabricator.wikimedia.org/P19439 and previous config saved 
to /var/cache/conftool/dbconfig/20220127-113931-marostegui.json [11:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298559)', diff saved to https://phabricator.wikimedia.org/P19440 and previous config saved to /var/cache/conftool/dbconfig/20220127-114044-marostegui.json [11:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:08] (03PS3) 10Ssingh: site: add role for doh6001 (Wikidough drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/757636 (https://phabricator.wikimedia.org/T300156) [11:43:17] (03CR) 10MMandere: [C: 03+2] site: add role for doh6001 (Wikidough drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/757636 (https://phabricator.wikimedia.org/T300156) (owner: 10Ssingh) [11:44:32] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:45:48] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:45:50] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:46:48] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:47:24] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:50:29] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:50:32] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:50:54] 
(03CR) 10Muehlenhoff: [C: 03+2] Make ganeti1028 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/757634 (owner: 10Muehlenhoff) [11:51:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19441 and previous config saved to /var/cache/conftool/dbconfig/20220127-115105-root.json [11:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:34] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:52:36] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:53:25] 10SRE, 10SRE-Access-Requests, 10Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (10AniketArs) ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIGk3CgXqD8AxkboJ22zxWQ1CYDhaRuSgiV2A32G+Z9SL aniket@ars [11:55:14] (03CR) 10Ayounsi: [C: 03+1] "LGTM, note that this won't be effective for now as we don't have the routers yet." 
[homer/public] - 10https://gerrit.wikimedia.org/r/757635 (owner: 10Ssingh) [11:55:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P19442 and previous config saved to /var/cache/conftool/dbconfig/20220127-115548-marostegui.json [11:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:14] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:57:18] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:57:29] (03CR) 10Marostegui: [C: 03+2] Revert "db1165: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/757478 (owner: 10Marostegui) [12:00:05] Amir1, Lucas_WMDE, and apergos: #bothumor I � Unicode. All rise for UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220127T1200). [12:00:30] looks like there’s nothing to deploy and nobody to train this time [12:00:33] no patches in the window, no trainees signed up [12:00:38] ok [12:00:57] see you next time :-D [12:01:02] ^^ [12:01:06] * Lucas_WMDE goes for lunch [12:01:15] going to make soup! 
[12:01:27] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init [12:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:56] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:03:00] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:03:22] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:04:35] (03PS1) 10Ladsgroup: es1023: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/757638 (https://phabricator.wikimedia.org/T300006) [12:05:25] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] es1023: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/757638 (https://phabricator.wikimedia.org/T300006) (owner: 10Ladsgroup) [12:06:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 20%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19443 and previous config saved to /var/cache/conftool/dbconfig/20220127-120608-root.json [12:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1023.eqiad.wmnet with reason: Maintenance [12:06:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1023.eqiad.wmnet with reason: Maintenance [12:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1023 (T300006)', diff saved to https://phabricator.wikimedia.org/P19444 and previous config saved to /var/cache/conftool/dbconfig/20220127-120648-ladsgroup.json [12:06:53] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:53] T300006: Upgrade es5 to Bullseye - https://phabricator.wikimedia.org/T300006 [12:09:28] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4031.ulsfo.wmnet with OS buster [12:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:37] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp4031.ulsfo.wmnet with OS buster completed: - cp4031 (**WARN*... [12:10:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P19445 and previous config saved to /var/cache/conftool/dbconfig/20220127-121053-marostegui.json [12:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:00] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:18:04] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:18:10] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:18:34] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:18:50] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:18:50] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:20:20] RECOVERY - haproxy 
failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:20:24] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:20:32] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:20:58] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:21:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19446 and previous config saved to /var/cache/conftool/dbconfig/20220127-122113-root.json [12:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19447 and previous config saved to /var/cache/conftool/dbconfig/20220127-122157-root.json [12:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:13] (03PS1) 10Michael DiPietro: upgrade codfw1dev to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/757647 (https://phabricator.wikimedia.org/T300254) [12:25:33] !log hnowlan@cumin1001 START - Cookbook sre.hosts.decommission for hosts restbase2011.codfw.wmnet [12:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298559)', diff saved to https://phabricator.wikimedia.org/P19448 and previous config saved to /var/cache/conftool/dbconfig/20220127-122558-marostegui.json [12:26:00] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:26:02] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:03] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [12:26:06] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts restbase2011.codfw.wmnet [12:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:46] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:29:04] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:29:18] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:29:18] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:31:03] (03PS1) 10Hnowlan: restbase: remove restbase2011 [puppet] - 10https://gerrit.wikimedia.org/r/757648 (https://phabricator.wikimedia.org/T299928) [12:34:09] (03PS14) 10Jbond: RepoSync: add new class to mana syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 [12:36:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19449 and previous config saved to /var/cache/conftool/dbconfig/20220127-123617-root.json [12:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19450 and previous config saved to /var/cache/conftool/dbconfig/20220127-123701-root.json [12:37:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:50] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:44:20] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 237, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:45:22] (03CR) 10Majavah: [C: 04-1] upgrade codfw1dev to bullseye (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/757647 (https://phabricator.wikimedia.org/T300254) (owner: 10Michael DiPietro) [12:50:03] (03PS14) 10D3r1ck01: Define a contact form for Chapter/Thorg application status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748120 (https://phabricator.wikimedia.org/T298024) [12:51:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19451 and previous config saved to /var/cache/conftool/dbconfig/20220127-125120-root.json [12:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:25] (03CR) 10Zabe: [C: 03+1] Do not set wgTrustedXffFile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749734 (https://phabricator.wikimedia.org/T298243) (owner: 10Urbanecm) [12:52:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19452 and previous config saved to /var/cache/conftool/dbconfig/20220127-125205-root.json [12:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:54] (03PS1) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: get_cluster_status: output in yaml format [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/757650 [12:55:28] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [12:55:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [12:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [12:55:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [12:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T298559)', diff saved to https://phabricator.wikimedia.org/P19453 and previous config saved to /var/cache/conftool/dbconfig/20220127-125538-marostegui.json [12:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:43] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [12:56:02] (03PS2) 10Filippo Giunchedi: service catalog: introduce 'page' field [puppet] - 10https://gerrit.wikimedia.org/r/757447 (https://phabricator.wikimedia.org/T291946) [12:56:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298559)', diff saved to https://phabricator.wikimedia.org/P19454 and previous config saved to /var/cache/conftool/dbconfig/20220127-125644-marostegui.json [12:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:42] (03PS3) 10Filippo Giunchedi: service catalog: introduce 'page' field [puppet] - 10https://gerrit.wikimedia.org/r/757447 
(https://phabricator.wikimedia.org/T291946) [13:00:12] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:43] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: toolforge: grid: get_cluster_status: output in yaml format [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/757650 (owner: 10Arturo Borrero Gonzalez) [13:02:42] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33474/console" [puppet] - 10https://gerrit.wikimedia.org/r/757447 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:04:21] (03PS4) 10Filippo Giunchedi: service catalog: introduce 'page' field [puppet] - 10https://gerrit.wikimedia.org/r/757447 (https://phabricator.wikimedia.org/T291946) [13:05:09] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:05:35] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:06:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19455 and previous config saved to /var/cache/conftool/dbconfig/20220127-130624-root.json [13:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:57] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 669561024048 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:07:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19456 and previous config saved to /var/cache/conftool/dbconfig/20220127-130708-root.json 
[13:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:37] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33475/console" [puppet] - 10https://gerrit.wikimedia.org/r/757447 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:09:55] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:10:01] (03PS15) 10Jbond: RepoSync: add new class to mana syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) [13:10:33] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [13:11:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P19457 and previous config saved to /var/cache/conftool/dbconfig/20220127-131148-marostegui.json [13:11:51] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [13:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:57] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 670433700048 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:16:41] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:19:51] (03CR) 10JMeybohm: [C: 03+2] Upgrade codfw kubernetes masters to tainted full nodes [puppet] - 10https://gerrit.wikimedia.org/r/757434 (https://phabricator.wikimedia.org/T290967) 
(owner: 10JMeybohm) [13:20:04] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33477/console" [puppet] - 10https://gerrit.wikimedia.org/r/757631 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [13:20:36] (03CR) 10Elukey: [V: 03+1 C: 03+1] kubernetes::master: Remove expose_puppet_certs parameter [puppet] - 10https://gerrit.wikimedia.org/r/757631 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [13:21:05] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:21:09] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:21:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19458 and previous config saved to /var/cache/conftool/dbconfig/20220127-132128-root.json [13:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19459 and previous config saved to /var/cache/conftool/dbconfig/20220127-132212-root.json [13:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:11] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [13:23:12] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [13:26:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance 
[13:26:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [13:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T285149)', diff saved to https://phabricator.wikimedia.org/P19460 and previous config saved to /var/cache/conftool/dbconfig/20220127-132624-marostegui.json [13:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:29] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [13:26:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P19461 and previous config saved to /var/cache/conftool/dbconfig/20220127-132653-marostegui.json [13:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:51] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:28:05] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [13:28:38] (03PS2) 10JMeybohm: Add k8s masters in codfw eBGP config [homer/public] - 
10https://gerrit.wikimedia.org/r/757437 (https://phabricator.wikimedia.org/T290967) [13:28:40] (03PS2) 10JMeybohm: Add k8s masters in eqiad eBGP config [homer/public] - 10https://gerrit.wikimedia.org/r/757438 (https://phabricator.wikimedia.org/T290967) [13:29:08] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster2002.codfw.wmnet [13:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:56] (03CR) 10JMeybohm: Add k8s masters in codfw eBGP config (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/757437 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [13:30:27] (03CR) 10JMeybohm: [C: 03+2] Add k8s masters in codfw eBGP config [homer/public] - 10https://gerrit.wikimedia.org/r/757437 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [13:31:00] (03Merged) 10jenkins-bot: Add k8s masters in codfw eBGP config [homer/public] - 10https://gerrit.wikimedia.org/r/757437 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [13:32:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.provision for host es1023.mgmt.eqiad.wmnet with reboot policy GRACEFUL [13:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:14] (03PS2) 10Muehlenhoff: Update hook to point to 6.8.23 packages [puppet] - 10https://gerrit.wikimedia.org/r/757632 [13:34:53] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [13:35:25] (03CR) 10Muehlenhoff: [C: 03+2] Update hook to point to 6.8.23 packages [puppet] - 10https://gerrit.wikimedia.org/r/757632 (owner: 10Muehlenhoff) [13:36:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19462 and previous config saved to /var/cache/conftool/dbconfig/20220127-133631-root.json [13:36:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T285149)', diff saved to https://phabricator.wikimedia.org/P19463 and previous config saved to /var/cache/conftool/dbconfig/20220127-133652-marostegui.json [13:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:57] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [13:37:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19464 and previous config saved to /var/cache/conftool/dbconfig/20220127-133715-root.json [13:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:14] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster2002.codfw.wmnet [13:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:03] (03CR) 10Elukey: [V: 03+1 C: 03+2] P:rsyslog::kafka_shipper: move Kafka TLS CA settings to the new bundle [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [13:41:51] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:41:52] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:41:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298559)', diff saved to https://phabricator.wikimedia.org/P19465 and previous config saved to /var/cache/conftool/dbconfig/20220127-134158-marostegui.json [13:42:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [13:42:01] !log marostegui@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [13:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:03] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [13:42:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [13:42:05] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:42:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [13:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T298559)', diff saved to https://phabricator.wikimedia.org/P19466 and previous config saved to /var/cache/conftool/dbconfig/20220127-134209-marostegui.json [13:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298559)', diff saved to https://phabricator.wikimedia.org/P19467 and previous config saved to /var/cache/conftool/dbconfig/20220127-134315-marostegui.json [13:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1023.mgmt.eqiad.wmnet with reboot policy GRACEFUL [13:43:22] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host 
kubemaster2001.codfw.wmnet [13:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:36] (03PS2) 10JMeybohm: Upgrade eqiad kubernetes masters to tainted full nodes [puppet] - 10https://gerrit.wikimedia.org/r/757615 (https://phabricator.wikimedia.org/T290967) [13:44:54] (03PS2) 10JMeybohm: kubernetes::master: Remove expose_puppet_certs parameter [puppet] - 10https://gerrit.wikimedia.org/r/757631 (https://phabricator.wikimedia.org/T290967) [13:45:21] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:45:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host es1023.eqiad.wmnet with OS bullseye [13:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:37] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:45:53] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:46:07] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:46:28] (03CR) 10Ssingh: [V: 03+1 C: 03+2] acme_chief: authorize doh600* hosts for Wikidough [puppet] - 10https://gerrit.wikimedia.org/r/757628 (https://phabricator.wikimedia.org/T300156) (owner: 10Ssingh) [13:46:33] !log imported elasticsearch-oss/kibana-oss/logstash-oss 6.8.23 to thirdparty/elastic68 for stretch and bullseye [13:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:37] PROBLEM - Disk space on kubemaster2002 is CRITICAL: DISK CRITICAL - 
/run/docker/netns/default is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubemaster2002&var-datasource=codfw+prometheus/ops [13:47:16] that's me [13:47:29] BGP & disk space stuff [13:47:48] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 8 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33478/console" [puppet] - 10https://gerrit.wikimedia.org/r/757615 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [13:48:00] (03PS4) 10Ssingh: site: add role for doh6001 (Wikidough drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/757636 (https://phabricator.wikimedia.org/T300156) [13:48:03] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:48:40] (03PS7) 10Jbond: cookbook sre.puppet.netbox: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) [13:50:19] (03Abandoned) 10Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - 10https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [13:50:30] (03PS1) 10Muehlenhoff: Add mapping for new 6.8 component [puppet] - 10https://gerrit.wikimedia.org/r/757656 [13:51:12] (03CR) 10DCausse: [C: 03+1] Add mapping for new 6.8 component [puppet] - 10https://gerrit.wikimedia.org/r/757656 (owner: 10Muehlenhoff) [13:51:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P19468 and previous config saved to /var/cache/conftool/dbconfig/20220127-135157-marostegui.json [13:52:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1028.eqiad.wmnet [13:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:28] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster2001.codfw.wmnet [13:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:34] (03CR) 10Muehlenhoff: [C: 03+2] Add mapping for new 6.8 component [puppet] - 10https://gerrit.wikimedia.org/r/757656 (owner: 10Muehlenhoff) [13:53:48] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 80, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:53:56] RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:54:34] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 113, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:55:27] (03PS1) 10Ssingh: hieradata: add drmrs to Wikidough and durum sites [puppet] - 10https://gerrit.wikimedia.org/r/757657 (https://phabricator.wikimedia.org/T300156) [13:56:06] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:56:28] (03CR) 10Ssingh: [C: 03+2] hieradata: add drmrs to Wikidough and durum sites [puppet] - 10https://gerrit.wikimedia.org/r/757657 (https://phabricator.wikimedia.org/T300156) (owner: 10Ssingh) [13:56:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1028.eqiad.wmnet [13:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P19469 and previous config saved to 
/var/cache/conftool/dbconfig/20220127-135820-marostegui.json [13:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:28] PROBLEM - Disk space on kubemaster2001 is CRITICAL: DISK CRITICAL - /run/docker/netns/default is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubemaster2001&var-datasource=codfw+prometheus/ops [14:01:47] (03CR) 10Muehlenhoff: [C: 03+2] Remove now obsolete template [puppet] - 10https://gerrit.wikimedia.org/r/757616 (owner: 10Muehlenhoff) [14:03:16] (03PS3) 10JMeybohm: Upgrade eqiad kubernetes masters to tainted full nodes [puppet] - 10https://gerrit.wikimedia.org/r/757615 (https://phabricator.wikimedia.org/T290967) [14:03:19] (03PS3) 10JMeybohm: kubernetes::master: Remove expose_puppet_certs parameter [puppet] - 10https://gerrit.wikimedia.org/r/757631 (https://phabricator.wikimedia.org/T290967) [14:03:20] (03PS1) 10JMeybohm: Fix nrpe_check_disk_options hiera key for kubernetes masters [puppet] - 10https://gerrit.wikimedia.org/r/757658 (https://phabricator.wikimedia.org/T290967) [14:03:47] (03PS1) 10Lucas Werkmeister (WMDE): Update termbox to 2022-01-25-175409-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/757659 (https://phabricator.wikimedia.org/T296202) [14:05:14] !log installing apache security updates [14:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:06] (03CR) 10Jbond: "ready for review" [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [14:06:27] (03PS16) 10Jbond: RepoSync: add new class to mana syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) [14:06:57] (03CR) 10JMeybohm: [C: 03+2] Fix nrpe_check_disk_options hiera key for kubernetes masters [puppet] - 
10https://gerrit.wikimedia.org/r/757658 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [14:07:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P19470 and previous config saved to /var/cache/conftool/dbconfig/20220127-140702-marostegui.json [14:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:17] RECOVERY - Disk space on kubemaster2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubemaster2001&var-datasource=codfw+prometheus/ops [14:09:20] RECOVERY - Disk space on kubemaster2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubemaster2002&var-datasource=codfw+prometheus/ops [14:09:38] (03PS1) 10Elukey: mediawiki::logging::yaml_defs: use wmf-certificates' bundle as CA cert [puppet] - 10https://gerrit.wikimedia.org/r/757661 (https://phabricator.wikimedia.org/T300130) [14:10:38] (03CR) 10Elukey: "My understanding is that the class is used only to populate helmfile defaults, lemme know if this is not the case :)" [puppet] - 10https://gerrit.wikimedia.org/r/757661 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [14:13:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P19471 and previous config saved to /var/cache/conftool/dbconfig/20220127-141324-marostegui.json [14:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1023.eqiad.wmnet with OS bullseye [14:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:26] PROBLEM - Check systemd state on doh6001 is CRITICAL: CRITICAL - 
degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:52] (03CR) 10Bking: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/757046 (https://phabricator.wikimedia.org/T295666) (owner: 10DCausse) [14:22:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T285149)', diff saved to https://phabricator.wikimedia.org/P19473 and previous config saved to /var/cache/conftool/dbconfig/20220127-142206-marostegui.json [14:22:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [14:22:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [14:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:12] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [14:22:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T285149)', diff saved to https://phabricator.wikimedia.org/P19474 and previous config saved to /var/cache/conftool/dbconfig/20220127-142214-marostegui.json [14:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1023 (T300006)', diff saved to https://phabricator.wikimedia.org/P19475 and previous config saved to /var/cache/conftool/dbconfig/20220127-142517-ladsgroup.json [14:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:22] T300006: Upgrade es5 to Bullseye - https://phabricator.wikimedia.org/T300006 [14:25:37] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode 
for new host ganeti1028.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [14:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298559)', diff saved to https://phabricator.wikimedia.org/P19476 and previous config saved to /var/cache/conftool/dbconfig/20220127-142829-marostegui.json [14:28:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [14:28:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [14:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:34] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [14:28:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [14:28:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [14:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T298559)', diff saved to https://phabricator.wikimedia.org/P19477 and previous config saved to /var/cache/conftool/dbconfig/20220127-142841-marostegui.json [14:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298559)', diff saved to 
https://phabricator.wikimedia.org/P19478 and previous config saved to /var/cache/conftool/dbconfig/20220127-143147-marostegui.json [14:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:18] 10SRE, 10Math, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: New mailing list for Wikimedia community group math - https://phabricator.wikimedia.org/T300239 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup Done. Create an account and go to https://lists.wikimedia.org/postorius/lists/math.lists.wikimedia.... [14:38:27] (03PS1) 10Kormat: wmfdb/db: Improve error reporting. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/757666 [14:39:29] !log added ganeti1028 to Ganeti eqiad cluster T293909 [14:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:34] T293909: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 [14:39:52] (03PS2) 10Kormat: wmfdb/db: Improve error reporting. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/757666 [14:39:56] (03CR) 10jerkins-bot: [V: 04-1] wmfdb/db: Improve error reporting. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/757666 (owner: 10Kormat) [14:40:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1023', diff saved to https://phabricator.wikimedia.org/P19479 and previous config saved to /var/cache/conftool/dbconfig/20220127-144022-ladsgroup.json [14:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:51] (03PS3) 10Kormat: wmfdb/db: Improve error reporting. 
[software/wmfdb] - 10https://gerrit.wikimedia.org/r/757666 [14:40:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1028.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [14:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:46] RECOVERY - Check systemd state on doh6001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:46:32] !log mmandere@cumin1001 START - Cookbook sre.ganeti.makevm for new host doh6002.wikimedia.org [14:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P19480 and previous config saved to /var/cache/conftool/dbconfig/20220127-144652-marostegui.json [14:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:24] (03CR) 10Ladsgroup: sre.mysql.upgrade: various improvements (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/754872 (https://phabricator.wikimedia.org/T239814) (owner: 10Volans) [14:49:49] (03PS4) 10Herron: centrallog: clean up old /srv/syslog/host directories after grace period [puppet] - 10https://gerrit.wikimedia.org/r/757498 (https://phabricator.wikimedia.org/T300056) [14:50:09] (03PS1) 10DCausse: eventgate-main: update image to 2022-01-27-143826-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/757667 (https://phabricator.wikimedia.org/T279541) [14:52:09] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-main: update image to 2022-01-27-143826-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/757667 (https://phabricator.wikimedia.org/T279541) (owner: 10DCausse) [14:52:39] herron: o/ as FYI I moved all rsyslog-kafka clients to the new ca bundle, everything looks good afaics, but lemme know if you see anything weird [14:53:00] elukey: great! 
thx for the heads up [14:53:45] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: NRodriguez uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T299336 (10jhathaway) @NRodriguez would you kindly send your ssh public key via google chat, or via phabricator with the Add Action Sign with MFA option when you po... [14:54:15] !log continuing deployments of eventgate-main and eventgate-analytics to pick up CA cert changes - T296064 (also deploying eventgate-main for a schema repo bump for search) [14:54:16] 10SRE, 10SRE-Access-Requests: Requesting update to SSH key and Kerberos for Joseph Seddon - https://phabricator.wikimedia.org/T299988 (10jhathaway) @Seddon thanks, I had missed that [14:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:20] T296064: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064 [14:54:42] 10SRE, 10SRE-Access-Requests: Requesting update to SSH key and Kerberos for Joseph Seddon - https://phabricator.wikimedia.org/T299988 (10jhathaway) [14:55:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1023', diff saved to https://phabricator.wikimedia.org/P19481 and previous config saved to /var/cache/conftool/dbconfig/20220127-145527-ladsgroup.json [14:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:02] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply on production [14:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:05] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply on canary [14:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:33] ottomata: ^ [14:57:39] (03CR) 10Herron: centrallog: clean up old /srv/syslog/host directories after grace period (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/757498 
(https://phabricator.wikimedia.org/T300056) (owner: 10Herron) [14:57:59] hmm the bot said done but helm was still asking me to confirm [14:58:10] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync on production [14:58:11] hm [14:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:56] oh there's this "canary" thing [14:59:20] !log mmandere@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh6002.wikimedia.org [14:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P19482 and previous config saved to /var/cache/conftool/dbconfig/20220127-150156-marostegui.json [15:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:04] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply on production [15:03:04] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply on canary [15:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:01] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync on canary [15:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:27] (03PS1) 10Btullis: Launch the script with a given process name [puppet] - 10https://gerrit.wikimedia.org/r/757668 (https://phabricator.wikimedia.org/T295733) [15:04:31] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync on production [15:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:09] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB 
template1 (host:localhost) 672076262960 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:05:20] (03Abandoned) 10Elukey: helmfile.d: add the istio pod security policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/746880 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [15:06:27] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33479/console" [puppet] - 10https://gerrit.wikimedia.org/r/757668 (https://phabricator.wikimedia.org/T295733) (owner: 10Btullis) [15:07:02] (03PS1) 10MMandere: install_server: Add drmrs doh second instance [puppet] - 10https://gerrit.wikimedia.org/r/757670 (https://phabricator.wikimedia.org/T300156) [15:07:13] (03CR) 10Michael DiPietro: upgrade codfw1dev to bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757647 (https://phabricator.wikimedia.org/T300254) (owner: 10Michael DiPietro) [15:07:15] !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply on canary [15:07:15] !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply on production [15:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:15] !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync on canary [15:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:35] (03PS5) 10Jbond: O:mail::mx: Add mx specific block list [puppet] - 10https://gerrit.wikimedia.org/r/757517 (https://phabricator.wikimedia.org/T270618) [15:09:41] !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync on production [15:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:04] (03CR) 10Ssingh: [C: 03+1] install_server: Add drmrs doh second instance 
[puppet] - 10https://gerrit.wikimedia.org/r/757670 (https://phabricator.wikimedia.org/T300156) (owner: 10MMandere) [15:10:15] ottomata: all done [15:10:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1023 (T300006)', diff saved to https://phabricator.wikimedia.org/P19483 and previous config saved to /var/cache/conftool/dbconfig/20220127-151032-ladsgroup.json [15:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:36] T300006: Upgrade es5 to Bullseye - https://phabricator.wikimedia.org/T300006 [15:10:51] (03CR) 10MMandere: [C: 03+2] install_server: Add drmrs doh second instance [puppet] - 10https://gerrit.wikimedia.org/r/757670 (https://phabricator.wikimedia.org/T300156) (owner: 10MMandere) [15:13:29] (03PS2) 10Michael DiPietro: upgrade codfw1dev to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/757647 (https://phabricator.wikimedia.org/T300254) [15:17:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298559)', diff saved to https://phabricator.wikimedia.org/P19484 and previous config saved to /var/cache/conftool/dbconfig/20220127-151701-marostegui.json [15:17:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [15:17:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [15:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:07] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [15:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T298559)', diff saved to https://phabricator.wikimedia.org/P19485 and previous config saved to 
/var/cache/conftool/dbconfig/20220127-151709-marostegui.json [15:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:02] (03CR) 10Volans: [C: 03+1] "LGTM, couple of very very minor nits inline, all optional." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 (owner: 10Jbond) [15:20:03] (03PS1) 10Elukey: eventstreams: move kafka config to new ca-bundle [deployment-charts] - 10https://gerrit.wikimedia.org/r/757672 (https://phabricator.wikimedia.org/T296064) [15:20:46] dcausse: yes thank you! [15:22:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T285149)', diff saved to https://phabricator.wikimedia.org/P19486 and previous config saved to /var/cache/conftool/dbconfig/20220127-152235-marostegui.json [15:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:40] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [15:27:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298559)', diff saved to https://phabricator.wikimedia.org/P19487 and previous config saved to /var/cache/conftool/dbconfig/20220127-152717-marostegui.json [15:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:28] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [15:33:57] (03PS1) 10JHathaway: seddon: add ssh key & set kerberos to true [puppet] - 10https://gerrit.wikimedia.org/r/757673 (https://phabricator.wikimedia.org/T299988) [15:34:45] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/757673 (https://phabricator.wikimedia.org/T299988) (owner: 10JHathaway) [15:35:03] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/757673 
(https://phabricator.wikimedia.org/T299988) (owner: 10JHathaway) [15:37:13] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventstreams: move kafka config to new ca-bundle [deployment-charts] - 10https://gerrit.wikimedia.org/r/757672 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [15:37:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P19488 and previous config saved to /var/cache/conftool/dbconfig/20220127-153739-marostegui.json [15:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:49] (03CR) 10JHathaway: [C: 03+2] seddon: add ssh key & set kerberos to true [puppet] - 10https://gerrit.wikimedia.org/r/757673 (https://phabricator.wikimedia.org/T299988) (owner: 10JHathaway) [15:40:15] (03CR) 10Jbond: O:mail::mx: Add mx specific block list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757517 (https://phabricator.wikimedia.org/T270618) (owner: 10Jbond) [15:40:34] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting update to SSH key and Kerberos for Joseph Seddon - https://phabricator.wikimedia.org/T299988 (10jhathaway) @Seddon your ssh key has been updated and your kerberos principal has been created, please check your email for details.
[15:42:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P19489 and previous config saved to /var/cache/conftool/dbconfig/20220127-154222-marostegui.json [15:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:08] 10SRE, 10SRE-Access-Requests, 10Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (10jhathaway) [15:44:11] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops, 10Patch-For-Review: Create Generalised blocking stratagy - https://phabricator.wikimedia.org/T270618 (10jbond) just made the below comment on a ticket which I thought may be useful to capture here to give some context as to what we have to... [15:45:05] 10SRE, 10SRE-Access-Requests, 10Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (10jhathaway) @Miriam & @Ottomata please approve [15:45:09] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply on production [15:45:09] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply on canary [15:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:36] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: sync on canary [15:45:37] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: sync on production [15:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:19] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops, 10Patch-For-Review: Create Generalised blocking stratagy - 
https://phabricator.wikimedia.org/T270618 (10jbond) >The intention of this CR is to slightly roll back that decision and exclude the MX hosts from the abuse_nets ferm context rules... [15:48:03] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: NRodriguez uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T299336 (10jhathaway) @NRodriguez confirmed their public key via gchat [15:48:14] (03CR) 10JHathaway: [C: 03+2] NRodriguez: add new production ssh key [puppet] - 10https://gerrit.wikimedia.org/r/757488 (https://phabricator.wikimedia.org/T299336) (owner: 10JHathaway) [15:48:25] (03PS2) 10JHathaway: NRodriguez: add new production ssh key [puppet] - 10https://gerrit.wikimedia.org/r/757488 (https://phabricator.wikimedia.org/T299336) [15:48:32] (03CR) 10JHathaway: [V: 03+2 C: 03+2] NRodriguez: add new production ssh key [puppet] - 10https://gerrit.wikimedia.org/r/757488 (https://phabricator.wikimedia.org/T299336) (owner: 10JHathaway) [15:49:26] 10SRE, 10SRE-Access-Requests, 10Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (10Miriam) Thanks @jhathaway, approved on my end! [15:49:33] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: NRodriguez uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T299336 (10jhathaway) @NRodriguez this change has been committed, should be ready to test in 30 or so minutes.
[15:50:57] jouncebot now [15:50:57] No deployments scheduled for the next 1 hour(s) and 9 minute(s) [15:51:00] jouncebot next [15:51:00] In 1 hour(s) and 8 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220127T1700) [15:52:21] !log train 1.38.0-wmf.19 (T293960): no current blockers; rolling train forward to group1 before log triage meeting [15:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:27] T293960: 1.38.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T293960 [15:52:32] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply on canary [15:52:33] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply on production [15:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P19490 and previous config saved to /var/cache/conftool/dbconfig/20220127-155244-marostegui.json [15:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:10] 10SRE, 10SRE-Access-Requests: Requesting access to AQS Cassandra cluster for Frances Goodwin - https://phabricator.wikimedia.org/T299688 (10jhathaway) poked both approvers [15:53:12] (03PS1) 10Elukey: helmfile.d: add circuit breaking settings for ml-serve's egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/757675 (https://phabricator.wikimedia.org/T294414) [15:53:19] (03CR) 10Cwhite: centrallog: clean up old /srv/syslog/host directories after grace period (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757498 (https://phabricator.wikimedia.org/T300056) (owner: 10Herron) [15:53:22] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to 
Superset for Margeigh Novotny - https://phabricator.wikimedia.org/T299072 (10jhathaway) [15:53:33] (03PS1) 10Brennen Bearnes: group1 wikis to 1.38.0-wmf.19 refs T293960 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757676 [15:53:35] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.38.0-wmf.19 refs T293960 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757676 (owner: 10Brennen Bearnes) [15:53:42] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: sync on production [15:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:26] 10SRE, 10SRE-Access-Requests: Requesting LDAP-only access to analytics-privatedata-users for Madalina Ana - https://phabricator.wikimedia.org/T299587 (10jhathaway) [15:54:37] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: sync on canary [15:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:54] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.19 refs T293960 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757676 (owner: 10Brennen Bearnes) [15:56:20] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.19 refs T293960 [15:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:04] brennen: I'm around, ping me if anything goes sideways [15:57:12] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.19 refs T293960 (duration: 00m 51s) [15:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P19491 and previous config saved to /var/cache/conftool/dbconfig/20220127-155726-marostegui.json [15:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:10] 10SRE, 10SRE-Access-Requests: 
Requesting LDAP-only access to analytics-privatedata-users for Madalina Ana - https://phabricator.wikimedia.org/T299587 (10jhathaway) a:03jhathaway [15:58:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [15:59:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [15:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [16:00:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [16:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:46] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:01:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [16:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:38] !log restarting blazegraph on wdqs1005 (jvm stuck for 2hours) [16:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [16:04:26] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 672716303920 and 0 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:04:48] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1005 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:07:10] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:07:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T285149)', diff saved to https://phabricator.wikimedia.org/P19492 and previous config saved to /var/cache/conftool/dbconfig/20220127-160749-marostegui.json [16:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:54] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [16:12:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298559)', diff saved to https://phabricator.wikimedia.org/P19493 and previous config saved to /var/cache/conftool/dbconfig/20220127-161231-marostegui.json [16:12:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [16:12:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [16:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:36] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [16:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T298559)', diff saved to https://phabricator.wikimedia.org/P19494 and previous config saved to 
/var/cache/conftool/dbconfig/20220127-161239-marostegui.json [16:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298559)', diff saved to https://phabricator.wikimedia.org/P19495 and previous config saved to /var/cache/conftool/dbconfig/20220127-161344-marostegui.json [16:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:38] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply on canary [16:14:38] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply on production [16:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:02] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: sync on canary [16:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:38] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: sync on production [16:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:54] (03CR) 10Btullis: [V: 03+1 C: 03+2] Launch the script with a given process name [puppet] - 10https://gerrit.wikimedia.org/r/757668 (https://phabricator.wikimedia.org/T295733) (owner: 10Btullis) [16:17:32] (03PS24) 10Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 [16:18:31] (03PS1) 10JHathaway: Add Madalina Ana to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/757678 (https://phabricator.wikimedia.org/T299587) [16:19:16] (03CR) 10JHathaway: "kindly review" [puppet] - 
10https://gerrit.wikimedia.org/r/757678 (https://phabricator.wikimedia.org/T299587) (owner: 10JHathaway) [16:19:28] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply on main [16:19:31] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply on canary [16:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:36] elukey: doing eventstreams ^^ [16:19:42] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:19:47] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0) [16:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:05] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: sync on main [16:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:16] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 525 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:20:24] 10SRE, 10SRE-Access-Requests: Requesting access to AQS Cassandra cluster for Frances Goodwin - https://phabricator.wikimedia.org/T299688 (10jhathaway) a:03jhathaway [16:20:29] ottomata: ack! 
[16:20:33] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init [16:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:39] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply on main [16:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:42] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply on canary [16:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:21] 10SRE, 10SRE-Access-Requests, 10Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (10Ottomata) Approved. @Miriam should this account have an expiry_date? [16:22:39] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: sync on main [16:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:50] 10SRE, 10SRE-Access-Requests, 10Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (10Ottomata) Also, I'm guessing this user will need Kerberos access too, correct? 
[16:23:14] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply on main [16:23:16] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply on canary [16:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:00] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99) [16:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:17] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: sync on main [16:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:27] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 673599983472 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:25:50] (03CR) 10Jbond: [C: 04-1] Add Madalina Ana to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757678 (https://phabricator.wikimedia.org/T299587) (owner: 10JHathaway) [16:26:26] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): spicerack: introduce GridEngine controller - https://phabricator.wikimedia.org/T300032 (10Volans) @aborrero thanks for opening this task! I had a chat with @jbond on what improvements we could make on our side to simply the integrati... [16:26:49] (03PS1) 10Btullis: Change the date at which the Movement Metrics tasks run [puppet] - 10https://gerrit.wikimedia.org/r/757679 (https://phabricator.wikimedia.org/T295733) [16:27:14] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Superset for Margeigh Novotny - https://phabricator.wikimedia.org/T299072 (10odimitrijevic) Approved. 
[16:27:31] 10SRE, 10SRE-Access-Requests: Requesting access to AQS Cassandra cluster for Frances Goodwin - https://phabricator.wikimedia.org/T299688 (10odimitrijevic) Approved. [16:27:31] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply on production [16:27:32] (03PS2) 10JHathaway: Add Madalina Ana to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/757678 (https://phabricator.wikimedia.org/T299587) [16:27:33] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply on canary [16:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:46] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: sync on production [16:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P19496 and previous config saved to /var/cache/conftool/dbconfig/20220127-162849-marostegui.json [16:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:55] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Superset for Margeigh Novotny - https://phabricator.wikimedia.org/T299072 (10jhathaway) [16:30:13] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/757678 (https://phabricator.wikimedia.org/T299587) (owner: 10JHathaway) [16:31:39] (03CR) 10JHathaway: Grant skvjold access to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/756708 (https://phabricator.wikimedia.org/T299072) (owner: 10JHathaway) [16:32:27] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 673599983472 and 0 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:32:30] (03CR) 10JHathaway: Grant skvjold access to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/756708 (https://phabricator.wikimedia.org/T299072) (owner: 10JHathaway) [16:34:53] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/756708 (https://phabricator.wikimedia.org/T299072) (owner: 10JHathaway) [16:35:42] (03PS2) 10JHathaway: Grant skvjold access to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/756708 (https://phabricator.wikimedia.org/T299072) [16:35:45] (03PS1) 10AOkoth: otrs: rename hieradata [labs/private] - 10https://gerrit.wikimedia.org/r/757681 (https://phabricator.wikimedia.org/T293942) [16:37:09] (03Abandoned) 10Hnowlan: postgres: increase number of WAL files retained by master [puppet] - 10https://gerrit.wikimedia.org/r/643717 (owner: 10Hnowlan) [16:39:00] (03CR) 10Volans: [C: 03+1] "LGTM, feel free to test it on netbox-next before merging if you want to be sure it works as expected."
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 (owner: 10Jbond) [16:39:26] (03CR) 10JHathaway: [C: 03+2] Grant skvjold access to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/756708 (https://phabricator.wikimedia.org/T299072) (owner: 10JHathaway) [16:43:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P19497 and previous config saved to /var/cache/conftool/dbconfig/20220127-164354-marostegui.json [16:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:39] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Superset for Margeigh Novotny - https://phabricator.wikimedia.org/T299072 (10jhathaway) @MNovotny_WMF you should be all set [16:45:27] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 673599983472 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:48:52] (Device rebooted) firing: Device rebooted - https://alerts.wikimedia.org [16:49:29] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 673599983472 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:49:31] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: product-analytics-movement-metrics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:50:21] btullis: ^ [16:50:27] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply on production [16:50:27] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply on canary [16:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:51:09] Ah sorry. I thought I downtimed it, but I got the individual unit. I missed the whole systemd. Please ignore it, I'll put it back. [16:51:30] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: sync on production [16:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:52] (Device rebooted) resolved: Device rebooted - https://alerts.wikimedia.org [16:56:05] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 673599983472 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:58:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298559)', diff saved to https://phabricator.wikimedia.org/P19498 and previous config saved to /var/cache/conftool/dbconfig/20220127-165859-marostegui.json [16:59:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [16:59:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [16:59:03] (03PS1) 10Btullis: Revert the change made for movement_metrics timer [puppet] - 10https://gerrit.wikimedia.org/r/757686 (https://phabricator.wikimedia.org/T295733) [16:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:04] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [16:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T298559)', diff saved to https://phabricator.wikimedia.org/P19499 and previous config saved to /var/cache/conftool/dbconfig/20220127-165907-marostegui.json [16:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:56] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on restbase1020.eqiad.wmnet with reason: Firmware upgrade [16:59:58] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on restbase1020.eqiad.wmnet with reason: Firmware upgrade [17:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] !log updating firmware ganeti1007 and ganeti1015 T299527 [17:00:05] jbond and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220127T1700). [17:00:05] RoanKattouw: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:06] 10SRE, 10SRE-Access-Requests: Requesting access to AQS Cassandra cluster for Frances Goodwin - https://phabricator.wikimedia.org/T299688 (10WDoranWMF) Approved. 
[17:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:10] T299527: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 [17:00:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298559)', diff saved to https://phabricator.wikimedia.org/P19500 and previous config saved to /var/cache/conftool/dbconfig/20220127-170013-marostegui.json [17:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:19] RoanKattouw: 👋 looking [17:00:22] (03CR) 10Btullis: [C: 03+2] Revert the change made for movement_metrics timer [puppet] - 10https://gerrit.wikimedia.org/r/757686 (https://phabricator.wikimedia.org/T295733) (owner: 10Btullis) [17:00:34] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: sync on canary [17:00:35] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1020.eqiad.wmnet [17:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:11] (03CR) 10Andrew Bogott: [C: 03+1] "this looks good to me -- I agree that we should simply drop all python2 references until we find a case where that breaks something." [puppet] - 10https://gerrit.wikimedia.org/r/757647 (https://phabricator.wikimedia.org/T300254) (owner: 10Michael DiPietro) [17:01:15] !log updating firmware restbase1020 T299652 [17:01:18] RoanKattouw: can you get a +1 from someone more familiar please? 
I'm happy to deploy but I don't want to be the only reviewer on this :) [17:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:20] T299652: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 [17:01:31] RoanKattouw: (no need to get it done within the 30min window, ping me any time) [17:02:55] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 673623772720 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:02:57] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:27] ACKNOWLEDGEMENT - Device not healthy -SMART- on restbase2010 is CRITICAL: cluster=restbase device={sde,sdf,sdg,sdh,sdi,sdj} instance=restbase2010 job=node site=codfw Hnowlan Devices not part of filesystems, host to be decommissioned https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase2010&var-datasource=codfw+prometheus/ops [17:07:29] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 674250804784 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:07:32] (03CR) 10Dzahn: [C: 03+1] "compared in private repo, good to go!" 
[labs/private] - https://gerrit.wikimedia.org/r/757681 (https://phabricator.wikimedia.org/T293942) (owner: AOkoth) [17:08:16] (CR) AOkoth: [C: +2] otrs: rename hieradata [labs/private] - https://gerrit.wikimedia.org/r/757681 (https://phabricator.wikimedia.org/T293942) (owner: AOkoth) [17:08:32] (CR) AOkoth: [V: +2 C: +2] otrs: rename hieradata [labs/private] - https://gerrit.wikimedia.org/r/757681 (https://phabricator.wikimedia.org/T293942) (owner: AOkoth) [17:08:41] (CR) Dzahn: [V: +2 C: +1] otrs: rename hieradata [labs/private] - https://gerrit.wikimedia.org/r/757681 (https://phabricator.wikimedia.org/T293942) (owner: AOkoth) [17:08:59] (CR) Giuseppe Lavagetto: [C: +1] Do not set wgTrustedXffFile [mediawiki-config] - https://gerrit.wikimedia.org/r/749734 (https://phabricator.wikimedia.org/T298243) (owner: Urbanecm) [17:09:05] (CR) Volans: "Did a first pass, I skipped the test file, I'll get back to it later. Most comments are nits/typos" [software/spicerack] - https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: Jbond) [17:09:37] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:10:09] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:13:53] (PS4) Dzahn: otrs: rename profile to vrts [puppet] - https://gerrit.wikimedia.org/r/757519 (https://phabricator.wikimedia.org/T293942) (owner: AOkoth) [17:14:55] SRE, ops-codfw, ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (Cmjohnson) [17:15:11] SRE, ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (Cmjohnson) [17:15:15] !log hnowlan@puppetmaster1001
conftool action : set/pooled=yes; selector: name=restbase1020.eqiad.wmnet [17:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P19501 and previous config saved to /var/cache/conftool/dbconfig/20220127-171518-marostegui.json [17:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:29] SRE, ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (Cmjohnson) @MoritzMuehlenhoff both 1007/1015 are updated [17:16:30] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 674901036592 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:17:17] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on restbase1021.eqiad.wmnet with reason: Firmware upgrade [17:17:19] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on restbase1021.eqiad.wmnet with reason: Firmware upgrade [17:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:39] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1021.eqiad.wmnet [17:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:43] !log updating firmware restbase1021 T299652 [17:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:47] T299652: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 [17:21:55] (CR) Dzahn: [C: +1] "yes, reviewed and compiled.
currently puppet is broken on otrs1001 and this change fixes it" [puppet] - https://gerrit.wikimedia.org/r/757519 (https://phabricator.wikimedia.org/T293942) (owner: AOkoth) [17:21:57] (CR) Volans: "Did a first pass, nothing major, just nits." [cookbooks] - https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: Jbond) [17:22:20] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init [17:22:22] (CR) AOkoth: [C: +2] otrs: rename profile to vrts [puppet] - https://gerrit.wikimedia.org/r/757519 (https://phabricator.wikimedia.org/T293942) (owner: AOkoth) [17:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:12] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1003/33481/" [puppet] - https://gerrit.wikimedia.org/r/757519 (https://phabricator.wikimedia.org/T293942) (owner: AOkoth) [17:24:27] SRE, SRE-Access-Requests: Requesting access to AQS Cassandra cluster for Frances Goodwin - https://phabricator.wikimedia.org/T299688 (jhathaway) [17:29:31] SRE, SRE-Access-Requests: Requesting access to AQS Cassandra cluster for Frances Goodwin - https://phabricator.wikimedia.org/T299688 (jhathaway) ssh key confirmed via gchat [17:30:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P19502 and previous config saved to /var/cache/conftool/dbconfig/20220127-173022-marostegui.json [17:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:01] rzl: Sure, I'll ask Scott if he's willing to +1 again [17:31:35] RoanKattouw: thanks!
sorry for the extra runaround [17:33:41] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1021.eqiad.wmnet [17:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:57] (CR) Catrope: doc.wikimedia.org CSP: Also allow images from upload.wikimedia.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/757049 (https://phabricator.wikimedia.org/T285570) (owner: Catrope) [17:33:58] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on restbase1022.eqiad.wmnet with reason: Firmware upgrade [17:34:00] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on restbase1022.eqiad.wmnet with reason: Firmware upgrade [17:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:10] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1022.eqiad.wmnet [17:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:43] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 675311387056 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:34:50] (PS1) Ladsgroup: Revert "es1023: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/757483 [17:34:56] (PS2) Ladsgroup: Revert "es1023: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/757483 [17:35:35] (CR) Ladsgroup: [V: +2 C: +2] Revert "es1023: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/757483 (owner: Ladsgroup) [17:37:06] (PS1) JHathaway: Add Frances Goodwin to aqs-admins [puppet] - https://gerrit.wikimedia.org/r/757691 (https://phabricator.wikimedia.org/T299688) [17:39:17] ops-drmrs, DC-Ops, Infrastructure-Foundations, netops: Q3:(Need
By: TBD) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (RobH) p:Triage→High [17:39:27] ops-drmrs, DC-Ops, Infrastructure-Foundations, netops: Q3:(Need By: TBD) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (RobH) [17:41:33] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 675311387056 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:45:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298559)', diff saved to https://phabricator.wikimedia.org/P19503 and previous config saved to /var/cache/conftool/dbconfig/20220127-174527-marostegui.json [17:45:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [17:45:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [17:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:33] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [17:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [17:45:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [17:45:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance [17:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:44] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Maintenance [17:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [17:45:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [17:45:58] brennen: jeena: fyi https://phabricator.wikimedia.org/T299289#7657159 [17:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T298559)', diff saved to https://phabricator.wikimedia.org/P19504 and previous config saved to /var/cache/conftool/dbconfig/20220127-174606-marostegui.json [17:46:07] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 675311387056 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:34] (CR)
RLazarus: [C: +1] Add Frances Goodwin to aqs-admins [puppet] - https://gerrit.wikimedia.org/r/757691 (https://phabricator.wikimedia.org/T299688) (owner: JHathaway) [17:47:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298559)', diff saved to https://phabricator.wikimedia.org/P19505 and previous config saved to /var/cache/conftool/dbconfig/20220127-174712-marostegui.json [17:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:31] (CR) JHathaway: [C: +2] Add Frances Goodwin to aqs-admins [puppet] - https://gerrit.wikimedia.org/r/757691 (https://phabricator.wikimedia.org/T299688) (owner: JHathaway) [17:47:33] SRE, Traffic-Icebox, serviceops: Use Envoy instead of nginx for TLS termination on Appservers - https://phabricator.wikimedia.org/T240576 (RLazarus) Open→Resolved a:RLazarus Good news! This is long since done, tidying it up. [17:47:51] (CR) JHathaway: [C: +2] Add Madalina Ana to analytics-privatedata-users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/757678 (https://phabricator.wikimedia.org/T299587) (owner: JHathaway) [17:48:16] taavi: thanks.
[17:48:30] (CR) Michael DiPietro: [C: +2] upgrade codfw1dev to bullseye [puppet] - https://gerrit.wikimedia.org/r/757647 (https://phabricator.wikimedia.org/T300254) (owner: Michael DiPietro) [17:48:32] it doesn't affect group2, not sure if rollback worthy [17:51:04] (PS3) JHathaway: Add Madalina Ana to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/757678 (https://phabricator.wikimedia.org/T299587) [17:51:32] (CR) SBassett: [C: +1] doc.wikimedia.org CSP: Also allow images from upload.wikimedia.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/757049 (https://phabricator.wikimedia.org/T285570) (owner: Catrope) [17:52:39] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1022.eqiad.wmnet [17:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:52] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on restbase1023.eqiad.wmnet with reason: Firmware upgrade [17:52:52] (CR) JHathaway: [C: +2] Add Madalina Ana to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/757678 (https://phabricator.wikimedia.org/T299587) (owner: JHathaway) [17:52:53] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on restbase1023.eqiad.wmnet with reason: Firmware upgrade [17:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:12] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1023.eqiad.wmnet [17:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:51] SRE, Infrastructure-Foundations, fundraising-tech-ops, netops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (Papaul) I chat with @Jgreen in IRC we will be doing the upgrade next week on Tuesday the 1st
at 10:30 CT [17:55:18] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to AQS Cassandra cluster for Frances Goodwin - https://phabricator.wikimedia.org/T299688 (jhathaway) @FGoodwin this should now be setup, please give it a go! [17:56:03] SRE, SRE-Access-Requests, Patch-For-Review: Requesting LDAP-only access to analytics-privatedata-users for Madalina Ana - https://phabricator.wikimedia.org/T299587 (jhathaway) @Madalina this should now be setup, please give it a try [17:57:02] (PS6) Krinkle: Improve error message if wikiversions.php has wrong format [mediawiki-config] - https://gerrit.wikimedia.org/r/622408 (owner: Ahmon Dancy) [17:57:22] (PS7) Krinkle: multiversion: Improve error message if wikiversions.php has wrong format [mediawiki-config] - https://gerrit.wikimedia.org/r/622408 (owner: Ahmon Dancy) [17:57:29] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 675311387056 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:57:40] (CR) Krinkle: [C: +1] "Good to go." [mediawiki-config] - https://gerrit.wikimedia.org/r/622408 (owner: Ahmon Dancy) [17:58:17] (PS1) Brennen Bearnes: Revert "Escape various messages in WikibaseMediaInfo" [extensions/WikibaseMediaInfo] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/757485 (https://phabricator.wikimedia.org/T299289) [17:58:38] (CR) Krinkle: [C: -1] "Per chat (clearing from review queue)."
[mediawiki-config] - https://gerrit.wikimedia.org/r/756718 (owner: Ahmon Dancy) [17:58:44] (PS3) Gergő Tisza: GrowthExperiments: Start add image experiment for desktop users [mediawiki-config] - https://gerrit.wikimedia.org/r/752657 (https://phabricator.wikimedia.org/T298122) (owner: Kosta Harlan) [17:59:30] (PS1) JMeybohm: Deploy istio-ingressgateway as daemonset [deployment-charts] - https://gerrit.wikimedia.org/r/757696 (https://phabricator.wikimedia.org/T290966) [18:00:04] chrisalbon and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220127T1800). [18:00:22] (CR) JMeybohm: "߷" [deployment-charts] - https://gerrit.wikimedia.org/r/757696 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm) [18:02:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P19506 and previous config saved to /var/cache/conftool/dbconfig/20220127-180217-marostegui.json [18:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:47] (CR) Elukey: [C: +1] Deploy istio-ingressgateway as daemonset [deployment-charts] - https://gerrit.wikimedia.org/r/757696 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm) [18:06:03] RoanKattouw: thanks -- let me know when's a good time, I'll deploy and you can test [18:06:39] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 675789368880 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:06:57] rzl: Ready whenever you are [18:07:03] 👍 [18:07:12] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1023.eqiad.wmnet [18:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:17] !log
hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on restbase1024.eqiad.wmnet with reason: Firmware upgrade [18:07:19] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on restbase1024.eqiad.wmnet with reason: Firmware upgrade [18:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:21] (CR) RLazarus: [C: +2] doc.wikimedia.org CSP: Also allow images from upload.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/757049 (https://phabricator.wikimedia.org/T285570) (owner: Catrope) [18:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:34] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1024.eqiad.wmnet [18:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:40] RoanKattouw: done, should be at all doc* hosts [18:10:41] (PS1) Arturo Borrero Gonzalez: toolforge: automated-tests: add basic python webservice grid test [puppet] - https://gerrit.wikimedia.org/r/757697 [18:13:28] (PS1) Bking: deployment-prep: add cergen config for elastic service [labs/private] - https://gerrit.wikimedia.org/r/757699 (https://phabricator.wikimedia.org/T299797) [18:13:46] rzl: It's working, thanks!
[18:13:51] (PS3) Clare Ming: Update config for idwiki: [mediawiki-config] - https://gerrit.wikimedia.org/r/757500 (https://phabricator.wikimedia.org/T299676) [18:13:51] \o/ [18:15:05] (CR) Majavah: deployment-prep: add cergen config for elastic service (2 comments) [labs/private] - https://gerrit.wikimedia.org/r/757699 (https://phabricator.wikimedia.org/T299797) (owner: Bking) [18:16:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [18:16:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [18:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T285149)', diff saved to https://phabricator.wikimedia.org/P19507 and previous config saved to /var/cache/conftool/dbconfig/20220127-181656-marostegui.json [18:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:01] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [18:17:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P19508 and previous config saved to /var/cache/conftool/dbconfig/20220127-181722-marostegui.json [18:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:29] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 676650512232 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:20:11] going to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseMediaInfo/+/757485 [18:20:33] (CR) Brennen Bearnes: [C: +2] Revert
"Escape various messages in WikibaseMediaInfo" [extensions/WikibaseMediaInfo] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/757485 (https://phabricator.wikimedia.org/T299289) (owner: Brennen Bearnes) [18:20:34] !log mdipietro@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol2001-dev.wikimedia.org with OS bullseye [18:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:35] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1024.eqiad.wmnet [18:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:21] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on restbase[1025-1027].eqiad.wmnet with reason: Firmware upgrade [18:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:25] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on restbase[1025-1027].eqiad.wmnet with reason: Firmware upgrade [18:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:35] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase102[5-7].eqiad.wmnet [18:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:08] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply on production [18:25:08] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply on canary [18:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:11] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply on production [18:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:28] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: sync on
canary [18:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T285149)', diff saved to https://phabricator.wikimedia.org/P19509 and previous config saved to /var/cache/conftool/dbconfig/20220127-182627-marostegui.json [18:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:32] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [18:27:39] SRE, ops-drmrs, DC-Ops, Infrastructure-Foundations, netops: Q3:(Need By: TBD) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (RobH) [18:28:38] SRE, ops-drmrs, DC-Ops, Infrastructure-Foundations, netops: Q3:(Need By: TBD) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (RobH) I've drafted the directions for remote hands, translating the above diagram to a step by step direction for them to rack our routers...
[18:29:52] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 676743411344 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:30:40] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:32:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298559)', diff saved to https://phabricator.wikimedia.org/P19510 and previous config saved to /var/cache/conftool/dbconfig/20220127-183226-marostegui.json [18:32:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [18:32:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [18:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:31] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [18:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T298559)', diff saved to https://phabricator.wikimedia.org/P19511 and previous config saved to /var/cache/conftool/dbconfig/20220127-183234-marostegui.json [18:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298559)', diff saved to https://phabricator.wikimedia.org/P19512 and previous config saved to /var/cache/conftool/dbconfig/20220127-183340-marostegui.json [18:33:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:45] (Merged) jenkins-bot: Revert "Escape various messages in WikibaseMediaInfo" [extensions/WikibaseMediaInfo] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/757485 (https://phabricator.wikimedia.org/T299289) (owner: Brennen Bearnes) [18:38:26] SRE, ops-codfw, ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (Cmjohnson) [18:40:26] SRE, ops-drmrs, DC-Ops, Infrastructure-Foundations, netops: Q3:(Need By: TBD) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (RobH) [18:40:37] (PS1) Ryan Kemper: elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700 [18:41:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P19513 and previous config saved to /var/cache/conftool/dbconfig/20220127-184132-marostegui.json [18:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:44] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase102[5-7].eqiad.wmnet [18:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:15] (CR) Ryan Kemper: elastic: install elasticsearch-oss from component (3 comments) [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper) [18:42:20] !log brennen@deploy1002 Synchronized php-1.38.0-wmf.19/extensions/WikibaseMediaInfo: Backport: [[gerrit:757485|Revert "Escape various messages in WikibaseMediaInfo" (T299289)]] (duration: 00m 52s) [18:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:32] (CR) jerkins-bot: [V: -1] elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper) [18:43:06]
(Abandoned) Ryan Kemper: elasticsearch: fix package dependency issue [puppet] - https://gerrit.wikimedia.org/r/753985 (https://phabricator.wikimedia.org/T299177) (owner: Bking) [18:43:07] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply on production [18:43:08] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply on canary [18:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:43:25] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: sync on canary [18:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:15] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: sync on production [18:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:47:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P19514 and previous config saved to /var/cache/conftool/dbconfig/20220127-184845-marostegui.json [18:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[18:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:53] (PS1) Michael DiPietro: upgrade codfw1dev to bullseye [puppet] - https://gerrit.wikimedia.org/r/757702 (https://phabricator.wikimedia.org/T300254) [18:51:39] SRE, ops-codfw, DC-Ops: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (Cmjohnson) a:Cmjohnson→Papaul [18:51:46] SRE, ops-drmrs, DC-Ops, Infrastructure-Foundations, netops: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (RobH) [18:51:50] SRE, ops-codfw, DC-Ops: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (Cmjohnson) assigned this to @papaul for codfw portion of the task, removed the ops-eqiad. [18:52:09] SRE, ops-drmrs, DC-Ops, Infrastructure-Foundations, netops: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (RobH) a:RobH→ayounsi [18:55:14] SRE, ops-eqiad, decommission-hardware, Patch-For-Review, cloud-services-team (Hardware): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (Cmjohnson) @Andrew How is this looking so far?
[18:55:31] (CR) Andrew Bogott: [C: +1] upgrade codfw1dev to bullseye [puppet] - https://gerrit.wikimedia.org/r/757702 (https://phabricator.wikimedia.org/T300254) (owner: Michael DiPietro)
[18:56:21] SRE, ops-eqiad, DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (Cmjohnson)
[18:56:23] (PS1) Andrew Bogott: cloudmetrics rsync: run on the half hour rather than on the hour [puppet] - https://gerrit.wikimedia.org/r/757703 (https://phabricator.wikimedia.org/T300138)
[18:56:35] (CR) Michael DiPietro: [C: +2] upgrade codfw1dev to bullseye [puppet] - https://gerrit.wikimedia.org/r/757702 (https://phabricator.wikimedia.org/T300254) (owner: Michael DiPietro)
[18:56:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P19515 and previous config saved to /var/cache/conftool/dbconfig/20220127-185637-marostegui.json
[18:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:45] SRE, ops-eqiad, DC-Ops: Rack msw2-eqiad in new cage - https://phabricator.wikimedia.org/T298980 (Cmjohnson) Open→Resolved The switch has been relocated by @Jclark-ctr
[18:59:35] (CR) Andrew Bogott: [C: +2] hieradata: add new bullseye eqiad1 bastions [puppet] - https://gerrit.wikimedia.org/r/757397 (owner: Majavah)
[19:00:04] RoanKattouw and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220127T1900).
[19:00:04] tgr: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[19:00:31] (CR) Ebernhardson: elastic: install elasticsearch-oss from component (4 comments) [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[19:01:51] SRE, ops-eqiad, decommission-hardware, Patch-For-Review, cloud-services-team (Hardware): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (Andrew) so far so good!
[19:02:46] o/
[19:03:01] I can self-serve
[19:03:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P19516 and previous config saved to /var/cache/conftool/dbconfig/20220127-190349-marostegui.json
[19:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:05:20] (CR) Gergő Tisza: [C: +2] GrowthExperiments: Start add image experiment for desktop users [mediawiki-config] - https://gerrit.wikimedia.org/r/752657 (https://phabricator.wikimedia.org/T298122) (owner: Kosta Harlan)
[19:06:03] (Merged) jenkins-bot: GrowthExperiments: Start add image experiment for desktop users [mediawiki-config] - https://gerrit.wikimedia.org/r/752657 (https://phabricator.wikimedia.org/T298122) (owner: Kosta Harlan)
[19:07:42] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 677504224816 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:09:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:12] (PS1) Cmjohnson: Updating netboot.cfg to reflect change for cloudbackup1003[4] [puppet] - https://gerrit.wikimedia.org/r/757705 (https://phabricator.wikimedia.org/T293934)
[19:10:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:10:35] !log mwdebug-deploy@deploy1002
helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:54] (PS2) Andrew Bogott: cloudmetrics rsync: run on the half hour rather than on the hour [puppet] - https://gerrit.wikimedia.org/r/757703 (https://phabricator.wikimedia.org/T300138)
[19:11:02] (CR) Cmjohnson: [C: +2] Updating netboot.cfg to reflect change for cloudbackup1003[4] [puppet] - https://gerrit.wikimedia.org/r/757705 (https://phabricator.wikimedia.org/T293934) (owner: Cmjohnson)
[19:11:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T285149)', diff saved to https://phabricator.wikimedia.org/P19517 and previous config saved to /var/cache/conftool/dbconfig/20220127-191141-marostegui.json
[19:11:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:46] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[19:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:50] (CR) Dzahn: [C: +2] httpbb: move tests for static-bugzilla to new file for miscweb-k8s [puppet] - https://gerrit.wikimedia.org/r/757505 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn)
[19:13:14] (CR) Andrew Bogott: [C: +2] cloudmetrics rsync: run on the half hour rather than on the hour [puppet] - https://gerrit.wikimedia.org/r/757703 (https://phabricator.wikimedia.org/T300138) (owner: Andrew Bogott)
[19:13:32] in before merge conflict
[19:16:43] (PS1) Michael DiPietro: upgrade codfw1dev to bullseye [puppet] - https://gerrit.wikimedia.org/r/757729 (https://phabricator.wikimedia.org/T300254)
[19:16:45] (PS1) DLynch: Launch
DiscussionTools new topic tool a/b test [mediawiki-config] - https://gerrit.wikimedia.org/r/757730 (https://phabricator.wikimedia.org/T291308)
[19:16:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:16:48] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 678217648432 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:18:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:09] (CR) Andrew Bogott: [C: +1] "If that's where it is, then that's where it is" [puppet] - https://gerrit.wikimedia.org/r/757729 (https://phabricator.wikimedia.org/T300254) (owner: Michael DiPietro)
[19:18:31] (CR) Michael DiPietro: [C: +2] upgrade codfw1dev to bullseye [puppet] - https://gerrit.wikimedia.org/r/757729 (https://phabricator.wikimedia.org/T300254) (owner: Michael DiPietro)
[19:18:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298559)', diff saved to https://phabricator.wikimedia.org/P19518 and previous config saved to /var/cache/conftool/dbconfig/20220127-191854-marostegui.json
[19:18:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[19:18:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[19:18:59] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:59] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[19:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T298559)', diff saved to https://phabricator.wikimedia.org/P19519 and previous config saved to /var/cache/conftool/dbconfig/20220127-191902-marostegui.json
[19:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:13] tgr: hey! Can you let me know when you're done with deployment?
[19:19:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:33] Is the backport window still sufficiently open that I could slip a config patch into it, or should I go sign up for the next window?
[19:20:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298559)', diff saved to https://phabricator.wikimedia.org/P19520 and previous config saved to /var/cache/conftool/dbconfig/20220127-192009-marostegui.json
[19:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:41] Kemayo: i feel like it should be possible to do your patch now (once tgr finishes)
[19:21:10] urbanecm: Nifty. In that case, it's https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/757730
[19:21:33] Kemayo: noted. Can you add it to the calendar too? :))
[19:23:44] urbanecm: Okay, it's added.
[19:25:58] (CR) Dzahn: "old tests fixed.
new tests work on k8s: [deploy1002:~] $ httpbb /srv/deployment/httpbb-tests/miscweb/test_miscweb-k8s.yaml --hosts miscweb" [puppet] - https://gerrit.wikimedia.org/r/757505 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn)
[19:26:14] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 678217648656 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:27:56] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:752657|GrowthExperiments: Start add image experiment for desktop users (T298122)]] (duration: 00m 51s)
[19:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:28:02] T298122: Add an image: experiment (desktop) - https://phabricator.wikimedia.org/T298122
[19:28:17] > 19:27:37 Check 'Logstash Error rate for mw1450.eqiad.wmnet' failed: ERROR: 50% OVER_THRESHOLD (Avg. Error rate: Before: 0.03, After: 2.00, Threshold: 1.00)
[19:28:25] sounds foreboding
[19:28:47] (CR) Dzahn: "Jaime, the role that used this fileset has been removed. static-bugzilla moved from ganeti/puppet to kubernetes. The data is stored in a s" [puppet] - https://gerrit.wikimedia.org/r/757509 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn)
[19:29:33] (Abandoned) Dzahn: miscweb: bump version to 2022-01-25-150544-production [deployment-charts] - https://gerrit.wikimedia.org/r/757060 (owner: Dzahn)
[19:30:02] would be nice to have a logstash link in that message.
[19:30:34] probably :)
[19:30:52] but if it's the only server with higher rate, i'd say it's fine
[19:30:54] it can be a temp fluke
[19:30:55] the mediawiki-errors dashboard at least gives 0 errors for that host.
[19:31:00] good
[19:31:54] I'll call it done
[19:32:09] okay
[19:32:12] so can i take over tgr now?
[19:32:25] yes, thx
[19:32:37] Kemayo: hey! We can do your patch now. Still around?
[19:32:54] urbanecm: I am ready
[19:33:05] great!
I'll start and let you know when ready for testing
[19:33:10] we do have a metric ton of 'DBConnRef::numRows was deprecated' errors on other hosts - looks like that deprecation was a bit premature
[19:33:57] (CR) Urbanecm: [C: +2] Launch DiscussionTools new topic tool a/b test [mediawiki-config] - https://gerrit.wikimedia.org/r/757730 (https://phabricator.wikimedia.org/T291308) (owner: DLynch)
[19:34:40] (Merged) jenkins-bot: Launch DiscussionTools new topic tool a/b test [mediawiki-config] - https://gerrit.wikimedia.org/r/757730 (https://phabricator.wikimedia.org/T291308) (owner: DLynch)
[19:34:57] Kemayo: it's at mwdebug1001, can you test please?
[19:35:05] Sure thing, one second.
[19:35:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P19521 and previous config saved to /var/cache/conftool/dbconfig/20220127-193514-marostegui.json
[19:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:41] SRE, Codex, WVUI, ContentSecurityPolicy, SecTeam-Processed: WVUI and Codex demos: CSP stopping typeahead input demos working - https://phabricator.wikimedia.org/T285570 (Jdforrester-WMF) Open→Resolved Success: {F34933393, size=full}
[19:36:14] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:36:18] !log purging font* / xfont* packages from further eqiad appservers (mw14*) for T294378
[19:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:22] T294378: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378
[19:37:31] urbanecm: Okay, looks good! (Sorry, it took a second because it had logged in and logged out bits to test.)
[19:37:39] no problem :)
[19:37:41] syncing
[19:38:52] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 2c8561c1c0aa6b4f5f8202972b7b28723337e88e: Launch DiscussionTools new topic tool a/b test (T291308) (duration: 00m 51s)
[19:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:57] T291308: Make config change to start New Discussion Tool A/B Test - https://phabricator.wikimedia.org/T291308
[19:39:17] Kemayo: should be live!
[19:39:18] anything else?
[19:39:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:02] urbanecm: nothing else, thanks!
[19:40:12] np
[19:40:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:40:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:41:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:42:12] !log purging font packages from wtp* (parsoid eqiad)
[19:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:49] !log purging font packages from parse* (parsoid codfw)
[19:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:45:09] SRE, serviceops, Patch-For-Review: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 (Dzahn) purged from all of parsoid (wtp* and parse*) and the rest of eqiad (mw14*)
[19:46:07] Puppet, Infrastructure-Foundations, Release-Engineering-Team,
User-brennen: logspam-watch: sorting by message (column 6) appears broken - https://phabricator.wikimedia.org/T300298 (brennen)
[19:47:07] (PS2) Urbanecm: Do not set wgTrustedXffFile [mediawiki-config] - https://gerrit.wikimedia.org/r/749734 (https://phabricator.wikimedia.org/T298243)
[19:47:37] (CR) Urbanecm: [C: +2] Do not set wgTrustedXffFile [mediawiki-config] - https://gerrit.wikimedia.org/r/749734 (https://phabricator.wikimedia.org/T298243) (owner: Urbanecm)
[19:48:04] (PS1) Jdlrobson: Enable migration mode on all group 0, group 1 and desktop-improvement wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/757733 (https://phabricator.wikimedia.org/T299927)
[19:48:06] (PS1) Jdlrobson: Migration mode enabled everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/757734 (https://phabricator.wikimedia.org/T299927)
[19:48:08] (PS1) Jdlrobson: Disable A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/757735 (https://phabricator.wikimedia.org/T297924)
[19:48:17] (Merged) jenkins-bot: Do not set wgTrustedXffFile [mediawiki-config] - https://gerrit.wikimedia.org/r/749734 (https://phabricator.wikimedia.org/T298243) (owner: Urbanecm)
[19:49:04] (CR) jerkins-bot: [V: -1] Enable migration mode on all group 0, group 1 and desktop-improvement wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/757733 (https://phabricator.wikimedia.org/T299927) (owner: Jdlrobson)
[19:49:14] (CR) jerkins-bot: [V: -1] Migration mode enabled everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/757734 (https://phabricator.wikimedia.org/T299927) (owner: Jdlrobson)
[19:49:24] (CR) jerkins-bot: [V: -1] Disable A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/757735 (https://phabricator.wikimedia.org/T297924) (owner: Jdlrobson)
[19:50:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to
https://phabricator.wikimedia.org/P19524 and previous config saved to /var/cache/conftool/dbconfig/20220127-195019-marostegui.json
[19:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:50:43] (PS2) Jdlrobson: Disable A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/757735 (https://phabricator.wikimedia.org/T297924)
[19:50:51] (CR) Jdlrobson: "Clare can you check this one and deploy it along with the idwiki change ?" [mediawiki-config] - https://gerrit.wikimedia.org/r/757735 (https://phabricator.wikimedia.org/T297924) (owner: Jdlrobson)
[19:51:02] (PS2) Jdlrobson: Enable migration mode on all group 0, group 1 and desktop-improvement wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/757733 (https://phabricator.wikimedia.org/T299927)
[19:51:15] (PS2) Jdlrobson: Migration mode enabled everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/757734 (https://phabricator.wikimedia.org/T299927)
[19:52:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:52:37] (PS3) Urbanecm: Remove trusted-xff.php from wmf-config [mediawiki-config] - https://gerrit.wikimedia.org/r/749735 (https://phabricator.wikimedia.org/T298243)
[19:52:49] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 6fa62c58c04929d7327d8f07dbd32b6139f58ccf: Do not set wgTrustedXffFile (T298243) (duration: 00m 51s)
[19:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:52:54] T298243: Finish removal of wgTrustedXffFile - https://phabricator.wikimedia.org/T298243
[19:53:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:53:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:53:22] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:53:24] (CR) Urbanecm: [C: +2] Remove trusted-xff.php from wmf-config [mediawiki-config] - https://gerrit.wikimedia.org/r/749735 (https://phabricator.wikimedia.org/T298243) (owner: Urbanecm)
[19:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:54:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:54:59] (Merged) jenkins-bot: Remove trusted-xff.php from wmf-config [mediawiki-config] - https://gerrit.wikimedia.org/r/749735 (https://phabricator.wikimedia.org/T298243) (owner: Urbanecm)
[19:58:39] !log urbanecm@deploy1002 Synchronized docroot/noc/: 11498603a918863c08300b4abfc69491424ebe14: Remove trusted-xff.php from wmf-config (T298243; 1/3) (duration: 00m 50s)
[19:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:58:43] T298243: Finish removal of wgTrustedXffFile - https://phabricator.wikimedia.org/T298243
[19:59:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:59:56] !log urbanecm@deploy1002 Synchronized wmf-config/: 11498603a918863c08300b4abfc69491424ebe14: Remove trusted-xff.php from wmf-config (T298243; 2/3) (duration: 00m 51s)
[20:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:04] brennen and jeena: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220127T2000).
[20:00:27] o/
[20:00:43] brennen: just a last sync-file remaining
[20:00:43] sorry!
[20:00:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[20:00:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[20:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:01:49] !log urbanecm@deploy1002 Synchronized phpcs.xml: 11498603a918863c08300b4abfc69491424ebe14: Remove trusted-xff.php from wmf-config (T298243; 3/3) (duration: 00m 50s)
[20:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:01:55] urbanecm: no rush!
[20:01:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[20:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:02:02] brennen: I'm done now :)
[20:02:35] cool, thanks. rolling train.
[20:03:12] thanks
[20:03:22] !log train 1.38.0-wmf.19 (T293960): no current blockers; logs clean-ish, rolling train forward to group2
[20:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:29] T293960: 1.38.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T293960
[20:04:10] (PS1) Brennen Bearnes: all wikis to 1.38.0-wmf.19 refs T293960 [mediawiki-config] - https://gerrit.wikimedia.org/r/757738
[20:04:12] (CR) Brennen Bearnes: [C: +2] all wikis to 1.38.0-wmf.19 refs T293960 [mediawiki-config] - https://gerrit.wikimedia.org/r/757738 (owner: Brennen Bearnes)
[20:04:54] (Merged) jenkins-bot: all wikis to 1.38.0-wmf.19 refs T293960 [mediawiki-config] - https://gerrit.wikimedia.org/r/757738 (owner: Brennen Bearnes)
[20:05:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298559)', diff saved to https://phabricator.wikimedia.org/P19525 and previous config saved to
/var/cache/conftool/dbconfig/20220127-200523-marostegui.json
[20:05:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance
[20:05:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance
[20:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:05:35] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[20:05:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T298559)', diff saved to https://phabricator.wikimedia.org/P19526 and previous config saved to /var/cache/conftool/dbconfig/20220127-200535-marostegui.json
[20:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:19] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.19 refs T293960
[20:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298559)', diff saved to https://phabricator.wikimedia.org/P19527 and previous config saved to /var/cache/conftool/dbconfig/20220127-200641-marostegui.json
[20:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[20:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:36] (PS1) Majavah: openstack: remove few more python 2 packages [puppet] - https://gerrit.wikimedia.org/r/757739 (https://phabricator.wikimedia.org/T300254)
[20:08:25] !log
mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[20:08:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[20:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:09:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[20:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:06] SRE: Issue installing ca-certificates-java - https://phabricator.wikimedia.org/T300300 (colewhite)
[20:11:31] SRE: Issue installing ca-certificates-java openjdk 11 - https://phabricator.wikimedia.org/T300300 (colewhite)
[20:11:55] SRE: Issue installing ca-certificates-java openjdk 11 - https://phabricator.wikimedia.org/T300300 (colewhite)
[20:13:03] (PS1) Ssingh: site: add role for durum hosts in drmrs [puppet] - https://gerrit.wikimedia.org/r/757741 (https://phabricator.wikimedia.org/T300158)
[20:15:38] (PS2) Ryan Kemper: elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700
[20:17:32] (CR) jerkins-bot: [V: -1] elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[20:17:53] (PS1) Majavah: backy2: don't install python3-crypto in bullseye [puppet] - https://gerrit.wikimedia.org/r/757742 (https://phabricator.wikimedia.org/T300254)
[20:21:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P19528 and previous config saved to /var/cache/conftool/dbconfig/20220127-202145-marostegui.json
[20:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:30] SRE, SRE-Access-Requests: Requesting access to AQS Cassandra cluster
for Frances Goodwin - https://phabricator.wikimedia.org/T299688 (FGoodwin) I'm set up, thanks so much!
[20:27:46] (CR) Andrew Bogott: [C: +2] backy2: don't install python3-crypto in bullseye [puppet] - https://gerrit.wikimedia.org/r/757742 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[20:29:11] (CR) Michael DiPietro: [C: +1] openstack: remove few more python 2 packages [puppet] - https://gerrit.wikimedia.org/r/757739 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[20:32:16] (PS3) Ryan Kemper: elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700
[20:32:38] (CR) Ryan Kemper: "Removed the befores" [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[20:36:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P19529 and previous config saved to /var/cache/conftool/dbconfig/20220127-203650-marostegui.json
[20:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:34] (PS1) Michael DiPietro: upgrade codfw1dev to bullseye [puppet] - https://gerrit.wikimedia.org/r/757745 (https://phabricator.wikimedia.org/T300254)
[20:37:56] SRE, SRE-Access-Requests: Requesting access to AQS Cassandra cluster for Frances Goodwin - https://phabricator.wikimedia.org/T299688 (jhathaway) Open→Resolved great!
[20:38:04] (CR) Andrew Bogott: [C: +1] upgrade codfw1dev to bullseye [puppet] - https://gerrit.wikimedia.org/r/757745 (https://phabricator.wikimedia.org/T300254) (owner: Michael DiPietro)
[20:39:14] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[20:39:41] SRE-OnFire (FY2021/2022-Q2): 2021-10-29 graphite - https://phabricator.wikimedia.org/T295157 (lmata)
[20:39:54] (CR) Michael DiPietro: [C: +2] openstack: remove few more python 2 packages [puppet] - https://gerrit.wikimedia.org/r/757739 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[20:40:23] (Abandoned) Michael DiPietro: upgrade codfw1dev to bullseye [puppet] - https://gerrit.wikimedia.org/r/757745 (https://phabricator.wikimedia.org/T300254) (owner: Michael DiPietro)
[20:40:26] SRE, SRE-OnFire (FY2021/2022-Q2), Sustainability (Incident Followup): Incident: 2021-12-03 mx2001->Gmail delivery issues - https://phabricator.wikimedia.org/T297127 (lmata)
[20:46:12] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 680001474560 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[20:51:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298559)', diff saved to https://phabricator.wikimedia.org/P19530 and previous config saved to /var/cache/conftool/dbconfig/20220127-205155-marostegui.json
[20:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:01] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[21:03:31] SRE: Issue installing ca-certificates-java openjdk 11 - https://phabricator.wikimedia.org/T300300 (hnowlan) More context for this issue in T289694
[21:19:22] (PS1) Volans: management: remove deprecated module [software/spicerack] -
https://gerrit.wikimedia.org/r/757747
[21:26:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[21:36:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[21:41:57] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0)
[21:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:22] (PS4) JHathaway: [WIP] team-sre: add hardware-related checks [alerts] - https://gerrit.wikimedia.org/r/757489 (https://phabricator.wikimedia.org/T294564) (owner: Volans)
[22:01:23] (CR) JHathaway: "Filippo would you kindly take a look at the reworked alert config" [alerts] - https://gerrit.wikimedia.org/r/757489 (https://phabricator.wikimedia.org/T294564) (owner: Volans)
[22:04:26] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[22:52:22] SRE, Cloud-VPS, cloud-services-team (Kanban): prometheus-rabbitmq-exporter for Debian Bullseye - https://phabricator.wikimedia.org/T300308 (Andrew)
[22:57:30] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol2001-dev.wikimedia.org with OS buster
[22:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:31] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2001-dev.wikimedia.org with OS buster
[23:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:01] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol2001-dev.wikimedia.org with OS buster
[23:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:51] (PS1) Bking: wcqs: populate journal var to fix puppet failures
[puppet] - 10https://gerrit.wikimedia.org/r/757769 (https://phabricator.wikimedia.org/T300310) [23:13:30] (03PS2) 10Ryan Kemper: wcqs: populate journal var to fix puppet failure [puppet] - 10https://gerrit.wikimedia.org/r/757769 (https://phabricator.wikimedia.org/T300310) (owner: 10Bking) [23:17:01] (03PS3) 10Bking: wdqs: populate journal var to fix puppet failure [puppet] - 10https://gerrit.wikimedia.org/r/757769 (https://phabricator.wikimedia.org/T300310) [23:17:03] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/757700 (owner: 10Ryan Kemper) [23:18:02] (03PS4) 10Bking: wdqs: populate journal var to fix puppet failure [puppet] - 10https://gerrit.wikimedia.org/r/757769 (https://phabricator.wikimedia.org/T300310) [23:20:41] (03PS5) 10Bking: wdqs: populate journal var to fix puppet failure [puppet] - 10https://gerrit.wikimedia.org/r/757769 (https://phabricator.wikimedia.org/T300310) [23:21:09] (03PS6) 10Bking: wdqs: populate journal var to fix puppet failure [puppet] - 10https://gerrit.wikimedia.org/r/757769 (https://phabricator.wikimedia.org/T300310) [23:21:51] (03CR) 10Bking: "check-experimental" [puppet] - 10https://gerrit.wikimedia.org/r/757769 (https://phabricator.wikimedia.org/T300310) (owner: 10Bking) [23:21:53] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2001-dev.wikimedia.org with OS buster [23:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:09] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol2001-dev.wikimedia.org with OS bullseye [23:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:04] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/757769 (https://phabricator.wikimedia.org/T300310) (owner: 10Bking) [23:25:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - 
https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [23:30:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [23:32:19] (03PS7) 10Bking: wdqs: populate journal var to fix puppet failure [puppet] - 10https://gerrit.wikimedia.org/r/757769 (https://phabricator.wikimedia.org/T300310) [23:32:42] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/757769 (https://phabricator.wikimedia.org/T300310) (owner: 10Bking) [23:41:04] (03CR) 10Ryan Kemper: [C: 03+1] wdqs: populate journal var to fix puppet failure [puppet] - 10https://gerrit.wikimedia.org/r/757769 (https://phabricator.wikimedia.org/T300310) (owner: 10Bking) [23:41:10] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: populate journal var to fix puppet failure [puppet] - 10https://gerrit.wikimedia.org/r/757769 (https://phabricator.wikimedia.org/T300310) (owner: 10Bking) [23:48:08] (03PS1) 10Bking: wdqs: populate journal var to fix puppet failure [puppet] - 10https://gerrit.wikimedia.org/r/757772 [23:49:38] (03PS2) 10Bking: wdqs: add missing hiera var for internal [puppet] - 10https://gerrit.wikimedia.org/r/757772 (https://phabricator.wikimedia.org/T300310) [23:49:52] (03PS3) 10Ryan Kemper: wdqs: add missing hiera var for internal [puppet] - 10https://gerrit.wikimedia.org/r/757772 (https://phabricator.wikimedia.org/T300310) (owner: 10Bking) [23:49:55] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/757772 (https://phabricator.wikimedia.org/T300310) (owner: 10Bking) [23:50:19] (03PS4) 10Ryan Kemper: wdqs: add missing hiera var for internal [puppet] - 10https://gerrit.wikimedia.org/r/757772 (https://phabricator.wikimedia.org/T300310) (owner: 10Bking) [23:50:54] (03PS1) 10Cwhite: hiera: set domainrw to grafana-next-rw in codfw [puppet] - 10https://gerrit.wikimedia.org/r/757774 (https://phabricator.wikimedia.org/T282863) 
[23:50:56] (03PS1) 10Cwhite: graphite: add grafana-next-rw to cors origins [puppet] - 10https://gerrit.wikimedia.org/r/757775 (https://phabricator.wikimedia.org/T282863) [23:51:00] (03PS1) 10Cwhite: idp, grafana: configure grafana-next-rw for sso [puppet] - 10https://gerrit.wikimedia.org/r/757776 (https://phabricator.wikimedia.org/T282863) [23:51:02] (03PS1) 10Cwhite: hiera: add grafana-next-rw to grafana public_aliases [puppet] - 10https://gerrit.wikimedia.org/r/757777 (https://phabricator.wikimedia.org/T282863) [23:51:04] (03PS1) 10Cwhite: hiera: configure mapping and cache rules for grafana-next-rw [puppet] - 10https://gerrit.wikimedia.org/r/757778 (https://phabricator.wikimedia.org/T282863) [23:51:53] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/757772 (https://phabricator.wikimedia.org/T300310) (owner: 10Bking) [23:53:05] (03PS1) 10Cwhite: wikimedia.org: add grafana-next-rw [dns] - 10https://gerrit.wikimedia.org/r/757780 (https://phabricator.wikimedia.org/T282863) [23:57:00] (03CR) 10Catrope: doc.wikimedia.org CSP: Also allow images from upload.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757049 (https://phabricator.wikimedia.org/T285570) (owner: 10Catrope) [23:59:03] (03CR) 10Clare Ming: [C: 03+1] Disable A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757735 (https://phabricator.wikimedia.org/T297924) (owner: 10Jdlrobson)