[00:00:05] <jouncebot>	 RoanKattouw and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220127T0000).
[00:00:05] <jouncebot>	 nn1l2: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:09] <nn1l2>	 hi
[00:01:48] <RoanKattouw>	 Hello! I'm out on a walk so I can't do the deployment right now. I hope someone else is around to do it; if not, I'll be home in 30 mins
[00:02:18] <nn1l2>	 Thanks!
[00:02:24] <James_F>	 We're back-porting a train blocker right now so deployments probably should be on hold anyway.
[00:03:07] <nn1l2>	 no rush, I can wait for the whole window (60 mins)
[00:03:24] <nn1l2>	 just ping me when ready, please
[00:08:36] <brennen>	 CI is in a bad state; this window may or may not happen.
[00:08:56] <brennen>	 ^ nn1l2 
[00:09:27] <nn1l2>	 thanks for the notification!
[00:09:39] <nn1l2>	 Anyway, I will be still around
[00:09:49] <brennen>	 sure thing; apologies for the interruption in service.
[00:21:15] <icinga-wm>	 PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:24:41] <thcipriani>	 !log restarting jenkins
[00:24:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:25] <wikibugs>	 SRE, observability: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (colewhite) This is a good step forward.  Thank you!  I realize deployment-prep may not be in scope for this project, but we have a vested interest in keeping [[ https://beta-logs.wikimedia.org...
[00:37:28] <RoanKattouw>	 I'm back now
[00:37:39] <RoanKattouw>	 brennen: Is the CI failure specific to wmf.19?
[00:38:13] <brennen>	 RoanKattouw: it is not, but turning it off and back on seems to have resolved CI issues for the moment
[00:39:33] <RoanKattouw>	 OK. Should I use the config change for this window as a guinea pig for that, or would you prefer that we delay that change to tomorrow?
[00:40:40] <brennen>	 RoanKattouw: config change for this window seems fine, go ahead.  i'm also (cc: James_F) good with landing the wmf.19 patch, but will hold rolling train forward until tomorrow.
[00:40:54] <RoanKattouw>	 Great, thanks
[00:41:06] <brennen>	 i'd prefer to have whatever else is going to break during wmf.19 on group1 do so during someone's actual workday. :)
[00:41:09] <RoanKattouw>	 nn1l2: Alright, we're in business, I'll have your config patch ready for testing in a few minutes
[00:41:26] <James_F>	 brennen: Ack.
[00:41:29] <wikibugs>	 (CR) Catrope: [C: +2] commonswiki: Add leg.journals.isu.ac.ir to the wgCopyUploadsDomains allowlist [mediawiki-config] - https://gerrit.wikimedia.org/r/757549 (https://phabricator.wikimedia.org/T300217) (owner: 4nn1l2)
[00:41:29] <nn1l2>	 thanks!
[00:42:08] <wikibugs>	 (CR) Cwhite: [V: +2 C: +2] prepare for logstash 7.16.3 [software/logstash/plugins] - https://gerrit.wikimedia.org/r/755041 (https://phabricator.wikimedia.org/T299168) (owner: Cwhite)
[00:42:25] <wikibugs>	 (Merged) jenkins-bot: commonswiki: Add leg.journals.isu.ac.ir to the wgCopyUploadsDomains allowlist [mediawiki-config] - https://gerrit.wikimedia.org/r/757549 (https://phabricator.wikimedia.org/T300217) (owner: 4nn1l2)
[00:47:15] <RoanKattouw>	 nn1l2: Alright, test away on mwdebug1002
[00:48:43] <nn1l2>	 test failed :(
[00:48:45] <nn1l2>	 HTTP request timed out.
[00:48:46] <nn1l2>	 There was a problem during the HTTP request: 0 Error
[00:49:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:49:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:50:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:50:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:50:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:50:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:50:51] <nn1l2>	 probably similar to https://phabricator.wikimedia.org/T299247#7628032
[00:51:22] <nn1l2>	 second test failed too
[00:51:24] <nn1l2>	 I think we can revert
[00:51:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:51:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:51] <nn1l2>	 before reverting, maybe we can test it on mwdebug1001?
[00:53:27] <nn1l2>	 RoanKattouw, you still there?
[00:53:51] <RoanKattouw>	 Yes, I'm back. Sorry, I got distracted with something else
[00:54:00] <RoanKattouw>	 Let's try 1001
[00:54:32] <RoanKattouw>	 nn1l2: OK, 1001 is ready for testing
[00:55:43] <nn1l2>	 test failed again :(
[00:55:55] <nn1l2>	 Can we postpone reverting until tomorrow?
[00:56:11] <nn1l2>	 I want to test it again some hours later
[00:56:45] <nn1l2>	 problems with Iranian urls are common :-(
[00:57:05] <RoanKattouw>	 Hah I can imagine
[00:57:23] <nn1l2>	 Thanks!
[00:57:29] <RoanKattouw>	 So the only broken thing is that if you try to upload from this URL you get an error, but otherwise the site works, right?
[00:57:39] <nn1l2>	 So, I think we are done here
[00:57:41] <RoanKattouw>	 If so, I can deploy this, and we can tweak/revert another day
[00:57:43] <nn1l2>	 yes exactly
[00:57:47] <RoanKattouw>	 OK
[00:58:58] <logmsgbot>	 !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:757549|commonswiki: Add leg.journals.isu.ac.ir to the wgCopyUploadsDomains allowlist (T300217)]] (duration: 00m 55s)
[00:59:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:03] <stashbot>	 T300217: Add leg.journals.isu.ac.ir to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T300217
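The config change above adds a host to the Commons upload-by-URL allowlist, yet the tests on mwdebug still timed out. As a rough illustration of why those two things are independent, here is a minimal Python sketch of an allowlist check: it only decides whether a URL's host may be fetched at all, and says nothing about whether the remote server will answer in time. The function `host_is_allowed`, the leading `*.` wildcard rule, and the `*.example.org` entry are hypothetical simplifications, not MediaWiki's actual upload-by-URL code.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; only leg.journals.isu.ac.ir comes from the log above.
ALLOWLIST = ["leg.journals.isu.ac.ir", "*.example.org"]


def host_is_allowed(url: str, allowlist: list[str]) -> bool:
    """Return True if the URL's host exactly matches an entry, or matches an
    entry of the form '*.domain' (assumed wildcard rule, simplified)."""
    host = (urlparse(url).hostname or "").lower()
    for entry in allowlist:
        entry = entry.lower()
        if entry.startswith("*."):
            suffix = entry[1:]  # keeps the leading dot, e.g. ".example.org"
            # Require at least one label before the suffix.
            if host.endswith(suffix) and len(host) > len(suffix):
                return True
        elif host == entry:
            return True
    return False
```

Passing this check only permits the fetch; a slow or unreachable remote host still produces an "HTTP request timed out" error like the one reported in the log.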
[01:00:04] <jouncebot>	 twentyafterfour: Your horoscope predicts another unfortunate Phabricator update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220127T0100).
[01:05:08] <wikibugs>	 (CR) Brennen Bearnes: "recheck" [core] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/757473 (https://phabricator.wikimedia.org/T300194) (owner: Ladsgroup)
[01:11:31] <nn1l2>	 RoanKattouw: just to be sure, the patch got deployed. Yeah?
[01:11:40] <RoanKattouw>	 Yes it did
[01:11:44] <nn1l2>	 Thanks!
[01:14:38] <brennen>	 hrm, evidently recheck comment doesn't do what i expect here.
[01:17:50] <Krinkle>	 it seems something is making files disappear from that jenkins agent, independently of and outside the job execution
[01:18:00] <Krinkle>	 half-way through, a random CLI command fails with MessagesEn.php missing
[01:18:15] <Krinkle>	 and after that the workspace cleanup command fails with the shell not finding any files anywhere
[01:18:32] <Krinkle>	 or maybe docker losing track of the hard drive
[01:18:36] <brennen>	 Krinkle: see #-releng backscroll; there were multiple agents running due to a misconfiguration.
[01:18:51] <Krinkle>	 ack
[01:19:10] <Krinkle>	 it's running now fwiw https://integration.wikimedia.org/zuul/#q=757473
[01:19:23] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 59.23 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[01:21:45] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 103.2 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[01:22:29] <icinga-wm>	 RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:26:06] <wikibugs>	 SRE, SRE-Access-Requests: Requesting update to SSH key and Kerberos for Joseph Seddon - https://phabricator.wikimedia.org/T299988 (Seddon) >>! In T299988#7654203, @jhathaway wrote: > @MarkTraceur & @Ottomata please approve  Just to note that @MarkTraveur approved above     >>! In T299988#7649407, @MarkTr...
[01:33:40] <wikibugs>	 (CR) Brennen Bearnes: [C: -2] "Will deploy this in local morning before rolling wmf.19 to group1." [core] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/757473 (https://phabricator.wikimedia.org/T300194) (owner: Ladsgroup)
[01:34:13] <jinxer-wm>	 (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245  - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[01:35:45] * brennen calls it a day.
[01:39:13] <jinxer-wm>	 (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245  - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[02:05:29] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:14:01] <wikibugs>	 SRE, ops-codfw: Dell switches testing - https://phabricator.wikimedia.org/T290133 (Papaul)
[04:01:44] <Krinkle>	 !log grafana: Temporarily silence resourceloader alert for INM satisfaction ratio, pending T298520. 
[04:01:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:01:50] <stashbot>	 T298520: Investigate INM Satisfaction alert as of 2021-12-17 - https://phabricator.wikimedia.org/T298520
[04:14:35] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:09:27] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:31:13] <jinxer-wm>	 (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245  - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[05:36:13] <jinxer-wm>	 (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245  - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[06:07:39] <icinga-wm>	 RECOVERY - dump of es4 in eqiad on alert1001 is OK: Last dump for es4 at eqiad (es1022.eqiad.wmnet) taken on 2022-01-26 11:06:33 (2674 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[06:22:35] <icinga-wm>	 RECOVERY - Maps - OSM synchronization lag - eqiad on alert1001 is OK: (C)2.592e+05 ge (W)1.764e+05 ge 1.742e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1
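The Icinga messages above encode their thresholds inline: the Varnish traffic-drop alert treats low values as bad (`59.23 le 60` is CRITICAL, `(C)60 le (W)70 le 103.2` is OK), while the Maps OSM lag alert treats high values as bad (`(C)2.592e+05 ge (W)1.764e+05 ge 1.742e+05` is OK). A minimal Python sketch of that comparison, assuming a state fires as soon as the value reaches its threshold; the function names are hypothetical and this is not the actual Icinga or Prometheus check code.

```python
def classify_low_is_bad(value: float, warning: float, critical: float) -> str:
    """State for metrics where small values are bad (the 'le' form)."""
    if value <= critical:
        return "CRITICAL"
    if value <= warning:
        return "WARNING"
    return "OK"


def classify_high_is_bad(value: float, warning: float, critical: float) -> str:
    """State for metrics where large values are bad (the 'ge' form)."""
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"
```

With the values from the log, the Varnish drop to 59.23 classifies as CRITICAL and the recovery at 103.2 as OK, while an OSM lag of 1.742e+05 sits just under the 1.764e+05 warning threshold and classifies as OK.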
[06:54:00] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[06:54:02] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[06:54:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T298559)', diff saved to https://phabricator.wikimedia.org/P19376 and previous config saved to /var/cache/conftool/dbconfig/20220127-065406-marostegui.json
[06:54:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:11] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[06:55:10] <wikibugs>	 (CR) Elukey: "Thanks a lot Cole!" [puppet] - https://gerrit.wikimedia.org/r/757535 (owner: Cwhite)
[06:55:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298559)', diff saved to https://phabricator.wikimedia.org/P19377 and previous config saved to /var/cache/conftool/dbconfig/20220127-065519-marostegui.json
[06:55:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:50] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[06:55:51] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[06:55:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:22] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[07:04:24] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[07:04:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T285149)', diff saved to https://phabricator.wikimedia.org/P19378 and previous config saved to /var/cache/conftool/dbconfig/20220127-070428-marostegui.json
[07:04:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:34] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[07:04:42] <wikibugs>	 (PS1) Marostegui: Revert "es2021: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/757475
[07:05:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove weight from es1021', diff saved to https://phabricator.wikimedia.org/P19379 and previous config saved to /var/cache/conftool/dbconfig/20220127-070532-marostegui.json
[07:05:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove watchlist from s8 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P19380 and previous config saved to /var/cache/conftool/dbconfig/20220127-070557-marostegui.json
[07:06:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:01] <stashbot>	 T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127
[07:06:05] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={LIST,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:06:42] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "es2021: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/757475 (owner: Marostegui)
[07:07:35] <wikibugs>	 SRE, observability: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (elukey) When I worked with John on T296089 we wanted to give a way to deploy bundles across realms, so it is in scope to migrate deployment-prep as well if it is a critical piece of your testin...
[07:08:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1131 T299479', diff saved to https://phabricator.wikimedia.org/P19381 and previous config saved to /var/cache/conftool/dbconfig/20220127-070821-marostegui.json
[07:08:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:26] <stashbot>	 T299479: Upgrade s6 to Bullseye - https://phabricator.wikimedia.org/T299479
[07:08:27] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:08:57] <wikibugs>	 (PS1) Marostegui: db1131: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/757558 (https://phabricator.wikimedia.org/T299479)
[07:10:06] <wikibugs>	 (CR) Marostegui: [C: +2] db1131: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/757558 (https://phabricator.wikimedia.org/T299479) (owner: Marostegui)
[07:10:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P19382 and previous config saved to /var/cache/conftool/dbconfig/20220127-071023-marostegui.json
[07:10:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:39] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1131.eqiad.wmnet with OS bullseye
[07:11:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T285149)', diff saved to https://phabricator.wikimedia.org/P19383 and previous config saved to /var/cache/conftool/dbconfig/20220127-071355-marostegui.json
[07:13:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:00] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[07:17:03] <logmsgbot>	 !log marostegui@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1131.eqiad.wmnet with OS bullseye
[07:17:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:23] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1131.eqiad.wmnet with OS bullseye
[07:17:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P19384 and previous config saved to /var/cache/conftool/dbconfig/20220127-072528-marostegui.json
[07:25:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P19385 and previous config saved to /var/cache/conftool/dbconfig/20220127-072900-marostegui.json
[07:29:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298559)', diff saved to https://phabricator.wikimedia.org/P19386 and previous config saved to /var/cache/conftool/dbconfig/20220127-074033-marostegui.json
[07:40:35] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[07:40:37] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[07:40:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:38] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[07:40:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:45] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[07:40:47] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[07:40:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:56] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[07:40:57] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[07:40:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T298559)', diff saved to https://phabricator.wikimedia.org/P19387 and previous config saved to /var/cache/conftool/dbconfig/20220127-074101-marostegui.json
[07:41:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298559)', diff saved to https://phabricator.wikimedia.org/P19388 and previous config saved to /var/cache/conftool/dbconfig/20220127-074214-marostegui.json
[07:42:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P19389 and previous config saved to /var/cache/conftool/dbconfig/20220127-074404-marostegui.json
[07:44:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:57] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1131.eqiad.wmnet with OS bullseye
[07:46:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:03] <wikibugs>	 (PS1) Ladsgroup: Don't consider lock waits to be write queries [core] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/757476 (https://phabricator.wikimedia.org/T300194)
[07:52:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 1%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19390 and previous config saved to /var/cache/conftool/dbconfig/20220127-075229-root.json
[07:52:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:37] <wikibugs>	 (PS1) Marostegui: Revert "db1131: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/757477
[07:53:15] <wikibugs>	 (CR) Ladsgroup: [C: +2] Don't consider lock waits to be write queries [core] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/757476 (https://phabricator.wikimedia.org/T300194) (owner: Ladsgroup)
[07:55:20] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1131: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/757477 (owner: Marostegui)
[07:57:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P19391 and previous config saved to /var/cache/conftool/dbconfig/20220127-075718-marostegui.json
[07:57:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T285149)', diff saved to https://phabricator.wikimedia.org/P19392 and previous config saved to /var/cache/conftool/dbconfig/20220127-075909-marostegui.json
[07:59:11] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[07:59:12] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[07:59:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:14] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[07:59:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:13] <wikibugs>	 (Abandoned) Ladsgroup: Revert "rdbms: cleanup the use of QUERY_ flags to query() in Database" [core] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/757473 (https://phabricator.wikimedia.org/T300194) (owner: Ladsgroup)
[08:07:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 5%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19393 and previous config saved to /var/cache/conftool/dbconfig/20220127-080733-root.json
[08:07:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:39] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[08:07:40] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[08:07:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:04] <wikibugs>	 (Merged) jenkins-bot: Don't consider lock waits to be write queries [core] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/757476 (https://phabricator.wikimedia.org/T300194) (owner: Ladsgroup)
[08:12:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P19394 and previous config saved to /var/cache/conftool/dbconfig/20220127-081223-marostegui.json
[08:12:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:43] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.19/includes/libs/rdbms/database/Database.php: Backport: [[gerrit:757476|Don't consider lock waits to be write queries (T300194)]] (duration: 00m 52s)
[08:13:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:47] <stashbot>	 T300194: Wikimedia\Rdbms\DBTransactionSizeError: Transaction spent 3.6s in writes, exceeding the 3s limit - https://phabricator.wikimedia.org/T300194
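T300194 concerns a per-transaction budget on time spent in writes (3s in the error above), and the backported fix stops lock waits from counting toward that budget. A minimal Python sketch of the accounting idea, assuming the caller tags each query as a write and/or a lock wait; `WriteTimeTracker` and `TransactionSizeError` are hypothetical names and this is not the `Wikimedia\Rdbms` implementation.

```python
from dataclasses import dataclass


class TransactionSizeError(Exception):
    """Raised when accumulated write time exceeds the budget."""


@dataclass
class WriteTimeTracker:
    limit_s: float = 3.0   # per-transaction write-time budget
    spent_s: float = 0.0   # write time accumulated so far

    def record(self, elapsed_s: float, is_write: bool, is_lock_wait: bool = False) -> None:
        # Lock waits are excluded: time spent waiting for a lock is not
        # time spent writing, so it should not consume the budget.
        if not is_write or is_lock_wait:
            return
        self.spent_s += elapsed_s
        if self.spent_s > self.limit_s:
            raise TransactionSizeError(
                f"Transaction spent {self.spent_s:.1f}s in writes, "
                f"exceeding the {self.limit_s:g}s limit"
            )
```

Under this sketch, a long lock wait tagged `is_lock_wait=True` no longer pushes an otherwise small transaction over the limit, which matches the intent stated in the patch title.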
[08:16:12] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[08:16:13] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[08:16:14] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[08:16:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:18] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[08:16:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T285149)', diff saved to https://phabricator.wikimedia.org/P19395 and previous config saved to /var/cache/conftool/dbconfig/20220127-081622-marostegui.json
[08:16:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:27] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[08:16:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[08:16:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[08:17:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[08:17:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[08:19:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:07] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[08:21:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:53] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[08:21:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19396 and previous config saved to /var/cache/conftool/dbconfig/20220127-082236-root.json
[08:22:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298559)', diff saved to https://phabricator.wikimedia.org/P19397 and previous config saved to /var/cache/conftool/dbconfig/20220127-082728-marostegui.json
[08:27:29] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[08:27:31] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[08:27:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:33] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[08:27:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T298559)', diff saved to https://phabricator.wikimedia.org/P19398 and previous config saved to /var/cache/conftool/dbconfig/20220127-082735-marostegui.json
[08:27:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:44] <wikibugs>	 (CR) Muehlenhoff: Create a separate puppetboard-idptest.wikimedia.org vhost in idp-staging (2 comments) [puppet] - https://gerrit.wikimedia.org/r/757450 (owner: Muehlenhoff)
[08:27:48] <wikibugs>	 (PS3) Muehlenhoff: Create a separate puppetboard-idptest.wikimedia.org vhost in idp-staging [puppet] - https://gerrit.wikimedia.org/r/757450
[08:28:40] <wikibugs>	 (CR) jerkins-bot: [V: -1] Create a separate puppetboard-idptest.wikimedia.org vhost in idp-staging [puppet] - https://gerrit.wikimedia.org/r/757450 (owner: Muehlenhoff)
[08:28:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298559)', diff saved to https://phabricator.wikimedia.org/P19399 and previous config saved to /var/cache/conftool/dbconfig/20220127-082847-marostegui.json
[08:28:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:53] <wikibugs>	 (CR) JMeybohm: [C: -1] "I think you might be rewriting this over and over again because you have this commit in different local branches or something. "git review" [puppet] - https://gerrit.wikimedia.org/r/754960 (https://phabricator.wikimedia.org/T288345) (owner: AOkoth)
[08:33:32] <jayme>	 !log uploaded scap 4.2.1 to apt.wikimedia.org - T300058
[08:33:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:36] <stashbot>	 T300058: Deploy Scap version 4.2.1 - https://phabricator.wikimedia.org/T300058
[08:34:22] <wikibugs>	 (PS4) Muehlenhoff: Create a separate puppetboard-idptest.wikimedia.org vhost in idp-staging [puppet] - https://gerrit.wikimedia.org/r/757450
[08:37:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 20%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19400 and previous config saved to /var/cache/conftool/dbconfig/20220127-083740-root.json
[08:37:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:37] <jayme>	 !log updated scap to 4.2.1 on A:mw-canary, A:parsoid-canary, A:mw-jobrunner-canary, A:restbase-canary - T300058
[08:38:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:43] <stashbot>	 T300058: Deploy Scap version 4.2.1 - https://phabricator.wikimedia.org/T300058
[08:40:58] <logmsgbot>	 !log jayme@deploy1002 Started deploy [restbase/deploy@0848b15]: scap testing
[08:41:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:03] <logmsgbot>	 !log jayme@deploy1002 Finished deploy [restbase/deploy@0848b15]: scap testing (duration: 00m 05s)
[08:41:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:27] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/757450 (owner: Muehlenhoff)
[08:43:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P19401 and previous config saved to /var/cache/conftool/dbconfig/20220127-084352-marostegui.json
[08:43:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19402 and previous config saved to /var/cache/conftool/dbconfig/20220127-085244-root.json
[08:52:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:40] <wikibugs>	 (PS3) Filippo Giunchedi: site: add Prometheus role to eqiad hardware [puppet] - https://gerrit.wikimedia.org/r/756604 (https://phabricator.wikimedia.org/T296199)
[08:58:42] <wikibugs>	 (PS1) Filippo Giunchedi: conftool: add prometheus200[56] [puppet] - https://gerrit.wikimedia.org/r/757612 (https://phabricator.wikimedia.org/T296199)
[08:58:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P19403 and previous config saved to /var/cache/conftool/dbconfig/20220127-085857-marostegui.json
[08:59:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:54] <wikibugs>	 (PS9) Thiemo Kreuz (WMDE): Make use of the ?? operator in some more situations [mediawiki-config] - https://gerrit.wikimedia.org/r/740305
[09:00:07] <wikibugs>	 (PS3) Thiemo Kreuz (WMDE): Make use of the ?? operator in more trivial situations [mediawiki-config] - https://gerrit.wikimedia.org/r/740304
[09:01:08] <wikibugs>	 (PS3) JMeybohm: Upgrade codfw kubernetes masters to tainted full nodes [puppet] - https://gerrit.wikimedia.org/r/757434 (https://phabricator.wikimedia.org/T290967)
[09:01:57] <wikibugs>	 (PS2) Filippo Giunchedi: conftool: add prometheus200[56] [puppet] - https://gerrit.wikimedia.org/r/757612 (https://phabricator.wikimedia.org/T296199)
[09:02:57] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] conftool: add prometheus200[56] [puppet] - https://gerrit.wikimedia.org/r/757612 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi)
[09:04:13] <wikibugs>	 (PS1) Hashar: ci: ensure rsync is on all WMCS CI agents [puppet] - https://gerrit.wikimedia.org/r/757613 (https://phabricator.wikimedia.org/T300236)
[09:06:58] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/757613 (https://phabricator.wikimedia.org/T300236) (owner: Hashar)
[09:07:45] <wikibugs>	 (PS10) Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible [mediawiki-config] - https://gerrit.wikimedia.org/r/737859
[09:07:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19404 and previous config saved to /var/cache/conftool/dbconfig/20220127-090747-root.json
[09:07:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:14] <wikibugs>	 (PS2) DCausse: aptrepo: add an elastic68 component [puppet] - https://gerrit.wikimedia.org/r/757046 (https://phabricator.wikimedia.org/T295666)
[09:14:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298559)', diff saved to https://phabricator.wikimedia.org/P19405 and previous config saved to /var/cache/conftool/dbconfig/20220127-091401-marostegui.json
[09:14:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:07] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[09:14:07] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance
[09:14:09] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance
[09:14:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:10] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance
[09:14:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:18] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance
[09:14:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:35] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[09:14:36] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[09:14:37] <wikibugs>	 (PS4) JMeybohm: Upgrade codfw kubernetes masters to tainted full nodes [puppet] - https://gerrit.wikimedia.org/r/757434 (https://phabricator.wikimedia.org/T290967)
[09:14:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:39] <wikibugs>	 (PS1) JMeybohm: Upgrade eqiad kubernetes masters to tainted full nodes [puppet] - https://gerrit.wikimedia.org/r/757615 (https://phabricator.wikimedia.org/T290967)
[09:14:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T298559)', diff saved to https://phabricator.wikimedia.org/P19406 and previous config saved to /var/cache/conftool/dbconfig/20220127-091440-marostegui.json
[09:14:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298559)', diff saved to https://phabricator.wikimedia.org/P19407 and previous config saved to /var/cache/conftool/dbconfig/20220127-091453-marostegui.json
[09:14:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:07] <wikibugs>	 (PS1) Muehlenhoff: Remove now obsolete template [puppet] - https://gerrit.wikimedia.org/r/757616
[09:16:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T285149)', diff saved to https://phabricator.wikimedia.org/P19408 and previous config saved to /var/cache/conftool/dbconfig/20220127-091641-marostegui.json
[09:16:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:46] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[09:17:26] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Looks good, merging." [puppet] - https://gerrit.wikimedia.org/r/757046 (https://phabricator.wikimedia.org/T295666) (owner: DCausse)
[09:17:28] <jinxer-wm>	 (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org
[09:18:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti1007.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage
[09:18:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1007.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage
[09:18:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:47] <wikibugs>	 SRE, ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (MoritzMuehlenhoff) One more server is ready and downtimed; ganeti1007
[09:20:02] <wikibugs>	 SRE, ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (MoritzMuehlenhoff)
[09:22:28] <jinxer-wm>	 (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org
[09:22:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19409 and previous config saved to /var/cache/conftool/dbconfig/20220127-092251-root.json
[09:22:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:38] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2078,2132].codfw.wmnet,db[1117,1128,1159].eqiad.wmnet with reason: Primary switchover m1 T299624
[09:23:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:43] <stashbot>	 T299624: Switchover m1 master (db1159 -> db1128) - https://phabricator.wikimedia.org/T299624
[09:23:43] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2078,2132].codfw.wmnet,db[1117,1128,1159].eqiad.wmnet with reason: Primary switchover m1 T299624
[09:23:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:32] <wikibugs>	 SRE-tools, Infrastructure-Foundations, cloud-services-team (Kanban): spicerack: introduce GridEngine controller - https://phabricator.wikimedia.org/T300032 (Volans) p:Triage→Medium
[09:27:15] <logmsgbot>	 !log filippo@puppetmaster1001 conftool action : set/weight=10; selector: name=prometheus2006.codfw.wmnet
[09:27:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:26] <logmsgbot>	 !log filippo@puppetmaster1001 conftool action : set/weight=10; selector: name=prometheus2005.codfw.wmnet
[09:27:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:11] <wikibugs>	 (PS3) Marostegui: mariadb: Promote db1128 to m1 master [puppet] - https://gerrit.wikimedia.org/r/757389 (https://phabricator.wikimedia.org/T299624)
[09:29:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P19410 and previous config saved to /var/cache/conftool/dbconfig/20220127-092957-marostegui.json
[09:30:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P19411 and previous config saved to /var/cache/conftool/dbconfig/20220127-093146-marostegui.json
[09:31:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:36] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/757450 (owner: Muehlenhoff)
[09:35:36] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote db1128 to m1 master [puppet] - https://gerrit.wikimedia.org/r/757389 (https://phabricator.wikimedia.org/T299624) (owner: Marostegui)
[09:36:24] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/757616 (owner: Muehlenhoff)
[09:37:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19412 and previous config saved to /var/cache/conftool/dbconfig/20220127-093755-root.json
[09:37:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1027.eqiad.wmnet
[09:41:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:53] <wikibugs>	 (CR) Ayounsi: [C: +1] Add k8s masters in codfw eBGP config (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/757437 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[09:42:54] <wikibugs>	 (PS1) Marostegui: switchover-tmpl.sh: Tendril is no more [software] - https://gerrit.wikimedia.org/r/757617 (https://phabricator.wikimedia.org/T297605)
[09:44:08] <wikibugs>	 (CR) Marostegui: [C: +2] switchover-tmpl.sh: Tendril is no more [software] - https://gerrit.wikimedia.org/r/757617 (https://phabricator.wikimedia.org/T297605) (owner: Marostegui)
[09:44:38] <wikibugs>	 (Merged) jenkins-bot: switchover-tmpl.sh: Tendril is no more [software] - https://gerrit.wikimedia.org/r/757617 (https://phabricator.wikimedia.org/T297605) (owner: Marostegui)
[09:45:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P19413 and previous config saved to /var/cache/conftool/dbconfig/20220127-094502-marostegui.json
[09:45:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P19414 and previous config saved to /var/cache/conftool/dbconfig/20220127-094651-marostegui.json
[09:46:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1027.eqiad.wmnet
[09:47:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:42] <logmsgbot>	 !log hnowlan@deploy1002 Started deploy [restbase/deploy@0848b15] (dev-cluster): (no justification provided)
[09:50:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:56] <logmsgbot>	 !log hnowlan@deploy1002 Finished deploy [restbase/deploy@0848b15] (dev-cluster): (no justification provided) (duration: 00m 14s)
[09:50:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1027.eqiad.wmnet to ganeti01.svc.eqiad.wmnet
[09:51:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19415 and previous config saved to /var/cache/conftool/dbconfig/20220127-095258-root.json
[09:53:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:11] <moritzm>	 !log added ganeti1027 to Ganeti eqiad cluster T293909
[09:53:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:14] <stashbot>	 T293909: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909
[09:53:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1027.eqiad.wmnet to ganeti01.svc.eqiad.wmnet
[09:53:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:38] <jynus>	 !log Stopped Bacula Director Daemon service at backup1001 T299624
[09:57:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:43] <stashbot>	 T299624: Switchover m1 master (db1159 -> db1128) - https://phabricator.wikimedia.org/T299624
[09:58:49] <wikibugs>	 (PS13) Jbond: RepoSync: add new class to mana syncing repositories [software/spicerack] - https://gerrit.wikimedia.org/r/747116
[09:59:24] <wikibugs>	 (CR) Elukey: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/757434 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[09:59:51] <kormat>	 jouncebot: nowandnext
[09:59:51] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 0 minute(s)
[09:59:51] <jouncebot>	 In 1 hour(s) and 0 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220127T1100)
[10:00:02] <marostegui>	 !log Failover m1 from db1159 to db1128 - T299624
[10:00:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298559)', diff saved to https://phabricator.wikimedia.org/P19416 and previous config saved to /var/cache/conftool/dbconfig/20220127-100007-marostegui.json
[10:00:08] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[10:00:10] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[10:00:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:11] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[10:00:11] <Amir1>	 o/
[10:00:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:14] <kormat>	 o/
[10:00:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T298559)', diff saved to https://phabricator.wikimedia.org/P19417 and previous config saved to /var/cache/conftool/dbconfig/20220127-100014-marostegui.json
[10:00:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:19] <marostegui>	 going for it
[10:00:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:51] <marostegui>	 all done
[10:00:53] <marostegui>	 checking services
[10:01:05] <jynus>	 prometheus monitoring may complain while bacula is down, that is expected
[10:01:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298559)', diff saved to https://phabricator.wikimedia.org/P19418 and previous config saved to /var/cache/conftool/dbconfig/20220127-100127-marostegui.json
[10:01:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:32] <marostegui>	 etherpad might need a restart
[10:01:35] <marostegui>	 on my way!
[10:01:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T285149)', diff saved to https://phabricator.wikimedia.org/P19419 and previous config saved to /var/cache/conftool/dbconfig/20220127-100155-marostegui.json
[10:01:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:00] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[10:02:00] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[10:02:02] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[10:02:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:03] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance
[10:02:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:08] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance
[10:02:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:28] <marostegui>	 etherpad is back
[10:02:58] <marostegui>	 librenms looks good
[10:03:03] <Amir1>	 welcome back etherpad
[10:03:09] <Emperor>	 glorious victory
[10:03:17] <jynus>	 any service whose "expert" is not around I can check?
[10:03:52] <marostegui>	 orchestrator is clean
[10:03:59] <marostegui>	 jynus: I am thinking about cas/pki
[10:04:03] <marostegui>	 but not sure how we can test them
[10:04:13] <marostegui>	 jbond: moritzm can you confirm those are ok? ^
[10:04:51] <marostegui>	 jynus: want me to merge https://gerrit.wikimedia.org/r/755960 ?
[10:04:52] <wikibugs>	 (CR) jerkins-bot: [V: -1] RepoSync: add new class to mana syncing repositories [software/spicerack] - https://gerrit.wikimedia.org/r/747116 (owner: Jbond)
[10:05:00] <jynus>	 marostegui: sure
[10:05:12] <wikibugs>	 (CR) Marostegui: [C: +2] dbbackups: Manually switchover primary stats db db1159 -> db1128 [puppet] - https://gerrit.wikimedia.org/r/755960 (https://phabricator.wikimedia.org/T299624) (owner: Jcrespo)
[10:05:34] <marostegui>	 jynus: done
[10:06:51] <jynus>	 ok, let me see if the checks are working on the right db
[10:06:55] <wikibugs>	 (PS1) Filippo Giunchedi: snmp_exporter: allow polling from all prometheus hosts [puppet] - https://gerrit.wikimedia.org/r/757618 (https://phabricator.wikimedia.org/T207292)
[10:06:57] <wikibugs>	 (PS1) Marostegui: db1159: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/757619 (https://phabricator.wikimedia.org/T299624)
[10:07:10] <wikibugs>	 SRE, Cloud-Services, Datasets-General-or-Unknown, affects-Kiwix-and-openZIM: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (Kelson) @ArielGlenn Thank you for putting WMCS in the loop. In which timeline this refresh should happen? I guess nothing will be done a...
[10:07:56] <jynus>	 I guess puppet must run first
[10:08:00] <jynus>	 doing a manual run
[10:08:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19420 and previous config saved to /var/cache/conftool/dbconfig/20220127-100802-root.json
[10:08:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:41] <jynus>	 bacula can be down for now until everything else is checked as working
[10:08:57] <wikibugs>	 (PS1) Ladsgroup: maintain-views: Add linktarget [puppet] - https://gerrit.wikimedia.org/r/757622 (https://phabricator.wikimedia.org/T299416)
[10:09:00] <marostegui>	 jynus: from what I can see everything is fine apart from pki and cas that I don't know how to test
[10:09:03] <marostegui>	 jbond: moritzm ^
[10:09:17] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33471/console" [puppet] - https://gerrit.wikimedia.org/r/757618 (https://phabricator.wikimedia.org/T207292) (owner: Filippo Giunchedi)
[10:10:10] <wikibugs>	 (CR) Marostegui: [C: +2] db1159: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/757619 (https://phabricator.wikimedia.org/T299624) (owner: Marostegui)
[10:11:16] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1 C: +2] snmp_exporter: allow polling from all prometheus hosts [puppet] - https://gerrit.wikimedia.org/r/757618 (https://phabricator.wikimedia.org/T207292) (owner: Filippo Giunchedi)
[10:11:23] <jynus>	 as a side note, I believe db1159 stopped paging and db1128 started FYI marostegui
[10:11:40] <jynus>	 as in, the future, not now
[10:13:03] <jynus>	 all db backups checks on the active icinga instance point now to db1128 too
[10:14:34] <marostegui>	 jynus: yep, and that's good
[10:15:40] <jynus>	 dbprovs updated too after puppet run
[10:15:45] <jynus>	 will restart bacula now
[10:16:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P19421 and previous config saved to /var/cache/conftool/dbconfig/20220127-101631-marostegui.json
[10:16:32] <wikibugs>	 SRE, Cloud-Services, Datasets-General-or-Unknown, affects-Kiwix-and-openZIM: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (ArielGlenn) I don't know details myself but the relevant task is T286588
[10:16:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:46] <wikibugs>	 SRE, Observability-Metrics, Patch-For-Review, User-fgiunchedi: Review prometheus_nodes params - https://phabricator.wikimedia.org/T207292 (fgiunchedi) Open→Resolved a:fgiunchedi This is done (i.e. access from `prometheus_nodes` is implicit and no longer needed in individual profiles)...
[10:16:55] <jinxer-wm>	 (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org
[10:17:48] <jynus>	 !log Started Bacula Director Daemon service at backup1001 T299624
[10:17:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:53] <stashbot>	 T299624: Switchover m1 master (db1159 -> db1128) - https://phabricator.wikimedia.org/T299624
[10:20:43] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[10:20:45] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[10:20:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T285149)', diff saved to https://phabricator.wikimedia.org/P19422 and previous config saved to /var/cache/conftool/dbconfig/20220127-102049-marostegui.json
[10:20:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:54] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[10:21:55] <jinxer-wm>	 (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org
[10:26:06] <wikibugs>	 (CR) DCausse: sre.wdqs.data-reload: few fixes and cleanups (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/753426 (owner: DCausse)
[10:26:08] <wikibugs>	 (PS4) DCausse: sre.wdqs.data-reload: few fixes and cleanups [cookbooks] - https://gerrit.wikimedia.org/r/753426
[10:31:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P19423 and previous config saved to /var/cache/conftool/dbconfig/20220127-103136-marostegui.json
[10:31:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:01] <wikibugs>	 (CR) Vgutierrez: [C: +2] cache: Provide a text_envoy role [puppet] - https://gerrit.wikimedia.org/r/757415 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[10:32:35] <wikibugs>	 (PS4) Filippo Giunchedi: site: add Prometheus role to eqiad hardware [puppet] - https://gerrit.wikimedia.org/r/756604 (https://phabricator.wikimedia.org/T296199)
[10:32:37] <wikibugs>	 (PS1) Filippo Giunchedi: hieradata: swap prometheus2003 with prometheus2005 [puppet] - https://gerrit.wikimedia.org/r/757623 (https://phabricator.wikimedia.org/T296199)
[10:35:50] <Amir1>	 !log creating linktarget table everywhere (T299416)
[10:35:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:55] <stashbot>	 T299416: Normalize link tables: Create linktarget table - https://phabricator.wikimedia.org/T299416
[10:36:25] <wikibugs>	 (PS1) Marostegui: db1159: Move it to m2 [puppet] - https://gerrit.wikimedia.org/r/757626 (https://phabricator.wikimedia.org/T300243)
[10:37:52] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Create a separate puppetboard-idptest.wikimedia.org vhost in idp-staging [puppet] - https://gerrit.wikimedia.org/r/757450 (owner: Muehlenhoff)
[10:38:51] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1159.eqiad.wmnet with OS bullseye
[10:38:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:16] <wikibugs>	 (Abandoned) Ladsgroup: maintain-views: Add linktarget [puppet] - https://gerrit.wikimedia.org/r/757622 (https://phabricator.wikimedia.org/T299416) (owner: Ladsgroup)
[10:39:24] <wikibugs>	 (CR) Marostegui: [C: +2] db1159: Move it to m2 [puppet] - https://gerrit.wikimedia.org/r/757626 (https://phabricator.wikimedia.org/T300243) (owner: Marostegui)
[10:43:19] <wikibugs>	 (PS1) Vgutierrez: site: Reimage cp4031 as cache::text_envoy [puppet] - https://gerrit.wikimedia.org/r/757627 (https://phabricator.wikimedia.org/T271421)
[10:46:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T285149)', diff saved to https://phabricator.wikimedia.org/P19424 and previous config saved to /var/cache/conftool/dbconfig/20220127-104618-marostegui.json
[10:46:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:23] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[10:46:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298559)', diff saved to https://phabricator.wikimedia.org/P19425 and previous config saved to /var/cache/conftool/dbconfig/20220127-104641-marostegui.json
[10:46:43] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[10:46:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:45] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[10:46:46] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[10:46:46] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[10:46:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:50] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[10:46:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T298559)', diff saved to https://phabricator.wikimedia.org/P19426 and previous config saved to /var/cache/conftool/dbconfig/20220127-104654-marostegui.json
[10:46:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:36] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es[2024-2025].codfw.wmnet with reason: Reimage of the master T300006
[10:47:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es[2024-2025].codfw.wmnet with reason: Reimage of the master T300006
[10:47:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:41] <stashbot>	 T300006: Upgrade es5 to Bullseye - https://phabricator.wikimedia.org/T300006
[10:47:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2023.codfw.wmnet with reason: Maintenance
[10:48:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2023.codfw.wmnet with reason: Maintenance
[10:48:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:17] <wikibugs>	 (PS1) Ssingh: acme_chief: authorize doh600* hosts for Wikidough [puppet] - https://gerrit.wikimedia.org/r/757628 (https://phabricator.wikimedia.org/T300156)
[10:50:02] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33472/console" [puppet] - https://gerrit.wikimedia.org/r/757628 (https://phabricator.wikimedia.org/T300156) (owner: Ssingh)
[10:50:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host es2023.codfw.wmnet with OS bullseye
[10:50:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1165 T299479', diff saved to https://phabricator.wikimedia.org/P19427 and previous config saved to /var/cache/conftool/dbconfig/20220127-105223-marostegui.json
[10:52:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:28] <stashbot>	 T299479: Upgrade s6 to Bullseye - https://phabricator.wikimedia.org/T299479
[10:52:45] <Lucas_WMDE>	 jouncebot: now
[10:52:46] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 7 minute(s)
[10:52:49] <Lucas_WMDE>	 jouncebot: next
[10:52:49] <jouncebot>	 In 0 hour(s) and 7 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220127T1100)
[10:53:02] <Lucas_WMDE>	 I’ll test something on mwdebug1001 for a bit
[10:54:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298559)', diff saved to https://phabricator.wikimedia.org/P19428 and previous config saved to /var/cache/conftool/dbconfig/20220127-105408-marostegui.json
[10:54:10] <wikibugs>	 (PS1) Marostegui: db1165: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/757629 (https://phabricator.wikimedia.org/T299479)
[10:54:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:13] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[10:54:54] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1165.eqiad.wmnet with OS bullseye
[10:54:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:05] <wikibugs>	 (CR) Marostegui: [C: +2] db1165: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/757629 (https://phabricator.wikimedia.org/T299479) (owner: Marostegui)
[10:55:18] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM overall! Thank you for kickstarting this" [alerts] - https://gerrit.wikimedia.org/r/757489 (https://phabricator.wikimedia.org/T294564) (owner: Volans)
[10:56:26] <logmsgbot>	 !log sukhe@cumin1001 START - Cookbook sre.ganeti.makevm for new host doh6001.wikimedia.org
[10:56:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:33] <wikibugs>	 (PS1) Muehlenhoff: Update service entry in idp-test for Puppetboard [puppet] - https://gerrit.wikimedia.org/r/757630
[10:57:40] <wikibugs>	 (CR) Filippo Giunchedi: [C: -1] "What Cole said, and find won't recursively delete directories, so deleting files first is required and then empty directories" [puppet] - https://gerrit.wikimedia.org/r/757498 (https://phabricator.wikimedia.org/T300056) (owner: Herron)
[10:58:55] <wikibugs>	 SRE, Traffic: Remove old and unused libvarnishapi - https://phabricator.wikimedia.org/T300247 (MMandere)
[11:00:05] <jouncebot>	 mvolz: That opportune time is upon us again. Time for a Services – Citoid / Zotero deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220127T1100).
[11:00:44] <Lucas_WMDE>	 (I’m done on mwdebug1001 again)
[11:01:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P19429 and previous config saved to /var/cache/conftool/dbconfig/20220127-110123-marostegui.json
[11:01:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:58] <wikibugs>	 (CR) Muehlenhoff: Update service entry in idp-test for Puppetboard (1 comment) [puppet] - https://gerrit.wikimedia.org/r/757630 (owner: Muehlenhoff)
[11:07:07] <logmsgbot>	 !log sukhe@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh6001.wikimedia.org
[11:07:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:58] <wikibugs>	 (CR) Vgutierrez: [C: +1] acme_chief: authorize doh600* hosts for Wikidough [puppet] - https://gerrit.wikimedia.org/r/757628 (https://phabricator.wikimedia.org/T300156) (owner: Ssingh)
[11:09:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P19431 and previous config saved to /var/cache/conftool/dbconfig/20220127-110913-marostegui.json
[11:09:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:40] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1159.eqiad.wmnet with OS bullseye
[11:09:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:56] <wikibugs>	 (PS1) JMeybohm: kubernetes::master: Remove expose_puppet_certs parameter [puppet] - https://gerrit.wikimedia.org/r/757631 (https://phabricator.wikimedia.org/T290967)
[11:10:12] <wikibugs>	 (PS1) Muehlenhoff: Update hook to point to 6.8.23 packages [puppet] - https://gerrit.wikimedia.org/r/757632
[11:10:50] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:10:55] <marostegui>	 ^ expected
[11:11:04] <wikibugs>	 SRE, Math, Wikimedia-Mailing-lists: New mailing list for Wikimedia community group math - https://phabricator.wikimedia.org/T300239 (Physikerwelt) >>! In T300239#7655671, @Ladsgroup wrote: >  - Do you want a public mailing list with archives? Just to confirm Yes! This is most crucial. Similar to the...
[11:11:59] <vgutierrez>	 !log depool cp4031 to be reimaged as cache::text_envoy - T271421
[11:12:00] <wikibugs>	 (PS1) Ssingh: install_server: add MAC address of doh6001 [puppet] - https://gerrit.wikimedia.org/r/757633 (https://phabricator.wikimedia.org/T283192)
[11:12:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:04] <stashbot>	 T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421
[11:12:13] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[11:13:21] <wikibugs>	 (PS2) Ssingh: install_server: add MAC address of doh6001 [puppet] - https://gerrit.wikimedia.org/r/757633 (https://phabricator.wikimedia.org/T300156)
[11:13:39] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:14:19] <wikibugs>	 SRE, Traffic: Remove old and unused libvarnishapi - https://phabricator.wikimedia.org/T300247 (elukey) Thanks a lot for working on this. Can we also remove the old lib from apt and puppet?  ` root@apt1001:/srv/wikimedia# reprepro lsbycomponent libvarnishapi1 libvarnishapi1 | 5.1.3-1wm11 | stretch-wikimed...
[11:14:57] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:14:57] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:15:05] <wikibugs>	 (CR) Ssingh: [C: +2] install_server: add MAC address of doh6001 [puppet] - https://gerrit.wikimedia.org/r/757633 (https://phabricator.wikimedia.org/T300156) (owner: Ssingh)
[11:15:17] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:16:03] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[11:16:03] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[11:16:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P19432 and previous config saved to /var/cache/conftool/dbconfig/20220127-111628-marostegui.json
[11:16:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:33] <wikibugs>	 (PS1) Muehlenhoff: Make ganeti1028 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/757634
[11:18:59] <wikibugs>	 (PS1) Marostegui: Revert "db1165: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/757478
[11:19:09] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:19:31] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:19:31] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:20:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 1%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19433 and previous config saved to /var/cache/conftool/dbconfig/20220127-112057-root.json
[11:21:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P19434 and previous config saved to /var/cache/conftool/dbconfig/20220127-112418-marostegui.json
[11:24:20] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:24:20] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1165.eqiad.wmnet with OS bullseye
[11:24:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:34] <wikibugs>	 (CR) Vgutierrez: [C: +2] site: Reimage cp4031 as cache::text_envoy [puppet] - https://gerrit.wikimedia.org/r/757627 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[11:24:37] <wikibugs>	 (PS2) Muehlenhoff: Make ganeti1028 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/757634
[11:24:47] <wikibugs>	 (PS2) Vgutierrez: site: Reimage cp4031 as cache::text_envoy [puppet] - https://gerrit.wikimedia.org/r/757627 (https://phabricator.wikimedia.org/T271421)
[11:25:22] <wikibugs>	 (PS1) Ssingh: Add Wikidough's /24 to bgp_out in drmrs [homer/public] - https://gerrit.wikimedia.org/r/757635
[11:28:55] <wikibugs>	 (PS1) Ssingh: site: add role for doh6001 (Wikidough drmrs) [puppet] - https://gerrit.wikimedia.org/r/757636 (https://phabricator.wikimedia.org/T300158)
[11:29:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2023.codfw.wmnet with OS bullseye
[11:29:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:24] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp4031.ulsfo.wmnet with OS buster
[11:29:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:33] <wikibugs>	 SRE, Traffic, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp4031.ulsfo.wmnet with OS buster
[11:29:54] <wikibugs>	 (PS2) Ssingh: site: add role for doh6001 (Wikidough drmrs) [puppet] - https://gerrit.wikimedia.org/r/757636 (https://phabricator.wikimedia.org/T300156)
[11:31:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T285149)', diff saved to https://phabricator.wikimedia.org/P19435 and previous config saved to /var/cache/conftool/dbconfig/20220127-113132-marostegui.json
[11:31:34] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[11:31:36] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[11:31:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:38] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[11:31:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T285149)', diff saved to https://phabricator.wikimedia.org/P19436 and previous config saved to /var/cache/conftool/dbconfig/20220127-113140-marostegui.json
[11:31:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:54] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: automated-tests: introduce check to verify default grid release [puppet] - https://gerrit.wikimedia.org/r/757499 (https://phabricator.wikimedia.org/T277653) (owner: Arturo Borrero Gonzalez)
[11:34:17] <wikibugs>	 (PS3) Muehlenhoff: Make ganeti1028 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/757634
[11:34:40] <wikibugs>	 SRE, Traffic: Remove old and unused libvarnishapi - https://phabricator.wikimedia.org/T300247 (MMandere) @elukey, thanks for pointing that out. Yes we can do that, considering `libvarnishapi1` was only required when we were using `varnish 5.1.x`  of which we no longer use in production, it is safe to hav...
[11:36:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 5%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19437 and previous config saved to /var/cache/conftool/dbconfig/20220127-113600-root.json
[11:36:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298559)', diff saved to https://phabricator.wikimedia.org/P19438 and previous config saved to /var/cache/conftool/dbconfig/20220127-113924-marostegui.json
[11:39:25] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[11:39:27] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[11:39:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:29] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[11:39:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T298559)', diff saved to https://phabricator.wikimedia.org/P19439 and previous config saved to /var/cache/conftool/dbconfig/20220127-113931-marostegui.json
[11:39:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298559)', diff saved to https://phabricator.wikimedia.org/P19440 and previous config saved to /var/cache/conftool/dbconfig/20220127-114044-marostegui.json
[11:40:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:08] <wikibugs>	 (PS3) Ssingh: site: add role for doh6001 (Wikidough drmrs) [puppet] - https://gerrit.wikimedia.org/r/757636 (https://phabricator.wikimedia.org/T300156)
[11:43:17] <wikibugs>	 (CR) MMandere: [C: +2] site: add role for doh6001 (Wikidough drmrs) [puppet] - https://gerrit.wikimedia.org/r/757636 (https://phabricator.wikimedia.org/T300156) (owner: Ssingh)
[11:44:32] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:45:48] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[11:45:50] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[11:46:48] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[11:47:24] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[11:50:29] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:50:32] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:50:54] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Make ganeti1028 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/757634 (owner: Muehlenhoff)
[11:51:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19441 and previous config saved to /var/cache/conftool/dbconfig/20220127-115105-root.json
[11:51:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:34] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[11:52:36] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[11:53:25] <wikibugs>	 SRE, SRE-Access-Requests, Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (AniketArs) ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIGk3CgXqD8AxkboJ22zxWQ1CYDhaRuSgiV2A32G+Z9SL aniket@ars
[11:55:14] <wikibugs>	 (CR) Ayounsi: [C: +1] "LGTM, note that this won't be effective for now as we don't have the routers yet." [homer/public] - https://gerrit.wikimedia.org/r/757635 (owner: Ssingh)
[11:55:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P19442 and previous config saved to /var/cache/conftool/dbconfig/20220127-115548-marostegui.json
[11:55:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:14] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:57:18] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:57:29] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1165: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/757478 (owner: Marostegui)
[12:00:05] <jouncebot>	 Amir1, Lucas_WMDE, and apergos: #bothumor I ♥ Unicode. All rise for UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220127T1200).
[12:00:30] <Lucas_WMDE>	 looks like there’s nothing to deploy and nobody to train this time
[12:00:33] <apergos>	 no patches in the window, no trainees signed up
[12:00:38] <Lucas_WMDE>	 ok
[12:00:57] <apergos>	 see you next time  :-D
[12:01:02] <Lucas_WMDE>	 ^^
[12:01:06] * Lucas_WMDE goes for lunch
[12:01:15] <apergos>	 going to make soup!
[12:01:27] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init
[12:01:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:56] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:03:00] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:03:22] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:04:35] <wikibugs>	 (PS1) Ladsgroup: es1023: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/757638 (https://phabricator.wikimedia.org/T300006)
[12:05:25] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] es1023: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/757638 (https://phabricator.wikimedia.org/T300006) (owner: Ladsgroup)
[12:06:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 20%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19443 and previous config saved to /var/cache/conftool/dbconfig/20220127-120608-root.json
[12:06:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1023.eqiad.wmnet with reason: Maintenance
[12:06:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1023.eqiad.wmnet with reason: Maintenance
[12:06:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1023 (T300006)', diff saved to https://phabricator.wikimedia.org/P19444 and previous config saved to /var/cache/conftool/dbconfig/20220127-120648-ladsgroup.json
[12:06:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:53] <stashbot>	 T300006: Upgrade es5 to Bullseye - https://phabricator.wikimedia.org/T300006
[12:09:28] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4031.ulsfo.wmnet with OS buster
[12:09:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:37] <wikibugs>	 SRE, Traffic, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp4031.ulsfo.wmnet with OS buster completed: - cp4031 (**WARN*...
[12:10:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P19445 and previous config saved to /var/cache/conftool/dbconfig/20220127-121053-marostegui.json
[12:10:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:00] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:18:04] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:18:10] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:18:34] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:18:50] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:18:50] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:20:20] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:20:24] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:20:32] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:20:58] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:21:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19446 and previous config saved to /var/cache/conftool/dbconfig/20220127-122113-root.json
[12:21:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19447 and previous config saved to /var/cache/conftool/dbconfig/20220127-122157-root.json
[12:22:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:13] <wikibugs>	 (PS1) Michael DiPietro: upgrade codfw1dev to bullseye [puppet] - https://gerrit.wikimedia.org/r/757647 (https://phabricator.wikimedia.org/T300254)
[12:25:33] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.decommission for hosts restbase2011.codfw.wmnet
[12:25:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298559)', diff saved to https://phabricator.wikimedia.org/P19448 and previous config saved to /var/cache/conftool/dbconfig/20220127-122558-marostegui.json
[12:26:00] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:26:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:03] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[12:26:06] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts restbase2011.codfw.wmnet
[12:26:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:46] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:29:04] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:29:18] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:29:18] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:31:03] <wikibugs>	 (PS1) Hnowlan: restbase: remove restbase2011 [puppet] - https://gerrit.wikimedia.org/r/757648 (https://phabricator.wikimedia.org/T299928)
[12:34:09] <wikibugs>	 (PS14) Jbond: RepoSync: add new class to mana syncing repositories [software/spicerack] - https://gerrit.wikimedia.org/r/747116
[12:36:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19449 and previous config saved to /var/cache/conftool/dbconfig/20220127-123617-root.json
[12:36:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19450 and previous config saved to /var/cache/conftool/dbconfig/20220127-123701-root.json
[12:37:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:50] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:44:20] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 237, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:45:22] <wikibugs>	 (CR) Majavah: [C: -1] upgrade codfw1dev to bullseye (2 comments) [puppet] - https://gerrit.wikimedia.org/r/757647 (https://phabricator.wikimedia.org/T300254) (owner: Michael DiPietro)
[12:50:03] <wikibugs>	 (PS14) D3r1ck01: Define a contact form for Chapter/Thorg application status [mediawiki-config] - https://gerrit.wikimedia.org/r/748120 (https://phabricator.wikimedia.org/T298024)
[12:51:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19451 and previous config saved to /var/cache/conftool/dbconfig/20220127-125120-root.json
[12:51:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:25] <wikibugs>	 (CR) Zabe: [C: +1] Do not set wgTrustedXffFile [mediawiki-config] - https://gerrit.wikimedia.org/r/749734 (https://phabricator.wikimedia.org/T298243) (owner: Urbanecm)
[12:52:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19452 and previous config saved to /var/cache/conftool/dbconfig/20220127-125205-root.json
[12:52:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:54] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: wmcs: toolforge: grid: get_cluster_status: output in yaml format [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/757650
[12:55:28] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[12:55:30] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[12:55:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:32] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[12:55:34] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[12:55:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T298559)', diff saved to https://phabricator.wikimedia.org/P19453 and previous config saved to /var/cache/conftool/dbconfig/20220127-125538-marostegui.json
[12:55:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:43] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[12:56:02] <wikibugs>	 (PS2) Filippo Giunchedi: service catalog: introduce 'page' field [puppet] - https://gerrit.wikimedia.org/r/757447 (https://phabricator.wikimedia.org/T291946)
[12:56:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298559)', diff saved to https://phabricator.wikimedia.org/P19454 and previous config saved to /var/cache/conftool/dbconfig/20220127-125644-marostegui.json
[12:56:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:42] <wikibugs>	 (PS3) Filippo Giunchedi: service catalog: introduce 'page' field [puppet] - https://gerrit.wikimedia.org/r/757447 (https://phabricator.wikimedia.org/T291946)
[13:00:12] <icinga-wm>	 RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:00:43] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: toolforge: grid: get_cluster_status: output in yaml format [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/757650 (owner: Arturo Borrero Gonzalez)
[13:02:42] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33474/console" [puppet] - https://gerrit.wikimedia.org/r/757447 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[13:04:21] <wikibugs>	 (PS4) Filippo Giunchedi: service catalog: introduce 'page' field [puppet] - https://gerrit.wikimedia.org/r/757447 (https://phabricator.wikimedia.org/T291946)
[13:05:09] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[13:05:35] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[13:06:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19455 and previous config saved to /var/cache/conftool/dbconfig/20220127-130624-root.json
[13:06:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:57] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 669561024048 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:07:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19456 and previous config saved to /var/cache/conftool/dbconfig/20220127-130708-root.json
[13:07:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:37] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33475/console" [puppet] - https://gerrit.wikimedia.org/r/757447 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[13:09:55] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:10:01] <wikibugs>	 (PS15) Jbond: RepoSync: add new class to mana syncing repositories [software/spicerack] - https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397)
[13:10:33] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[13:11:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P19457 and previous config saved to /var/cache/conftool/dbconfig/20220127-131148-marostegui.json
[13:11:51] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[13:11:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:57] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 670433700048 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:16:41] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[13:19:51] <wikibugs>	 (CR) JMeybohm: [C: +2] Upgrade codfw kubernetes masters to tainted full nodes [puppet] - https://gerrit.wikimedia.org/r/757434 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[13:20:04] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33477/console" [puppet] - https://gerrit.wikimedia.org/r/757631 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[13:20:36] <wikibugs>	 (CR) Elukey: [V: +1 C: +1] kubernetes::master: Remove expose_puppet_certs parameter [puppet] - https://gerrit.wikimedia.org/r/757631 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[13:21:05] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:21:09] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[13:21:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19458 and previous config saved to /var/cache/conftool/dbconfig/20220127-132128-root.json
[13:21:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19459 and previous config saved to /var/cache/conftool/dbconfig/20220127-132212-root.json
[13:22:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:11] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[13:23:12] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[13:26:13] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[13:26:15] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[13:26:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:16] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[13:26:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:20] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[13:26:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T285149)', diff saved to https://phabricator.wikimedia.org/P19460 and previous config saved to /var/cache/conftool/dbconfig/20220127-132624-marostegui.json
[13:26:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:29] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[13:26:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P19461 and previous config saved to /var/cache/conftool/dbconfig/20220127-132653-marostegui.json
[13:26:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:51] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[13:28:05] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[13:28:38] <wikibugs>	 (PS2) JMeybohm: Add k8s masters in codfw eBGP config [homer/public] - https://gerrit.wikimedia.org/r/757437 (https://phabricator.wikimedia.org/T290967)
[13:28:40] <wikibugs>	 (PS2) JMeybohm: Add k8s masters in eqiad eBGP config [homer/public] - https://gerrit.wikimedia.org/r/757438 (https://phabricator.wikimedia.org/T290967)
[13:29:08] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster2002.codfw.wmnet
[13:29:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:56] <wikibugs>	 (CR) JMeybohm: Add k8s masters in codfw eBGP config (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/757437 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[13:30:27] <wikibugs>	 (CR) JMeybohm: [C: +2] Add k8s masters in codfw eBGP config [homer/public] - https://gerrit.wikimedia.org/r/757437 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[13:31:00] <wikibugs>	 (Merged) jenkins-bot: Add k8s masters in codfw eBGP config [homer/public] - https://gerrit.wikimedia.org/r/757437 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[13:32:02] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.provision for host es1023.mgmt.eqiad.wmnet with reboot policy GRACEFUL
[13:32:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:14] <wikibugs>	 (PS2) Muehlenhoff: Update hook to point to 6.8.23 packages [puppet] - https://gerrit.wikimedia.org/r/757632
[13:34:53] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[13:35:25] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Update hook to point to 6.8.23 packages [puppet] - https://gerrit.wikimedia.org/r/757632 (owner: Muehlenhoff)
[13:36:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19462 and previous config saved to /var/cache/conftool/dbconfig/20220127-133631-root.json
[13:36:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T285149)', diff saved to https://phabricator.wikimedia.org/P19463 and previous config saved to /var/cache/conftool/dbconfig/20220127-133652-marostegui.json
[13:36:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:57] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[13:37:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19464 and previous config saved to /var/cache/conftool/dbconfig/20220127-133715-root.json
[13:37:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:14] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster2002.codfw.wmnet
[13:38:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:03] <wikibugs>	 (CR) Elukey: [V: +1 C: +2] P:rsyslog::kafka_shipper: move Kafka TLS CA settings to the new bundle [puppet] - https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[13:41:51] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[13:41:52] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[13:41:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298559)', diff saved to https://phabricator.wikimedia.org/P19465 and previous config saved to /var/cache/conftool/dbconfig/20220127-134158-marostegui.json
[13:42:00] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[13:42:01] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[13:42:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:03] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[13:42:03] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[13:42:05] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[13:42:05] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[13:42:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T298559)', diff saved to https://phabricator.wikimedia.org/P19466 and previous config saved to /var/cache/conftool/dbconfig/20220127-134209-marostegui.json
[13:42:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298559)', diff saved to https://phabricator.wikimedia.org/P19467 and previous config saved to /var/cache/conftool/dbconfig/20220127-134315-marostegui.json
[13:43:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:21] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1023.mgmt.eqiad.wmnet with reboot policy GRACEFUL
[13:43:22] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster2001.codfw.wmnet
[13:43:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:36] <wikibugs>	 (PS2) JMeybohm: Upgrade eqiad kubernetes masters to tainted full nodes [puppet] - https://gerrit.wikimedia.org/r/757615 (https://phabricator.wikimedia.org/T290967)
[13:44:54] <wikibugs>	 (PS2) JMeybohm: kubernetes::master: Remove expose_puppet_certs parameter [puppet] - https://gerrit.wikimedia.org/r/757631 (https://phabricator.wikimedia.org/T290967)
[13:45:21] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[13:45:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host es1023.eqiad.wmnet with OS bullseye
[13:45:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:37] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:45:53] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[13:46:07] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[13:46:28] <wikibugs>	 (CR) Ssingh: [V: +1 C: +2] acme_chief: authorize doh600* hosts for Wikidough [puppet] - https://gerrit.wikimedia.org/r/757628 (https://phabricator.wikimedia.org/T300156) (owner: Ssingh)
[13:46:33] <moritzm>	 !log imported elasticsearch-oss/kibana-oss/logstash-oss 6.8.23 to thirdparty/elastic68 for stretch and bullseye
[13:46:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:37] <icinga-wm>	 PROBLEM - Disk space on kubemaster2002 is CRITICAL: DISK CRITICAL - /run/docker/netns/default is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubemaster2002&var-datasource=codfw+prometheus/ops
[13:47:16] <jayme>	 that's me
[13:47:29] <jayme>	 BGP & disk space stuff
[13:47:48] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (NOOP 8 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33478/console" [puppet] - https://gerrit.wikimedia.org/r/757615 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[13:48:00] <wikibugs>	 (PS4) Ssingh: site: add role for doh6001 (Wikidough drmrs) [puppet] - https://gerrit.wikimedia.org/r/757636 (https://phabricator.wikimedia.org/T300156)
[13:48:03] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:48:40] <wikibugs>	 (PS7) Jbond: cookbook sre.puppet.netbox: Cookbook for syncing netbox puppet data [cookbooks] - https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397)
[13:50:19] <wikibugs>	 (Abandoned) Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) (owner: Jbond)
[13:50:30] <wikibugs>	 (PS1) Muehlenhoff: Add mapping for new 6.8 component [puppet] - https://gerrit.wikimedia.org/r/757656
[13:51:12] <wikibugs>	 (CR) DCausse: [C: +1] Add mapping for new 6.8 component [puppet] - https://gerrit.wikimedia.org/r/757656 (owner: Muehlenhoff)
[13:51:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P19468 and previous config saved to /var/cache/conftool/dbconfig/20220127-135157-marostegui.json
[13:52:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1028.eqiad.wmnet
[13:52:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:28] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster2001.codfw.wmnet
[13:52:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:34] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add mapping for new 6.8 component [puppet] - https://gerrit.wikimedia.org/r/757656 (owner: Muehlenhoff)
[13:53:48] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 80, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:53:56] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[13:54:34] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 113, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:55:27] <wikibugs>	 (PS1) Ssingh: hieradata: add drmrs to Wikidough and durum sites [puppet] - https://gerrit.wikimedia.org/r/757657 (https://phabricator.wikimedia.org/T300156)
[13:56:06] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[13:56:28] <wikibugs>	 (CR) Ssingh: [C: +2] hieradata: add drmrs to Wikidough and durum sites [puppet] - https://gerrit.wikimedia.org/r/757657 (https://phabricator.wikimedia.org/T300156) (owner: Ssingh)
[13:56:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1028.eqiad.wmnet
[13:56:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P19469 and previous config saved to /var/cache/conftool/dbconfig/20220127-135820-marostegui.json
[13:58:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:28] <icinga-wm>	 PROBLEM - Disk space on kubemaster2001 is CRITICAL: DISK CRITICAL - /run/docker/netns/default is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubemaster2001&var-datasource=codfw+prometheus/ops
[14:01:47] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove now obsolete template [puppet] - https://gerrit.wikimedia.org/r/757616 (owner: Muehlenhoff)
[14:03:16] <wikibugs>	 (PS3) JMeybohm: Upgrade eqiad kubernetes masters to tainted full nodes [puppet] - https://gerrit.wikimedia.org/r/757615 (https://phabricator.wikimedia.org/T290967)
[14:03:19] <wikibugs>	 (PS3) JMeybohm: kubernetes::master: Remove expose_puppet_certs parameter [puppet] - https://gerrit.wikimedia.org/r/757631 (https://phabricator.wikimedia.org/T290967)
[14:03:20] <wikibugs>	 (PS1) JMeybohm: Fix nrpe_check_disk_options hiera key for kubernetes masters [puppet] - https://gerrit.wikimedia.org/r/757658 (https://phabricator.wikimedia.org/T290967)
[14:03:47] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Update termbox to 2022-01-25-175409-production [deployment-charts] - https://gerrit.wikimedia.org/r/757659 (https://phabricator.wikimedia.org/T296202)
[14:05:14] <moritzm>	 !log installing apache security updates
[14:05:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:06] <wikibugs>	 (CR) Jbond: "ready for review" [cookbooks] - https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: Jbond)
[14:06:27] <wikibugs>	 (PS16) Jbond: RepoSync: add new class to mana syncing repositories [software/spicerack] - https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397)
[14:06:57] <wikibugs>	 (CR) JMeybohm: [C: +2] Fix nrpe_check_disk_options hiera key for kubernetes masters [puppet] - https://gerrit.wikimedia.org/r/757658 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[14:07:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P19470 and previous config saved to /var/cache/conftool/dbconfig/20220127-140702-marostegui.json
[14:07:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:17] <icinga-wm>	 RECOVERY - Disk space on kubemaster2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubemaster2001&var-datasource=codfw+prometheus/ops
[14:09:20] <icinga-wm>	 RECOVERY - Disk space on kubemaster2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubemaster2002&var-datasource=codfw+prometheus/ops
[14:09:38] <wikibugs>	 (PS1) Elukey: mediawiki::logging::yaml_defs: use wmf-certificates' bundle as CA cert [puppet] - https://gerrit.wikimedia.org/r/757661 (https://phabricator.wikimedia.org/T300130)
[14:10:38] <wikibugs>	 (CR) Elukey: "My understanding is that the class is used only to populate helmfile defaults, lemme know if this is not the case :)" [puppet] - https://gerrit.wikimedia.org/r/757661 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[14:13:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P19471 and previous config saved to /var/cache/conftool/dbconfig/20220127-141324-marostegui.json
[14:13:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1023.eqiad.wmnet with OS bullseye
[14:16:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:26] <icinga-wm>	 PROBLEM - Check systemd state on doh6001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:19:52] <wikibugs>	 (CR) Bking: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/757046 (https://phabricator.wikimedia.org/T295666) (owner: DCausse)
[14:22:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T285149)', diff saved to https://phabricator.wikimedia.org/P19473 and previous config saved to /var/cache/conftool/dbconfig/20220127-142206-marostegui.json
[14:22:08] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[14:22:10] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[14:22:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:12] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[14:22:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T285149)', diff saved to https://phabricator.wikimedia.org/P19474 and previous config saved to /var/cache/conftool/dbconfig/20220127-142214-marostegui.json
[14:22:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1023 (T300006)', diff saved to https://phabricator.wikimedia.org/P19475 and previous config saved to /var/cache/conftool/dbconfig/20220127-142517-ladsgroup.json
[14:25:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:22] <stashbot>	 T300006: Upgrade es5 to Bullseye - https://phabricator.wikimedia.org/T300006
[14:25:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1028.eqiad.wmnet to ganeti01.svc.eqiad.wmnet
[14:25:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298559)', diff saved to https://phabricator.wikimedia.org/P19476 and previous config saved to /var/cache/conftool/dbconfig/20220127-142829-marostegui.json
[14:28:31] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[14:28:33] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[14:28:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:34] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[14:28:35] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[14:28:36] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[14:28:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T298559)', diff saved to https://phabricator.wikimedia.org/P19477 and previous config saved to /var/cache/conftool/dbconfig/20220127-142841-marostegui.json
[14:28:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298559)', diff saved to https://phabricator.wikimedia.org/P19478 and previous config saved to /var/cache/conftool/dbconfig/20220127-143147-marostegui.json
[14:31:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:18] <wikibugs>	 SRE, Math, Wikimedia-Mailing-lists, User-Ladsgroup: New mailing list for Wikimedia community group math - https://phabricator.wikimedia.org/T300239 (Ladsgroup) Open→Resolved a:Ladsgroup Done. Create an account and go to https://lists.wikimedia.org/postorius/lists/math.lists.wikimedia....
[14:38:27] <wikibugs>	 (PS1) Kormat: wmfdb/db: Improve error reporting. [software/wmfdb] - https://gerrit.wikimedia.org/r/757666
[14:39:29] <moritzm>	 !log added ganeti1028 to Ganeti eqiad cluster T293909
[14:39:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:34] <stashbot>	 T293909: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909
[14:39:52] <wikibugs>	 (PS2) Kormat: wmfdb/db: Improve error reporting. [software/wmfdb] - https://gerrit.wikimedia.org/r/757666
[14:39:56] <wikibugs>	 (CR) jerkins-bot: [V: -1] wmfdb/db: Improve error reporting. [software/wmfdb] - https://gerrit.wikimedia.org/r/757666 (owner: Kormat)
[14:40:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1023', diff saved to https://phabricator.wikimedia.org/P19479 and previous config saved to /var/cache/conftool/dbconfig/20220127-144022-ladsgroup.json
[14:40:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:51] <wikibugs>	 (PS3) Kormat: wmfdb/db: Improve error reporting. [software/wmfdb] - https://gerrit.wikimedia.org/r/757666
[14:40:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1028.eqiad.wmnet to ganeti01.svc.eqiad.wmnet
[14:40:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:46] <icinga-wm>	 RECOVERY - Check systemd state on doh6001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:46:32] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.ganeti.makevm for new host doh6002.wikimedia.org
[14:46:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P19480 and previous config saved to /var/cache/conftool/dbconfig/20220127-144652-marostegui.json
[14:46:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:24] <wikibugs>	 (CR) Ladsgroup: sre.mysql.upgrade: various improvements (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/754872 (https://phabricator.wikimedia.org/T239814) (owner: Volans)
[14:49:49] <wikibugs>	 (PS4) Herron: centrallog: clean up old /srv/syslog/host directories after grace period [puppet] - https://gerrit.wikimedia.org/r/757498 (https://phabricator.wikimedia.org/T300056)
[14:50:09] <wikibugs>	 (PS1) DCausse: eventgate-main: update image to 2022-01-27-143826-production [deployment-charts] - https://gerrit.wikimedia.org/r/757667 (https://phabricator.wikimedia.org/T279541)
[14:52:09] <wikibugs>	 (CR) Ottomata: [V: +2 C: +2] eventgate-main: update image to 2022-01-27-143826-production [deployment-charts] - https://gerrit.wikimedia.org/r/757667 (https://phabricator.wikimedia.org/T279541) (owner: DCausse)
[14:52:39] <elukey>	 herron: o/ as FYI I moved all rsyslog-kafka clients to the new ca bundle, everything looks good afaics, but lemme know if you see anything weird
[14:53:00] <herron>	 elukey: great! thx for the heads up
[14:53:45] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: NRodriguez uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T299336 (jhathaway) @NRodriguez would you kindly send your ssh public key via google chat, or via phabricator with the Add Action Sign with MFA option when you po...
[14:54:15] <ottomata>	 !log continuing deployments of eventgate-main and eventgate-analytics to pick up CA cert changes - T296064 (also deploying eventgate-main for a schema repo bump for search)
[14:54:16] <wikibugs>	 SRE, SRE-Access-Requests: Requesting update to SSH key and Kerberos for Joseph Seddon - https://phabricator.wikimedia.org/T299988 (jhathaway) @Seddon thanks, I had missed that
[14:54:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:20] <stashbot>	 T296064: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064
[14:54:42] <wikibugs>	 SRE, SRE-Access-Requests: Requesting update to SSH key and Kerberos for Joseph Seddon - https://phabricator.wikimedia.org/T299988 (jhathaway)
[14:55:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1023', diff saved to https://phabricator.wikimedia.org/P19481 and previous config saved to /var/cache/conftool/dbconfig/20220127-145527-ladsgroup.json
[14:55:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:02] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply on production
[14:57:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:05] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply on canary
[14:57:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:33] <dcausse>	 ottomata: ^
[14:57:39] <wikibugs>	 (CR) Herron: centrallog: clean up old /srv/syslog/host directories after grace period (3 comments) [puppet] - https://gerrit.wikimedia.org/r/757498 (https://phabricator.wikimedia.org/T300056) (owner: Herron)
[14:57:59] <dcausse>	 hmm the bot said done but helm was still asking me to confirm
[14:58:10] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync on production
[14:58:11] <ottomata>	 hm
[14:58:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:56] <dcausse>	 oh there's this "canary" thing
[14:59:20] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh6002.wikimedia.org
[14:59:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P19482 and previous config saved to /var/cache/conftool/dbconfig/20220127-150156-marostegui.json
[15:02:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:04] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply on production
[15:03:04] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply on canary
[15:03:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:01] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync on canary
[15:04:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:27] <wikibugs>	 (PS1) Btullis: Launch the script with a given process name [puppet] - https://gerrit.wikimedia.org/r/757668 (https://phabricator.wikimedia.org/T295733)
[15:04:31] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync on production
[15:04:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:09] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 672076262960 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[15:05:20] <wikibugs>	 (Abandoned) Elukey: helmfile.d: add the istio pod security policy [deployment-charts] - https://gerrit.wikimedia.org/r/746880 (https://phabricator.wikimedia.org/T297612) (owner: Elukey)
[15:06:27] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33479/console" [puppet] - https://gerrit.wikimedia.org/r/757668 (https://phabricator.wikimedia.org/T295733) (owner: Btullis)
[15:07:02] <wikibugs>	 (PS1) MMandere: install_server: Add drmrs doh second instance [puppet] - https://gerrit.wikimedia.org/r/757670 (https://phabricator.wikimedia.org/T300156)
[15:07:13] <wikibugs>	 (CR) Michael DiPietro: upgrade codfw1dev to bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/757647 (https://phabricator.wikimedia.org/T300254) (owner: Michael DiPietro)
[15:07:15] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply on canary
[15:07:15] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply on production
[15:07:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:15] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync on canary
[15:08:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:35] <wikibugs>	 (PS5) Jbond: O:mail::mx: Add mx specific block list [puppet] - https://gerrit.wikimedia.org/r/757517 (https://phabricator.wikimedia.org/T270618)
[15:09:41] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync on production
[15:09:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:04] <wikibugs>	 (CR) Ssingh: [C: +1] install_server: Add drmrs doh second instance [puppet] - https://gerrit.wikimedia.org/r/757670 (https://phabricator.wikimedia.org/T300156) (owner: MMandere)
[15:10:15] <dcausse>	 ottomata: all done
[15:10:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1023 (T300006)', diff saved to https://phabricator.wikimedia.org/P19483 and previous config saved to /var/cache/conftool/dbconfig/20220127-151032-ladsgroup.json
[15:10:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:36] <stashbot>	 T300006: Upgrade es5 to Bullseye - https://phabricator.wikimedia.org/T300006
[15:10:51] <wikibugs>	 (CR) MMandere: [C: +2] install_server: Add drmrs doh second instance [puppet] - https://gerrit.wikimedia.org/r/757670 (https://phabricator.wikimedia.org/T300156) (owner: MMandere)
[15:13:29] <wikibugs>	 (PS2) Michael DiPietro: upgrade codfw1dev to bullseye [puppet] - https://gerrit.wikimedia.org/r/757647 (https://phabricator.wikimedia.org/T300254)
[15:17:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298559)', diff saved to https://phabricator.wikimedia.org/P19484 and previous config saved to /var/cache/conftool/dbconfig/20220127-151701-marostegui.json
[15:17:03] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[15:17:05] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[15:17:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:07] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[15:17:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T298559)', diff saved to https://phabricator.wikimedia.org/P19485 and previous config saved to /var/cache/conftool/dbconfig/20220127-151709-marostegui.json
[15:17:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:02] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, couple of very very minor nits inline, all optional." [software/netbox-extras] - https://gerrit.wikimedia.org/r/737914 (owner: Jbond)
[15:20:03] <wikibugs>	 (PS1) Elukey: eventstreams: move kafka config to new ca-bundle [deployment-charts] - https://gerrit.wikimedia.org/r/757672 (https://phabricator.wikimedia.org/T296064)
[15:20:46] <ottomata>	 dcausse:  yes thank you!
[15:22:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T285149)', diff saved to https://phabricator.wikimedia.org/P19486 and previous config saved to /var/cache/conftool/dbconfig/20220127-152235-marostegui.json
[15:22:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:40] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[15:27:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298559)', diff saved to https://phabricator.wikimedia.org/P19487 and previous config saved to /var/cache/conftool/dbconfig/20220127-152717-marostegui.json
[15:27:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:27:28] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[15:33:57] <wikibugs>	 (PS1) JHathaway: seddon: add ssh key & set kerberos to true [puppet] - https://gerrit.wikimedia.org/r/757673 (https://phabricator.wikimedia.org/T299988)
[15:34:45] <wikibugs>	 (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/757673 (https://phabricator.wikimedia.org/T299988) (owner: JHathaway)
[15:35:03] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/757673 (https://phabricator.wikimedia.org/T299988) (owner: JHathaway)
[15:37:13] <wikibugs>	 (CR) Ottomata: [V: +2 C: +2] eventstreams: move kafka config to new ca-bundle [deployment-charts] - https://gerrit.wikimedia.org/r/757672 (https://phabricator.wikimedia.org/T296064) (owner: Elukey)
[15:37:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P19488 and previous config saved to /var/cache/conftool/dbconfig/20220127-153739-marostegui.json
[15:37:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:49] <wikibugs>	 (CR) JHathaway: [C: +2] seddon: add ssh key & set kerberos to true [puppet] - https://gerrit.wikimedia.org/r/757673 (https://phabricator.wikimedia.org/T299988) (owner: JHathaway)
[15:40:15] <wikibugs>	 (CR) Jbond: O:mail::mx: Add mx specific block list (1 comment) [puppet] - https://gerrit.wikimedia.org/r/757517 (https://phabricator.wikimedia.org/T270618) (owner: Jbond)
[15:40:34] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting update to SSH key and Kerberos for Joseph Seddon - https://phabricator.wikimedia.org/T299988 (jhathaway) @Seddon your ssh key has been updated and your kerberos principal has been created, please check your email for details.
[15:42:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P19489 and previous config saved to /var/cache/conftool/dbconfig/20220127-154222-marostegui.json
[15:42:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:08] <wikibugs>	 SRE, SRE-Access-Requests, Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (jhathaway)
[15:44:11] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic-Icebox, netops, Patch-For-Review: Create Generalised blocking stratagy - https://phabricator.wikimedia.org/T270618 (jbond) just made the below comment on a ticket which I thought may be useful to capture here to give some context as to what we have to...
[15:45:05] <wikibugs>	 SRE, SRE-Access-Requests, Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (jhathaway) @Miriam & @Ottomata please approve
[15:45:09] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply on production
[15:45:09] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply on canary
[15:45:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:36] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: sync on canary
[15:45:37] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: sync on production
[15:45:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:19] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic-Icebox, netops, Patch-For-Review: Create Generalised blocking stratagy - https://phabricator.wikimedia.org/T270618 (jbond) >The intention of this CR is to slightly roll back that decision and exclude the MX hosts from the abuse_nets ferm context rules...
[15:48:03] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: NRodriguez uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T299336 (jhathaway) @NRodriguez confirmed their public key via gchat
[15:48:14] <wikibugs>	 (CR) JHathaway: [C: +2] NRodriguez: add new production ssh key [puppet] - https://gerrit.wikimedia.org/r/757488 (https://phabricator.wikimedia.org/T299336) (owner: JHathaway)
[15:48:25] <wikibugs>	 (PS2) JHathaway: NRodriguez: add new production ssh key [puppet] - https://gerrit.wikimedia.org/r/757488 (https://phabricator.wikimedia.org/T299336)
[15:48:32] <wikibugs>	 (CR) JHathaway: [V: +2 C: +2] NRodriguez: add new production ssh key [puppet] - https://gerrit.wikimedia.org/r/757488 (https://phabricator.wikimedia.org/T299336) (owner: JHathaway)
[15:49:26] <wikibugs>	 SRE, SRE-Access-Requests, Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (Miriam) Thanks @jhathaway, approved on my end!
[15:49:33] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: NRodriguez uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T299336 (jhathaway) @NRodriguez this change has been committed and should be ready to test in 30 or so minutes.
[15:50:57] <brennen>	 jouncebot now
[15:50:57] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 9 minute(s)
[15:51:00] <brennen>	 jouncebot next
[15:51:00] <jouncebot>	 In 1 hour(s) and 8 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220127T1700)
[15:52:21] <brennen>	 !log train 1.38.0-wmf.19 (T293960): no current blockers; rolling train forward to group1 before log triage meeting
[15:52:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:27] <stashbot>	 T293960: 1.38.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T293960
[15:52:32] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply on canary
[15:52:33] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply on production
[15:52:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P19490 and previous config saved to /var/cache/conftool/dbconfig/20220127-155244-marostegui.json
[15:52:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:10] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to AQS Cassandra cluster for Frances Goodwin - https://phabricator.wikimedia.org/T299688 (jhathaway) poked both approvers
[15:53:12] <wikibugs>	 (PS1) Elukey: helmfile.d: add circuit breaking settings for ml-serve's egress [deployment-charts] - https://gerrit.wikimedia.org/r/757675 (https://phabricator.wikimedia.org/T294414)
[15:53:19] <wikibugs>	 (CR) Cwhite: centrallog: clean up old /srv/syslog/host directories after grace period (1 comment) [puppet] - https://gerrit.wikimedia.org/r/757498 (https://phabricator.wikimedia.org/T300056) (owner: Herron)
[15:53:22] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Superset for Margeigh Novotny - https://phabricator.wikimedia.org/T299072 (jhathaway)
[15:53:33] <wikibugs>	 (PS1) Brennen Bearnes: group1 wikis to 1.38.0-wmf.19  refs T293960 [mediawiki-config] - https://gerrit.wikimedia.org/r/757676
[15:53:35] <wikibugs>	 (CR) Brennen Bearnes: [C: +2] group1 wikis to 1.38.0-wmf.19  refs T293960 [mediawiki-config] - https://gerrit.wikimedia.org/r/757676 (owner: Brennen Bearnes)
[15:53:42] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: sync on production
[15:53:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:26] <wikibugs>	 SRE, SRE-Access-Requests: Requesting LDAP-only access to analytics-privatedata-users for Madalina Ana - https://phabricator.wikimedia.org/T299587 (jhathaway)
[15:54:37] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: sync on canary
[15:54:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:54] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.38.0-wmf.19  refs T293960 [mediawiki-config] - https://gerrit.wikimedia.org/r/757676 (owner: Brennen Bearnes)
[15:56:20] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.19  refs T293960
[15:56:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:04] <Amir1>	 brennen: I'm around, ping me if anything goes sideways
[15:57:12] <logmsgbot>	 !log brennen@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.19  refs T293960 (duration: 00m 51s)
[15:57:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P19491 and previous config saved to /var/cache/conftool/dbconfig/20220127-155726-marostegui.json
[15:57:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:10] <wikibugs>	 SRE, SRE-Access-Requests: Requesting LDAP-only access to analytics-privatedata-users for Madalina Ana - https://phabricator.wikimedia.org/T299587 (jhathaway) a:jhathaway
[15:58:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[15:59:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[15:59:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[16:00:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[16:00:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:46] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:01:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[16:01:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:38] <dcausse>	 !log restarting blazegraph on wdqs1005 (jvm stuck for 2 hours)
[16:03:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[16:04:26] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 672716303920 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:04:48] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1005 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:07:10] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:07:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T285149)', diff saved to https://phabricator.wikimedia.org/P19492 and previous config saved to /var/cache/conftool/dbconfig/20220127-160749-marostegui.json
[16:07:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:54] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[16:12:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298559)', diff saved to https://phabricator.wikimedia.org/P19493 and previous config saved to /var/cache/conftool/dbconfig/20220127-161231-marostegui.json
[16:12:33] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[16:12:34] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[16:12:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:36] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[16:12:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T298559)', diff saved to https://phabricator.wikimedia.org/P19494 and previous config saved to /var/cache/conftool/dbconfig/20220127-161239-marostegui.json
[16:12:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298559)', diff saved to https://phabricator.wikimedia.org/P19495 and previous config saved to /var/cache/conftool/dbconfig/20220127-161344-marostegui.json
[16:13:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:38] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply on canary
[16:14:38] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply on production
[16:14:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:02] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: sync on canary
[16:15:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:38] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: sync on production
[16:15:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:54] <wikibugs>	 (CR) Btullis: [V: +1 C: +2] Launch the script with a given process name [puppet] - https://gerrit.wikimedia.org/r/757668 (https://phabricator.wikimedia.org/T295733) (owner: Btullis)
[16:17:32] <wikibugs>	 (PS24) Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] - https://gerrit.wikimedia.org/r/737914
[16:18:31] <wikibugs>	 (PS1) JHathaway: Add Madalina Ana to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/757678 (https://phabricator.wikimedia.org/T299587)
[16:19:16] <wikibugs>	 (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/757678 (https://phabricator.wikimedia.org/T299587) (owner: JHathaway)
[16:19:28] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply on main
[16:19:31] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply on canary
[16:19:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:36] <ottomata>	 elukey:  doing eventstreams ^^
[16:19:42] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:19:47] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0)
[16:19:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:05] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: sync on main
[16:20:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:16] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 525 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:20:24] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to AQS Cassandra cluster for Frances Goodwin - https://phabricator.wikimedia.org/T299688 (jhathaway) a:jhathaway
[16:20:29] <elukey>	 ottomata: ack!
[16:20:33] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init
[16:20:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:39] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply on main
[16:21:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:42] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply on canary
[16:21:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:21] <wikibugs>	 SRE, SRE-Access-Requests, Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (Ottomata) Approved.  @Miriam should this account have an expiry_date?
[16:22:39] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: sync on main
[16:22:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:50] <wikibugs>	 SRE, SRE-Access-Requests, Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (Ottomata) Also, I'm guessing this user will need Kerberos access too, correct?
[16:23:14] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply on main
[16:23:16] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply on canary
[16:23:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:00] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99)
[16:24:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:17] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: sync on main
[16:24:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:27] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 673599983472 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:25:50] <wikibugs>	 (CR) Jbond: [C: -1] Add Madalina Ana to analytics-privatedata-users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/757678 (https://phabricator.wikimedia.org/T299587) (owner: JHathaway)
[16:26:26] <wikibugs>	 SRE-tools, Infrastructure-Foundations, cloud-services-team (Kanban): spicerack: introduce GridEngine controller - https://phabricator.wikimedia.org/T300032 (Volans) @aborrero thanks for opening this task!  I had a chat with @jbond on what improvements we could make on our side to simplify the integrati...
[16:26:49] <wikibugs>	 (PS1) Btullis: Change the date at which the Movement Metrics tasks run [puppet] - https://gerrit.wikimedia.org/r/757679 (https://phabricator.wikimedia.org/T295733)
[16:27:14] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Superset for Margeigh Novotny - https://phabricator.wikimedia.org/T299072 (odimitrijevic) Approved.
[16:27:31] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to AQS Cassandra cluster for Frances Goodwin - https://phabricator.wikimedia.org/T299688 (odimitrijevic) Approved.
[16:27:31] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply on production
[16:27:32] <wikibugs>	 (PS2) JHathaway: Add Madalina Ana to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/757678 (https://phabricator.wikimedia.org/T299587)
[16:27:33] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply on canary
[16:27:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:46] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: sync on production
[16:27:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P19496 and previous config saved to /var/cache/conftool/dbconfig/20220127-162849-marostegui.json
[16:28:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:55] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Superset for Margeigh Novotny - https://phabricator.wikimedia.org/T299072 (jhathaway)
[16:30:13] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/757678 (https://phabricator.wikimedia.org/T299587) (owner: JHathaway)
[16:31:39] <wikibugs>	 (CR) JHathaway: Grant skvjold access to analytics-privatedata-users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/756708 (https://phabricator.wikimedia.org/T299072) (owner: JHathaway)
[16:32:27] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 673599983472 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:32:30] <wikibugs>	 (CR) JHathaway: Grant skvjold access to analytics-privatedata-users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/756708 (https://phabricator.wikimedia.org/T299072) (owner: JHathaway)
[16:34:53] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/756708 (https://phabricator.wikimedia.org/T299072) (owner: JHathaway)
[16:35:42] <wikibugs>	 (PS2) JHathaway: Grant skvjold access to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/756708 (https://phabricator.wikimedia.org/T299072)
[16:35:45] <wikibugs>	 (PS1) AOkoth: otrs: rename hieradata [labs/private] - https://gerrit.wikimedia.org/r/757681 (https://phabricator.wikimedia.org/T293942)
[16:37:09] <wikibugs>	 (Abandoned) Hnowlan: postgres: increase number of WAL files retained by master [puppet] - https://gerrit.wikimedia.org/r/643717 (owner: Hnowlan)
[16:39:00] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, feel free to test it on netbox-next before merging it if you want to be sure it works as expected." [software/netbox-extras] - https://gerrit.wikimedia.org/r/737914 (owner: Jbond)
[16:39:26] <wikibugs>	 (CR) JHathaway: [C: +2] Grant skvjold access to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/756708 (https://phabricator.wikimedia.org/T299072) (owner: JHathaway)
[16:43:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P19497 and previous config saved to /var/cache/conftool/dbconfig/20220127-164354-marostegui.json
[16:43:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:39] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Superset for Margeigh Novotny - https://phabricator.wikimedia.org/T299072 (jhathaway) @MNovotny_WMF you should be all set
[16:45:27] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 673599983472 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:48:52] <jinxer-wm>	 (Device rebooted) firing: Device rebooted   - https://alerts.wikimedia.org
[16:49:29] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 673599983472 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:49:31] <icinga-wm>	 PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: product-analytics-movement-metrics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:50:21] <ottomata>	 btullis: ^
[16:50:27] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply on production
[16:50:27] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply on canary
[16:50:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:09] <btullis>	 Ah sorry. I thought I downtimed it, but I only got the individual unit. I missed the whole systemd check. Please ignore it, I'll put it back.
[16:51:30] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: sync on production
[16:51:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:52] <jinxer-wm>	 (Device rebooted) resolved: Device rebooted   - https://alerts.wikimedia.org
[16:56:05] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 673599983472 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:58:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298559)', diff saved to https://phabricator.wikimedia.org/P19498 and previous config saved to /var/cache/conftool/dbconfig/20220127-165859-marostegui.json
[16:59:01] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[16:59:02] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[16:59:03] <wikibugs>	 (PS1) Btullis: Revert the change made for movement_metrics timer [puppet] - https://gerrit.wikimedia.org/r/757686 (https://phabricator.wikimedia.org/T295733)
[16:59:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:04] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[16:59:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T298559)', diff saved to https://phabricator.wikimedia.org/P19499 and previous config saved to /var/cache/conftool/dbconfig/20220127-165907-marostegui.json
[16:59:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:56] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on restbase1020.eqiad.wmnet with reason: Firmware upgrade
[16:59:58] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on restbase1020.eqiad.wmnet with reason: Firmware upgrade
[17:00:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:04] <cmjohnson1>	 !log updating firmware ganeti1007 and ganeti1015 T299527
[17:00:05] <jouncebot>	 jbond and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220127T1700).
[17:00:05] <jouncebot>	 RoanKattouw: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[17:00:06] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to AQS Cassandra cluster for Frances Goodwin - https://phabricator.wikimedia.org/T299688 (WDoranWMF) Approved.
[17:00:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:10] <stashbot>	 T299527: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed  - https://phabricator.wikimedia.org/T299527
[17:00:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298559)', diff saved to https://phabricator.wikimedia.org/P19500 and previous config saved to /var/cache/conftool/dbconfig/20220127-170013-marostegui.json
[17:00:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:19] <rzl>	 RoanKattouw: 👋 looking
[17:00:22] <wikibugs>	 (CR) Btullis: [C: +2] Revert the change made for movement_metrics timer [puppet] - https://gerrit.wikimedia.org/r/757686 (https://phabricator.wikimedia.org/T295733) (owner: Btullis)
[17:00:34] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: sync on canary
[17:00:35] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1020.eqiad.wmnet
[17:00:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:11] <wikibugs>	 (CR) Andrew Bogott: [C: +1] "this looks good to me -- I agree that we should simply drop all python2 references until we find a case where that breaks something." [puppet] - https://gerrit.wikimedia.org/r/757647 (https://phabricator.wikimedia.org/T300254) (owner: Michael DiPietro)
[17:01:15] <cmjohnson1>	 !log updating firmware restbase1020 T299652
[17:01:18] <rzl>	 RoanKattouw: can you get a +1 from someone more familiar please? I'm happy to deploy but I don't want to be the only reviewer on this :)
[17:01:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:20] <stashbot>	 T299652: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652
[17:01:31] <rzl>	 RoanKattouw: (no need to get it done within the 30min window, ping me any time)
[17:02:55] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 673623772720 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:02:57] <icinga-wm>	 RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:04:27] <icinga-wm>	 ACKNOWLEDGEMENT - Device not healthy -SMART- on restbase2010 is CRITICAL: cluster=restbase device={sde,sdf,sdg,sdh,sdi,sdj} instance=restbase2010 job=node site=codfw Hnowlan Devices not part of filesystems, host to be decommissioned https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase2010&var-datasource=codfw+prometheus/ops
[17:07:29] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 674250804784 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:07:32] <wikibugs>	 (CR) Dzahn: [C: +1] "compared in private repo, good to go!" [labs/private] - https://gerrit.wikimedia.org/r/757681 (https://phabricator.wikimedia.org/T293942) (owner: AOkoth)
[17:08:16] <wikibugs>	 (CR) AOkoth: [C: +2] otrs: rename hieradata [labs/private] - https://gerrit.wikimedia.org/r/757681 (https://phabricator.wikimedia.org/T293942) (owner: AOkoth)
[17:08:32] <wikibugs>	 (CR) AOkoth: [V: +2 C: +2] otrs: rename hieradata [labs/private] - https://gerrit.wikimedia.org/r/757681 (https://phabricator.wikimedia.org/T293942) (owner: AOkoth)
[17:08:41] <wikibugs>	 (CR) Dzahn: [V: +2 C: +1] otrs: rename hieradata [labs/private] - https://gerrit.wikimedia.org/r/757681 (https://phabricator.wikimedia.org/T293942) (owner: AOkoth)
[17:08:59] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] Do not set wgTrustedXffFile [mediawiki-config] - https://gerrit.wikimedia.org/r/749734 (https://phabricator.wikimedia.org/T298243) (owner: Urbanecm)
[17:09:05] <wikibugs>	 (CR) Volans: "Did a first pass, I skipped the test file, I'll get back to it later. Most comments are nits/typos" [software/spicerack] - https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: Jbond)
[17:09:37] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:10:09] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:13:53] <wikibugs>	 (PS4) Dzahn: otrs: rename profile to vrts [puppet] - https://gerrit.wikimedia.org/r/757519 (https://phabricator.wikimedia.org/T293942) (owner: AOkoth)
[17:14:55] <wikibugs>	 SRE, ops-codfw, ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (Cmjohnson)
[17:15:11] <wikibugs>	 SRE, ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (Cmjohnson)
[17:15:15] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1020.eqiad.wmnet
[17:15:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P19501 and previous config saved to /var/cache/conftool/dbconfig/20220127-171518-marostegui.json
[17:15:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:29] <wikibugs>	 SRE, ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (Cmjohnson) @MoritzMuehlenhoff both 1007/1015 are updated
[17:16:30] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 674901036592 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:17:17] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on restbase1021.eqiad.wmnet with reason: Firmware upgrade
[17:17:19] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on restbase1021.eqiad.wmnet with reason: Firmware upgrade
[17:17:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:17:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:17:39] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1021.eqiad.wmnet
[17:17:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:43] <cmjohnson1>	 !log updating firmware restbase1021 T299652
[17:21:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:47] <stashbot>	 T299652: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652
[17:21:55] <wikibugs>	 (CR) Dzahn: [C: +1] "yes, reviewed and compiled. currently puppet is broken on otrs1001 and this change fixes it" [puppet] - https://gerrit.wikimedia.org/r/757519 (https://phabricator.wikimedia.org/T293942) (owner: AOkoth)
[17:21:57] <wikibugs>	 (CR) Volans: "Did a first pass, nothing major, just nits." [cookbooks] - https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: Jbond)
[17:22:20] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init
[17:22:22] <wikibugs>	 (CR) AOkoth: [C: +2] otrs: rename profile to vrts [puppet] - https://gerrit.wikimedia.org/r/757519 (https://phabricator.wikimedia.org/T293942) (owner: AOkoth)
[17:22:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:12] <wikibugs>	 (CR) Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1003/33481/" [puppet] - https://gerrit.wikimedia.org/r/757519 (https://phabricator.wikimedia.org/T293942) (owner: AOkoth)
[17:24:27] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to AQS Cassandra cluster for Frances Goodwin - https://phabricator.wikimedia.org/T299688 (jhathaway)
[17:29:31] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to AQS Cassandra cluster for Frances Goodwin - https://phabricator.wikimedia.org/T299688 (jhathaway) ssh key confirmed via gchat
[17:30:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P19502 and previous config saved to /var/cache/conftool/dbconfig/20220127-173022-marostegui.json
[17:30:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:01] <RoanKattouw>	 rzl: Sure, I'll ask Scott if he's willing to +1 again
[17:31:35] <rzl>	 RoanKattouw: thanks! sorry for the extra runaround
[17:33:41] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1021.eqiad.wmnet
[17:33:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:57] <wikibugs>	 (CR) Catrope: doc.wikimedia.org CSP: Also allow images from upload.wikimedia.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/757049 (https://phabricator.wikimedia.org/T285570) (owner: Catrope)
[17:33:58] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on restbase1022.eqiad.wmnet with reason: Firmware upgrade
[17:34:00] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on restbase1022.eqiad.wmnet with reason: Firmware upgrade
[17:34:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:10] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1022.eqiad.wmnet
[17:34:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:43] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 675311387056 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:34:50] <wikibugs>	 (PS1) Ladsgroup: Revert "es1023: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/757483
[17:34:56] <wikibugs>	 (PS2) Ladsgroup: Revert "es1023: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/757483
[17:35:35] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] Revert "es1023: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/757483 (owner: Ladsgroup)
[17:37:06] <wikibugs>	 (PS1) JHathaway: Add Frances Goodwin to aqs-admins [puppet] - https://gerrit.wikimedia.org/r/757691 (https://phabricator.wikimedia.org/T299688)
[17:39:17] <wikibugs>	 ops-drmrs, DC-Ops, Infrastructure-Foundations, netops: Q3:(Need By: TBD) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (RobH) p:Triage→High
[17:39:27] <wikibugs>	 ops-drmrs, DC-Ops, Infrastructure-Foundations, netops: Q3:(Need By: TBD) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (RobH)
[17:41:33] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 675311387056 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:45:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298559)', diff saved to https://phabricator.wikimedia.org/P19503 and previous config saved to /var/cache/conftool/dbconfig/20220127-174527-marostegui.json
[17:45:29] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[17:45:31] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[17:45:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:33] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[17:45:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:38] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance
[17:45:40] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance
[17:45:41] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance
[17:45:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:52] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Maintenance
[17:45:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:55] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[17:45:57] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[17:45:58] <taavi>	 brennen: jeena: fyi https://phabricator.wikimedia.org/T299289#7657159
[17:45:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:58] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[17:46:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:46:02] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[17:46:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:46:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:46:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T298559)', diff saved to https://phabricator.wikimedia.org/P19504 and previous config saved to /var/cache/conftool/dbconfig/20220127-174606-marostegui.json
[17:46:07] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 675311387056 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:46:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:46:34] <wikibugs>	 (CR) RLazarus: [C: +1] Add Frances Goodwin to aqs-admins [puppet] - https://gerrit.wikimedia.org/r/757691 (https://phabricator.wikimedia.org/T299688) (owner: JHathaway)
[17:47:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298559)', diff saved to https://phabricator.wikimedia.org/P19505 and previous config saved to /var/cache/conftool/dbconfig/20220127-174712-marostegui.json
[17:47:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:31] <wikibugs>	 (CR) JHathaway: [C: +2] Add Frances Goodwin to aqs-admins [puppet] - https://gerrit.wikimedia.org/r/757691 (https://phabricator.wikimedia.org/T299688) (owner: JHathaway)
[17:47:33] <wikibugs>	 SRE, Traffic-Icebox, serviceops: Use Envoy instead of nginx for TLS termination on Appservers - https://phabricator.wikimedia.org/T240576 (RLazarus) Open→Resolved a:RLazarus Good news! This is long since done, tidying it up.
[17:47:51] <wikibugs>	 (CR) JHathaway: [C: +2] Add Madalina Ana to analytics-privatedata-users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/757678 (https://phabricator.wikimedia.org/T299587) (owner: JHathaway)
[17:48:16] <brennen>	 taavi: thanks.
[17:48:30] <wikibugs>	 (CR) Michael DiPietro: [C: +2] upgrade codfw1dev to bullseye [puppet] - https://gerrit.wikimedia.org/r/757647 (https://phabricator.wikimedia.org/T300254) (owner: Michael DiPietro)
[17:48:32] <taavi>	 it doesn't affect group2, not sure if rollback worthy
[17:51:04] <wikibugs>	 (PS3) JHathaway: Add Madalina Ana to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/757678 (https://phabricator.wikimedia.org/T299587)
[17:51:32] <wikibugs>	 (CR) SBassett: [C: +1] doc.wikimedia.org CSP: Also allow images from upload.wikimedia.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/757049 (https://phabricator.wikimedia.org/T285570) (owner: Catrope)
[17:52:39] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1022.eqiad.wmnet
[17:52:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:52:52] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on restbase1023.eqiad.wmnet with reason: Firmware upgrade
[17:52:52] <wikibugs>	 (CR) JHathaway: [C: +2] Add Madalina Ana to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/757678 (https://phabricator.wikimedia.org/T299587) (owner: JHathaway)
[17:52:53] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on restbase1023.eqiad.wmnet with reason: Firmware upgrade
[17:52:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:52:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:12] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1023.eqiad.wmnet
[17:53:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:51] <wikibugs>	 SRE, Infrastructure-Foundations, fundraising-tech-ops, netops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (Papaul) I chat with @Jgreen in IRC we will be doing the upgrade next week on Tuesday the 1st at 10:30 CT
[17:55:18] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to AQS Cassandra cluster for Frances Goodwin - https://phabricator.wikimedia.org/T299688 (jhathaway) @FGoodwin this should now be setup, please give it a go!
[17:56:03] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting LDAP-only access to analytics-privatedata-users for Madalina Ana - https://phabricator.wikimedia.org/T299587 (jhathaway) @Madalina this should now be setup, please give it a try
[17:57:02] <wikibugs>	 (PS6) Krinkle: Improve error message if wikiversions.php has wrong format [mediawiki-config] - https://gerrit.wikimedia.org/r/622408 (owner: Ahmon Dancy)
[17:57:22] <wikibugs>	 (PS7) Krinkle: multiversion: Improve error message if wikiversions.php has wrong format [mediawiki-config] - https://gerrit.wikimedia.org/r/622408 (owner: Ahmon Dancy)
[17:57:29] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 675311387056 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:57:40] <wikibugs>	 (CR) Krinkle: [C: +1] "Good to go." [mediawiki-config] - https://gerrit.wikimedia.org/r/622408 (owner: Ahmon Dancy)
[17:58:17] <wikibugs>	 (PS1) Brennen Bearnes: Revert "Escape various messages in WikibaseMediaInfo" [extensions/WikibaseMediaInfo] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/757485 (https://phabricator.wikimedia.org/T299289)
[17:58:38] <wikibugs>	 (CR) Krinkle: [C: -1] "Per chat (clearing from review queue)." [mediawiki-config] - https://gerrit.wikimedia.org/r/756718 (owner: Ahmon Dancy)
[17:58:44] <wikibugs>	 (PS3) Gergő Tisza: GrowthExperiments: Start add image experiment for desktop users [mediawiki-config] - https://gerrit.wikimedia.org/r/752657 (https://phabricator.wikimedia.org/T298122) (owner: Kosta Harlan)
[17:59:30] <wikibugs>	 (PS1) JMeybohm: Deploy istio-ingressgateway as daemonset [deployment-charts] - https://gerrit.wikimedia.org/r/757696 (https://phabricator.wikimedia.org/T290966)
[18:00:04] <jouncebot>	 chrisalbon and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220127T1800).
[18:00:22] <wikibugs>	 (CR) JMeybohm: "߷" [deployment-charts] - https://gerrit.wikimedia.org/r/757696 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[18:02:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P19506 and previous config saved to /var/cache/conftool/dbconfig/20220127-180217-marostegui.json
[18:02:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:47] <wikibugs>	 (CR) Elukey: [C: +1] Deploy istio-ingressgateway as daemonset [deployment-charts] - https://gerrit.wikimedia.org/r/757696 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[18:06:03] <rzl>	 RoanKattouw: thanks -- let me know when's a good time, I'll deploy and you can test
[18:06:39] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 675789368880 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:06:57] <RoanKattouw>	 rzl: Ready whenever you are
[18:07:03] <rzl>	 👍
[18:07:12] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1023.eqiad.wmnet
[18:07:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:17] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on restbase1024.eqiad.wmnet with reason: Firmware upgrade
[18:07:19] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on restbase1024.eqiad.wmnet with reason: Firmware upgrade
[18:07:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:21] <wikibugs>	 (CR) RLazarus: [C: +2] doc.wikimedia.org CSP: Also allow images from upload.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/757049 (https://phabricator.wikimedia.org/T285570) (owner: Catrope)
[18:07:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:34] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1024.eqiad.wmnet
[18:07:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:40] <rzl>	 RoanKattouw: done, should be at all doc* hosts
[18:10:41] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: toolforge: automated-tests: add basic python webservice grid test [puppet] - https://gerrit.wikimedia.org/r/757697
[18:13:28] <wikibugs>	 (PS1) Bking: deployment-prep: add cergen config for elastic service [labs/private] - https://gerrit.wikimedia.org/r/757699 (https://phabricator.wikimedia.org/T299797)
[18:13:46] <RoanKattouw>	 rzl: It's working, thanks!
[18:13:51] <wikibugs>	 (PS3) Clare Ming: Update config for idwiki: [mediawiki-config] - https://gerrit.wikimedia.org/r/757500 (https://phabricator.wikimedia.org/T299676)
[18:13:51] <rzl>	 \o/
[18:15:05] <wikibugs>	 (CR) Majavah: deployment-prep: add cergen config for elastic service (2 comments) [labs/private] - https://gerrit.wikimedia.org/r/757699 (https://phabricator.wikimedia.org/T299797) (owner: Bking)
[18:16:50] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[18:16:52] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[18:16:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T285149)', diff saved to https://phabricator.wikimedia.org/P19507 and previous config saved to /var/cache/conftool/dbconfig/20220127-181656-marostegui.json
[18:17:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:01] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[18:17:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P19508 and previous config saved to /var/cache/conftool/dbconfig/20220127-181722-marostegui.json
[18:17:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:29] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 676650512232 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:20:11] <brennen>	 going to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseMediaInfo/+/757485
[18:20:33] <wikibugs>	 (CR) Brennen Bearnes: [C: +2] Revert "Escape various messages in WikibaseMediaInfo" [extensions/WikibaseMediaInfo] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/757485 (https://phabricator.wikimedia.org/T299289) (owner: Brennen Bearnes)
[18:20:34] <logmsgbot>	 !log mdipietro@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol2001-dev.wikimedia.org with OS bullseye
[18:20:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:35] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1024.eqiad.wmnet
[18:23:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:21] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on restbase[1025-1027].eqiad.wmnet with reason: Firmware upgrade
[18:24:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:25] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on restbase[1025-1027].eqiad.wmnet with reason: Firmware upgrade
[18:24:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:35] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase102[5-7].eqiad.wmnet
[18:24:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:08] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply on production
[18:25:08] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply on canary
[18:25:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:11] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply on production
[18:25:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:28] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: sync on canary
[18:25:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:26:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T285149)', diff saved to https://phabricator.wikimedia.org/P19509 and previous config saved to /var/cache/conftool/dbconfig/20220127-182627-marostegui.json
[18:26:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:26:32] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[18:27:39] <wikibugs>	 SRE, ops-drmrs, DC-Ops, Infrastructure-Foundations, netops: Q3:(Need By: TBD) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (RobH)
[18:28:38] <wikibugs>	 SRE, ops-drmrs, DC-Ops, Infrastructure-Foundations, netops: Q3:(Need By: TBD) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (RobH) I've drafted the directions for remote hands, translating the above diagram to a step by step direction for them to rack our routers...
[18:29:52] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 676743411344 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:30:40] <icinga-wm>	 PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:32:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298559)', diff saved to https://phabricator.wikimedia.org/P19510 and previous config saved to /var/cache/conftool/dbconfig/20220127-183226-marostegui.json
[18:32:28] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[18:32:30] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[18:32:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:31] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[18:32:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T298559)', diff saved to https://phabricator.wikimedia.org/P19511 and previous config saved to /var/cache/conftool/dbconfig/20220127-183234-marostegui.json
[18:32:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298559)', diff saved to https://phabricator.wikimedia.org/P19512 and previous config saved to /var/cache/conftool/dbconfig/20220127-183340-marostegui.json
[18:33:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:45] <wikibugs>	 (Merged) jenkins-bot: Revert "Escape various messages in WikibaseMediaInfo" [extensions/WikibaseMediaInfo] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/757485 (https://phabricator.wikimedia.org/T299289) (owner: Brennen Bearnes)
[18:38:26] <wikibugs>	 SRE, ops-codfw, ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (Cmjohnson)
[18:40:26] <wikibugs>	 SRE, ops-drmrs, DC-Ops, Infrastructure-Foundations, netops: Q3:(Need By: TBD) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (RobH)
[18:40:37] <wikibugs>	 (PS1) Ryan Kemper: elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700
[18:41:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P19513 and previous config saved to /var/cache/conftool/dbconfig/20220127-184132-marostegui.json
[18:41:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:44] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase102[5-7].eqiad.wmnet
[18:41:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:15] <wikibugs>	 (CR) Ryan Kemper: elastic: install elasticsearch-oss from component (3 comments) [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[18:42:20] <logmsgbot>	 !log brennen@deploy1002 Synchronized php-1.38.0-wmf.19/extensions/WikibaseMediaInfo: Backport: [[gerrit:757485|Revert "Escape various messages in WikibaseMediaInfo" (T299289)]] (duration: 00m 52s)
[18:42:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:32] <wikibugs>	 (CR) jerkins-bot: [V: -1] elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[18:43:06] <wikibugs>	 (Abandoned) Ryan Kemper: elasticsearch: fix package dependency issue [puppet] - https://gerrit.wikimedia.org/r/753985 (https://phabricator.wikimedia.org/T299177) (owner: Bking)
[18:43:07] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply on production
[18:43:08] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply on canary
[18:43:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[18:43:25] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: sync on canary
[18:43:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:15] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: sync on production
[18:45:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[18:47:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[18:47:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P19514 and previous config saved to /var/cache/conftool/dbconfig/20220127-184845-marostegui.json
[18:48:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:49:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[18:49:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:50:53] <wikibugs>	 (PS1) Michael DiPietro: upgrade codfw1dev to bullseye [puppet] - https://gerrit.wikimedia.org/r/757702 (https://phabricator.wikimedia.org/T300254)
[18:51:39] <wikibugs>	 SRE, ops-codfw, DC-Ops: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (Cmjohnson) a:Cmjohnson→Papaul
[18:51:46] <wikibugs>	 SRE, ops-drmrs, DC-Ops, Infrastructure-Foundations, netops: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (RobH)
[18:51:50] <wikibugs>	 SRE, ops-codfw, DC-Ops: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (Cmjohnson) assigned this to @papaul for codfw portion of the task, removed the ops-eqiad.
[18:52:09] <wikibugs>	 SRE, ops-drmrs, DC-Ops, Infrastructure-Foundations, netops: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (RobH) a:RobH→ayounsi
[18:55:14] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, Patch-For-Review, cloud-services-team (Hardware): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (Cmjohnson) @Andrew How is this looking so far?
[18:55:31] <wikibugs>	 (CR) Andrew Bogott: [C: +1] upgrade codfw1dev to bullseye [puppet] - https://gerrit.wikimedia.org/r/757702 (https://phabricator.wikimedia.org/T300254) (owner: Michael DiPietro)
[18:56:21] <wikibugs>	 SRE, ops-eqiad, DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (Cmjohnson)
[18:56:23] <wikibugs>	 (PS1) Andrew Bogott: cloudmetrics rsync: run on the half hour rather than on the hour [puppet] - https://gerrit.wikimedia.org/r/757703 (https://phabricator.wikimedia.org/T300138)
[18:56:35] <wikibugs>	 (CR) Michael DiPietro: [C: +2] upgrade codfw1dev to bullseye [puppet] - https://gerrit.wikimedia.org/r/757702 (https://phabricator.wikimedia.org/T300254) (owner: Michael DiPietro)
[18:56:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P19515 and previous config saved to /var/cache/conftool/dbconfig/20220127-185637-marostegui.json
[18:56:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:45] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Rack msw2-eqiad in new cage - https://phabricator.wikimedia.org/T298980 (Cmjohnson) Open→Resolved The switch has been relocated by @Jclark-ctr
[18:59:35] <wikibugs>	 (CR) Andrew Bogott: [C: +2] hieradata: add new bullseye eqiad1 bastions [puppet] - https://gerrit.wikimedia.org/r/757397 (owner: Majavah)
[19:00:04] <jouncebot>	 RoanKattouw and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220127T1900).
[19:00:04] <jouncebot>	 tgr: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[19:00:31] <wikibugs>	 (CR) Ebernhardson: elastic: install elasticsearch-oss from component (4 comments) [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[19:01:51] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, Patch-For-Review, cloud-services-team (Hardware): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (Andrew) so far so good!
[19:02:46] <tgr>	 o/
[19:03:01] <tgr>	 I can self-serve
[19:03:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P19516 and previous config saved to /var/cache/conftool/dbconfig/20220127-190349-marostegui.json
[19:03:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:05:20] <wikibugs>	 (CR) Gergő Tisza: [C: +2] GrowthExperiments: Start add image experiment for desktop users [mediawiki-config] - https://gerrit.wikimedia.org/r/752657 (https://phabricator.wikimedia.org/T298122) (owner: Kosta Harlan)
[19:06:03] <wikibugs>	 (Merged) jenkins-bot: GrowthExperiments: Start add image experiment for desktop users [mediawiki-config] - https://gerrit.wikimedia.org/r/752657 (https://phabricator.wikimedia.org/T298122) (owner: Kosta Harlan)
[19:07:42] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 677504224816 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:09:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:09:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:12] <wikibugs>	 (PS1) Cmjohnson: Updating netboot.cfg to reflect change for cloudbackup1003[4] [puppet] - https://gerrit.wikimedia.org/r/757705 (https://phabricator.wikimedia.org/T293934)
[19:10:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:10:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:10:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:54] <wikibugs>	 (PS2) Andrew Bogott: cloudmetrics rsync: run on the half hour rather than on the hour [puppet] - https://gerrit.wikimedia.org/r/757703 (https://phabricator.wikimedia.org/T300138)
[19:11:02] <wikibugs>	 (CR) Cmjohnson: [C: +2] Updating netboot.cfg to reflect change for cloudbackup1003[4] [puppet] - https://gerrit.wikimedia.org/r/757705 (https://phabricator.wikimedia.org/T293934) (owner: Cmjohnson)
[19:11:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T285149)', diff saved to https://phabricator.wikimedia.org/P19517 and previous config saved to /var/cache/conftool/dbconfig/20220127-191141-marostegui.json
[19:11:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:11:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:46] <stashbot>	 T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[19:11:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:50] <wikibugs>	 (CR) Dzahn: [C: +2] httpbb: move tests for static-bugzilla to new file for miscweb-k8s [puppet] - https://gerrit.wikimedia.org/r/757505 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn)
[19:13:14] <wikibugs>	 (CR) Andrew Bogott: [C: +2] cloudmetrics rsync: run on the half hour rather than on the hour [puppet] - https://gerrit.wikimedia.org/r/757703 (https://phabricator.wikimedia.org/T300138) (owner: Andrew Bogott)
[19:13:32] <mutante>	 in before merge conflict
[19:16:43] <wikibugs>	 (PS1) Michael DiPietro: upgrade codfw1dev to bullseye [puppet] - https://gerrit.wikimedia.org/r/757729 (https://phabricator.wikimedia.org/T300254)
[19:16:45] <wikibugs>	 (PS1) DLynch: Launch DiscussionTools new topic tool a/b test [mediawiki-config] - https://gerrit.wikimedia.org/r/757730 (https://phabricator.wikimedia.org/T291308)
[19:16:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:16:48] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 678217648432 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:16:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:18:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:18:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:09] <wikibugs>	 (CR) Andrew Bogott: [C: +1] "If that's where it is, then that's where it is" [puppet] - https://gerrit.wikimedia.org/r/757729 (https://phabricator.wikimedia.org/T300254) (owner: Michael DiPietro)
[19:18:31] <wikibugs>	 (CR) Michael DiPietro: [C: +2] upgrade codfw1dev to bullseye [puppet] - https://gerrit.wikimedia.org/r/757729 (https://phabricator.wikimedia.org/T300254) (owner: Michael DiPietro)
[19:18:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298559)', diff saved to https://phabricator.wikimedia.org/P19518 and previous config saved to /var/cache/conftool/dbconfig/20220127-191854-marostegui.json
[19:18:56] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[19:18:58] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[19:18:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:59] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[19:19:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T298559)', diff saved to https://phabricator.wikimedia.org/P19519 and previous config saved to /var/cache/conftool/dbconfig/20220127-191902-marostegui.json
[19:19:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:13] <urbanecm>	 tgr: hey! Can you let me know when you're done with deployment?
[19:19:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:19:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:33] <Kemayo>	 Is the backport window still sufficiently open that I could slip a config patch into it, or should I go sign up for the next window?
[19:20:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298559)', diff saved to https://phabricator.wikimedia.org/P19520 and previous config saved to /var/cache/conftool/dbconfig/20220127-192009-marostegui.json
[19:20:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:41] <urbanecm>	 Kemayo: i feel like it should be possible to do your patch now (once tgr finishes)
[19:21:10] <Kemayo>	 urbanecm: Nifty. In that case, it's https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/757730
[19:21:33] <urbanecm>	 Kemayo: noted. Can you add it to the calendar too? :))
[19:23:44] <Kemayo>	 urbanecm: Okay, it's added.
[19:25:58] <wikibugs>	 (CR) Dzahn: "old tests fixed. new tests work on k8s: [deploy1002:~] $ httpbb /srv/deployment/httpbb-tests/miscweb/test_miscweb-k8s.yaml --hosts miscweb" [puppet] - https://gerrit.wikimedia.org/r/757505 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn)
[19:26:14] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 678217648656 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:27:56] <logmsgbot>	 !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:752657|GrowthExperiments: Start add image experiment for desktop users (T298122)]] (duration: 00m 51s)
[19:28:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:28:02] <stashbot>	 T298122: Add an image: experiment (desktop) - https://phabricator.wikimedia.org/T298122
[19:28:17] <tgr>	 > 19:27:37 Check 'Logstash Error rate for mw1450.eqiad.wmnet' failed: ERROR: 50% OVER_THRESHOLD (Avg. Error rate: Before: 0.03, After: 2.00, Threshold: 1.00)
[19:28:25] <tgr>	 sounds foreboding
[19:28:47] <wikibugs>	 (CR) Dzahn: "Jaime, the role that used this fileset has been removed. static-bugzilla moved from ganeti/puppet to kubernetes. The data is stored in a s" [puppet] - https://gerrit.wikimedia.org/r/757509 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn)
[19:29:33] <wikibugs>	 (Abandoned) Dzahn: miscweb: bump version to 2022-01-25-150544-production [deployment-charts] - https://gerrit.wikimedia.org/r/757060 (owner: Dzahn)
[19:30:02] <tgr>	 would be nice to have a logstash link in that message.
[19:30:34] <urbanecm>	 probably :)
[19:30:52] <urbanecm>	 but if it's the only server with higher rate, i'd say it's fine
[19:30:54] <urbanecm>	 it can be a temp fluke
[19:30:55] <tgr>	 the mediawiki-errors dashboard at least gives 0 errors for that host.
[19:31:00] <urbanecm>	 good
[19:31:54] <tgr>	 I'll call it done
[19:32:09] <urbanecm>	 okay
[19:32:12] <urbanecm>	 so can i take over tgr now?
[19:32:25] <tgr>	 yes, thx
[19:32:37] <urbanecm>	 Kemayo: hey! We can do your patch now. Still around?
[19:32:54] <Kemayo>	 urbanecm: I am ready
[19:33:05] <urbanecm>	 great! I'll start and let you know when ready for testing
[19:33:10] <tgr>	 we do have a metric ton of 'DBConnRef::numRows was deprecated' errors on other hosts - looks like that deprecation was a bit premature
[19:33:57] <wikibugs>	 (CR) Urbanecm: [C: +2] Launch DiscussionTools new topic tool a/b test [mediawiki-config] - https://gerrit.wikimedia.org/r/757730 (https://phabricator.wikimedia.org/T291308) (owner: DLynch)
[19:34:40] <wikibugs>	 (Merged) jenkins-bot: Launch DiscussionTools new topic tool a/b test [mediawiki-config] - https://gerrit.wikimedia.org/r/757730 (https://phabricator.wikimedia.org/T291308) (owner: DLynch)
[19:34:57] <urbanecm>	 Kemayo: it's at mwdebug1001, can you test please?
[19:35:05] <Kemayo>	 Sure thing, one second.
[19:35:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P19521 and previous config saved to /var/cache/conftool/dbconfig/20220127-193514-marostegui.json
[19:35:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:41] <wikibugs>	 SRE, Codex, WVUI, ContentSecurityPolicy, SecTeam-Processed: WVUI and Codex demos: CSP stopping typeahead input demos working - https://phabricator.wikimedia.org/T285570 (Jdforrester-WMF) Open→Resolved Success:  {F34933393, size=full}
[19:36:14] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:36:18] <mutante>	 !log purging font* / xfont* packages from further eqiad appservers (mw14*) for T294378
[19:36:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:22] <stashbot>	 T294378: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378
[19:37:31] <Kemayo>	 urbanecm: Okay, looks good! (Sorry, it took a second because it had logged in and logged out bits to test.)
[19:37:39] <urbanecm>	 no problem :)
[19:37:41] <urbanecm>	 syncing
[19:38:52] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 2c8561c1c0aa6b4f5f8202972b7b28723337e88e: Launch DiscussionTools new topic tool a/b test (T291308) (duration: 00m 51s)
[19:38:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:57] <stashbot>	 T291308: Make config change to start New Discussion Tool A/B Test - https://phabricator.wikimedia.org/T291308
[19:39:17] <urbanecm>	 Kemayo: should be live!
[19:39:18] <urbanecm>	 anything else?
[19:39:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:39:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:02] <Kemayo>	 urbanecm: nothing else, thanks!
[19:40:12] <urbanecm>	 np
[19:40:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:40:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:40:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:41:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:42:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:42:12] <mutante>	 !log purging font packages from wtp* (parsoid eqiad) 
[19:42:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:49] <mutante>	 !log purging font packages from parse* (parsoid codfw) 
[19:43:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:45:09] <wikibugs>	 SRE, serviceops, Patch-For-Review: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 (Dzahn) purged from all of parsoid (wtp* and parse*) and the rest of eqiad (mw14*)
[19:46:07] <wikibugs>	 Puppet, Infrastructure-Foundations, Release-Engineering-Team, User-brennen: logspam-watch: sorting by message (column 6) appears broken - https://phabricator.wikimedia.org/T300298 (brennen)
[19:47:07] <wikibugs>	 (PS2) Urbanecm: Do not set wgTrustedXffFile [mediawiki-config] - https://gerrit.wikimedia.org/r/749734 (https://phabricator.wikimedia.org/T298243)
[19:47:37] <wikibugs>	 (CR) Urbanecm: [C: +2] Do not set wgTrustedXffFile [mediawiki-config] - https://gerrit.wikimedia.org/r/749734 (https://phabricator.wikimedia.org/T298243) (owner: Urbanecm)
[19:48:04] <wikibugs>	 (PS1) Jdlrobson: Enable migration mode on all group 0, group 1 and desktop-improvement wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/757733 (https://phabricator.wikimedia.org/T299927)
[19:48:06] <wikibugs>	 (PS1) Jdlrobson: Migration mode enabled everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/757734 (https://phabricator.wikimedia.org/T299927)
[19:48:08] <wikibugs>	 (PS1) Jdlrobson: Disable A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/757735 (https://phabricator.wikimedia.org/T297924)
[19:48:17] <wikibugs>	 (Merged) jenkins-bot: Do not set wgTrustedXffFile [mediawiki-config] - https://gerrit.wikimedia.org/r/749734 (https://phabricator.wikimedia.org/T298243) (owner: Urbanecm)
[19:49:04] <wikibugs>	 (CR) jerkins-bot: [V: -1] Enable migration mode on all group 0, group 1 and desktop-improvement wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/757733 (https://phabricator.wikimedia.org/T299927) (owner: Jdlrobson)
[19:49:14] <wikibugs>	 (CR) jerkins-bot: [V: -1] Migration mode enabled everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/757734 (https://phabricator.wikimedia.org/T299927) (owner: Jdlrobson)
[19:49:24] <wikibugs>	 (CR) jerkins-bot: [V: -1] Disable A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/757735 (https://phabricator.wikimedia.org/T297924) (owner: Jdlrobson)
[19:50:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P19524 and previous config saved to /var/cache/conftool/dbconfig/20220127-195019-marostegui.json
[19:50:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:50:43] <wikibugs>	 (PS2) Jdlrobson: Disable A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/757735 (https://phabricator.wikimedia.org/T297924)
[19:50:51] <wikibugs>	 (CR) Jdlrobson: "Clare can you check this one and deploy it along with the idwiki change ?" [mediawiki-config] - https://gerrit.wikimedia.org/r/757735 (https://phabricator.wikimedia.org/T297924) (owner: Jdlrobson)
[19:51:02] <wikibugs>	 (PS2) Jdlrobson: Enable migration mode on all group 0, group 1 and desktop-improvement wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/757733 (https://phabricator.wikimedia.org/T299927)
[19:51:15] <wikibugs>	 (PS2) Jdlrobson: Migration mode enabled everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/757734 (https://phabricator.wikimedia.org/T299927)
[19:52:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:52:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:52:37] <wikibugs>	 (PS3) Urbanecm: Remove trusted-xff.php from wmf-config [mediawiki-config] - https://gerrit.wikimedia.org/r/749735 (https://phabricator.wikimedia.org/T298243)
[19:52:49] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 6fa62c58c04929d7327d8f07dbd32b6139f58ccf: Do not set wgTrustedXffFile (T298243) (duration: 00m 51s)
[19:52:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:52:54] <stashbot>	 T298243: Finish removal of wgTrustedXffFile  - https://phabricator.wikimedia.org/T298243
[19:53:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:53:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:53:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:53:24] <wikibugs>	 (CR) Urbanecm: [C: +2] Remove trusted-xff.php from wmf-config [mediawiki-config] - https://gerrit.wikimedia.org/r/749735 (https://phabricator.wikimedia.org/T298243) (owner: Urbanecm)
[19:53:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:54:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:54:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:54:59] <wikibugs>	 (Merged) jenkins-bot: Remove trusted-xff.php from wmf-config [mediawiki-config] - https://gerrit.wikimedia.org/r/749735 (https://phabricator.wikimedia.org/T298243) (owner: Urbanecm)
[19:58:39] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized docroot/noc/: 11498603a918863c08300b4abfc69491424ebe14: Remove trusted-xff.php from wmf-config (T298243; 1/3) (duration: 00m 50s)
[19:58:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:58:43] <stashbot>	 T298243: Finish removal of wgTrustedXffFile  - https://phabricator.wikimedia.org/T298243
[19:59:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:59:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:59:56] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/: 11498603a918863c08300b4abfc69491424ebe14: Remove trusted-xff.php from wmf-config (T298243; 2/3) (duration: 00m 51s)
[20:00:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:04] <jouncebot>	 brennen and jeena: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220127T2000).
[20:00:27] <brennen>	 o/
[20:00:43] <urbanecm>	 brennen: just a last sync-file remaining
[20:00:43] <urbanecm>	 sorry!
[20:00:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[20:00:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[20:00:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:01:49] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized phpcs.xml: 11498603a918863c08300b4abfc69491424ebe14: Remove trusted-xff.php from wmf-config (T298243; 3/3) (duration: 00m 50s)
[20:01:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:01:55] <brennen>	 urbanecm: no rush!
[20:01:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[20:02:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:02:02] <urbanecm>	 brennen: I'm done now :)
[20:02:35] <brennen>	 cool, thanks.  rolling train.
[20:03:12] <urbanecm>	 thanks
[20:03:22] <brennen>	 !log train 1.38.0-wmf.19 (T293960): no current blockers; logs clean-ish, rolling train forward to group2
[20:03:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:29] <stashbot>	 T293960: 1.38.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T293960
[20:04:10] <wikibugs>	 (PS1) Brennen Bearnes: all wikis to 1.38.0-wmf.19  refs T293960 [mediawiki-config] - https://gerrit.wikimedia.org/r/757738
[20:04:12] <wikibugs>	 (CR) Brennen Bearnes: [C: +2] all wikis to 1.38.0-wmf.19  refs T293960 [mediawiki-config] - https://gerrit.wikimedia.org/r/757738 (owner: Brennen Bearnes)
[20:04:54] <wikibugs>	 (Merged) jenkins-bot: all wikis to 1.38.0-wmf.19  refs T293960 [mediawiki-config] - https://gerrit.wikimedia.org/r/757738 (owner: Brennen Bearnes)
[20:05:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298559)', diff saved to https://phabricator.wikimedia.org/P19525 and previous config saved to /var/cache/conftool/dbconfig/20220127-200523-marostegui.json
[20:05:29] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance
[20:05:31] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance
[20:05:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:05:35] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[20:05:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T298559)', diff saved to https://phabricator.wikimedia.org/P19526 and previous config saved to /var/cache/conftool/dbconfig/20220127-200535-marostegui.json
[20:05:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:05:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:05:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:19] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.19  refs T293960
[20:06:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298559)', diff saved to https://phabricator.wikimedia.org/P19527 and previous config saved to /var/cache/conftool/dbconfig/20220127-200641-marostegui.json
[20:06:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[20:07:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:36] <wikibugs>	 (PS1) Majavah: openstack: remove few more python 2 packages [puppet] - https://gerrit.wikimedia.org/r/757739 (https://phabricator.wikimedia.org/T300254)
[20:08:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[20:08:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[20:08:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:09:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[20:09:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:06] <wikibugs>	 SRE: Issue installing ca-certificates-java - https://phabricator.wikimedia.org/T300300 (colewhite)
[20:11:31] <wikibugs>	 SRE: Issue installing ca-certificates-java openjdk 11 - https://phabricator.wikimedia.org/T300300 (colewhite)
[20:11:55] <wikibugs>	 SRE: Issue installing ca-certificates-java openjdk 11 - https://phabricator.wikimedia.org/T300300 (colewhite)
[20:13:03] <wikibugs>	 (PS1) Ssingh: site: add role for durum hosts in drmrs [puppet] - https://gerrit.wikimedia.org/r/757741 (https://phabricator.wikimedia.org/T300158)
[20:15:38] <wikibugs>	 (PS2) Ryan Kemper: elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700
[20:17:32] <wikibugs>	 (CR) jerkins-bot: [V: -1] elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[20:17:53] <wikibugs>	 (PS1) Majavah: backy2: don't install python3-crypto in bullseye [puppet] - https://gerrit.wikimedia.org/r/757742 (https://phabricator.wikimedia.org/T300254)
[20:21:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P19528 and previous config saved to /var/cache/conftool/dbconfig/20220127-202145-marostegui.json
[20:21:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:30] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to AQS Cassandra cluster for Frances Goodwin - https://phabricator.wikimedia.org/T299688 (FGoodwin) I'm set up, thanks so much!
[20:27:46] <wikibugs>	 (CR) Andrew Bogott: [C: +2] backy2: don't install python3-crypto in bullseye [puppet] - https://gerrit.wikimedia.org/r/757742 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[20:29:11] <wikibugs>	 (CR) Michael DiPietro: [C: +1] openstack: remove few more python 2 packages [puppet] - https://gerrit.wikimedia.org/r/757739 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[20:32:16] <wikibugs>	 (PS3) Ryan Kemper: elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700
[20:32:38] <wikibugs>	 (CR) Ryan Kemper: "Removed the befores" [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[20:36:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P19529 and previous config saved to /var/cache/conftool/dbconfig/20220127-203650-marostegui.json
[20:36:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:34] <wikibugs>	 (PS1) Michael DiPietro: upgrade codfw1dev to bullseye [puppet] - https://gerrit.wikimedia.org/r/757745 (https://phabricator.wikimedia.org/T300254)
[20:37:56] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to AQS Cassandra cluster for Frances Goodwin - https://phabricator.wikimedia.org/T299688 (jhathaway) Open→Resolved great!
[20:38:04] <wikibugs>	 (CR) Andrew Bogott: [C: +1] upgrade codfw1dev to bullseye [puppet] - https://gerrit.wikimedia.org/r/757745 (https://phabricator.wikimedia.org/T300254) (owner: Michael DiPietro)
[20:39:14] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[20:39:41] <wikibugs>	 SRE-OnFire (FY2021/2022-Q2): 2021-10-29 graphite - https://phabricator.wikimedia.org/T295157 (lmata)
[20:39:54] <wikibugs>	 (CR) Michael DiPietro: [C: +2] openstack: remove few more python 2 packages [puppet] - https://gerrit.wikimedia.org/r/757739 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[20:40:23] <wikibugs>	 (Abandoned) Michael DiPietro: upgrade codfw1dev to bullseye [puppet] - https://gerrit.wikimedia.org/r/757745 (https://phabricator.wikimedia.org/T300254) (owner: Michael DiPietro)
[20:40:26] <wikibugs>	 SRE, SRE-OnFire (FY2021/2022-Q2), Sustainability (Incident Followup): Incident: 2021-12-03 mx2001->Gmail delivery issues - https://phabricator.wikimedia.org/T297127 (lmata)
[20:46:12] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 680001474560 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[20:51:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298559)', diff saved to https://phabricator.wikimedia.org/P19530 and previous config saved to /var/cache/conftool/dbconfig/20220127-205155-marostegui.json
[20:51:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:01] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[21:03:31] <wikibugs>	 SRE: Issue installing ca-certificates-java openjdk 11 - https://phabricator.wikimedia.org/T300300 (hnowlan) More context for this issue in T289694
[21:19:22] <wikibugs>	 (PS1) Volans: management: remove deprecated module [software/spicerack] - https://gerrit.wikimedia.org/r/757747
[21:26:13] <jinxer-wm>	 (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245  - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[21:36:13] <jinxer-wm>	 (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245  - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[21:41:57] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0)
[21:42:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:22] <wikibugs>	 (PS4) JHathaway: [WIP] team-sre: add hardware-related checks [alerts] - https://gerrit.wikimedia.org/r/757489 (https://phabricator.wikimedia.org/T294564) (owner: Volans)
[22:01:23] <wikibugs>	 (CR) JHathaway: "Filippo would you kindly take a look at the reworked alert config" [alerts] - https://gerrit.wikimedia.org/r/757489 (https://phabricator.wikimedia.org/T294564) (owner: Volans)
[22:04:26] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[22:52:22] <wikibugs>	 SRE, Cloud-VPS, cloud-services-team (Kanban): prometheus-rabbitmq-exporter for Debian Bullseye - https://phabricator.wikimedia.org/T300308 (Andrew)
[22:57:30] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol2001-dev.wikimedia.org with OS buster
[22:57:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:31] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2001-dev.wikimedia.org with OS buster
[23:06:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:01] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol2001-dev.wikimedia.org with OS buster
[23:07:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:51] <wikibugs>	 (PS1) Bking: wcqs: populate journal var to fix puppet failures [puppet] - https://gerrit.wikimedia.org/r/757769 (https://phabricator.wikimedia.org/T300310)
[23:13:30] <wikibugs>	 (PS2) Ryan Kemper: wcqs: populate journal var to fix puppet failure [puppet] - https://gerrit.wikimedia.org/r/757769 (https://phabricator.wikimedia.org/T300310) (owner: Bking)
[23:17:01] <wikibugs>	 (PS3) Bking: wdqs: populate journal var to fix puppet failure [puppet] - https://gerrit.wikimedia.org/r/757769 (https://phabricator.wikimedia.org/T300310)
[23:17:03] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[23:18:02] <wikibugs>	 (PS4) Bking: wdqs: populate journal var to fix puppet failure [puppet] - https://gerrit.wikimedia.org/r/757769 (https://phabricator.wikimedia.org/T300310)
[23:20:41] <wikibugs>	 (PS5) Bking: wdqs: populate journal var to fix puppet failure [puppet] - https://gerrit.wikimedia.org/r/757769 (https://phabricator.wikimedia.org/T300310)
[23:21:09] <wikibugs>	 (PS6) Bking: wdqs: populate journal var to fix puppet failure [puppet] - https://gerrit.wikimedia.org/r/757769 (https://phabricator.wikimedia.org/T300310)
[23:21:51] <wikibugs>	 (CR) Bking: "check-experimental" [puppet] - https://gerrit.wikimedia.org/r/757769 (https://phabricator.wikimedia.org/T300310) (owner: Bking)
[23:21:53] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2001-dev.wikimedia.org with OS buster
[23:21:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:09] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol2001-dev.wikimedia.org with OS bullseye
[23:22:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:23:04] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/757769 (https://phabricator.wikimedia.org/T300310) (owner: Bking)
[23:25:13] <jinxer-wm>	 (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245  - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[23:30:13] <jinxer-wm>	 (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245  - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[23:32:19] <wikibugs>	 (03PS7) 10Bking: wdqs: populate journal var to fix puppet failure [puppet] - 10https://gerrit.wikimedia.org/r/757769 (https://phabricator.wikimedia.org/T300310)
[23:32:42] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/757769 (https://phabricator.wikimedia.org/T300310) (owner: 10Bking)
[23:41:04] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] wdqs: populate journal var to fix puppet failure [puppet] - 10https://gerrit.wikimedia.org/r/757769 (https://phabricator.wikimedia.org/T300310) (owner: 10Bking)
[23:41:10] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] wdqs: populate journal var to fix puppet failure [puppet] - 10https://gerrit.wikimedia.org/r/757769 (https://phabricator.wikimedia.org/T300310) (owner: 10Bking)
[23:48:08] <wikibugs>	 (03PS1) 10Bking: wdqs: populate journal var to fix puppet failure [puppet] - 10https://gerrit.wikimedia.org/r/757772
[23:49:38] <wikibugs>	 (03PS2) 10Bking: wdqs: add missing hiera var for internal [puppet] - 10https://gerrit.wikimedia.org/r/757772 (https://phabricator.wikimedia.org/T300310)
[23:49:52] <wikibugs>	 (03PS3) 10Ryan Kemper: wdqs: add missing hiera var for internal [puppet] - 10https://gerrit.wikimedia.org/r/757772 (https://phabricator.wikimedia.org/T300310) (owner: 10Bking)
[23:49:55] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/757772 (https://phabricator.wikimedia.org/T300310) (owner: 10Bking)
[23:50:19] <wikibugs>	 (03PS4) 10Ryan Kemper: wdqs: add missing hiera var for internal [puppet] - 10https://gerrit.wikimedia.org/r/757772 (https://phabricator.wikimedia.org/T300310) (owner: 10Bking)
[23:50:54] <wikibugs>	 (03PS1) 10Cwhite: hiera: set domainrw to grafana-next-rw in codfw [puppet] - 10https://gerrit.wikimedia.org/r/757774 (https://phabricator.wikimedia.org/T282863)
[23:50:56] <wikibugs>	 (03PS1) 10Cwhite: graphite: add grafana-next-rw to cors origins [puppet] - 10https://gerrit.wikimedia.org/r/757775 (https://phabricator.wikimedia.org/T282863)
[23:51:00] <wikibugs>	 (03PS1) 10Cwhite: idp, grafana: configure grafana-next-rw for sso [puppet] - 10https://gerrit.wikimedia.org/r/757776 (https://phabricator.wikimedia.org/T282863)
[23:51:02] <wikibugs>	 (03PS1) 10Cwhite: hiera: add grafana-next-rw to grafana public_aliases [puppet] - 10https://gerrit.wikimedia.org/r/757777 (https://phabricator.wikimedia.org/T282863)
[23:51:04] <wikibugs>	 (03PS1) 10Cwhite: hiera: configure mapping and cache rules for grafana-next-rw [puppet] - 10https://gerrit.wikimedia.org/r/757778 (https://phabricator.wikimedia.org/T282863)
[23:51:53] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/757772 (https://phabricator.wikimedia.org/T300310) (owner: 10Bking)
[23:53:05] <wikibugs>	 (03PS1) 10Cwhite: wikimedia.org: add grafana-next-rw [dns] - 10https://gerrit.wikimedia.org/r/757780 (https://phabricator.wikimedia.org/T282863)
[23:57:00] <wikibugs>	 (03CR) 10Catrope: doc.wikimedia.org CSP: Also allow images from upload.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757049 (https://phabricator.wikimedia.org/T285570) (owner: 10Catrope)
[23:59:03] <wikibugs>	 (03CR) 10Clare Ming: [C: 03+1] Disable A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757735 (https://phabricator.wikimedia.org/T297924) (owner: 10Jdlrobson)