[04:36:28] DBA, DiscussionTools, Editing-team, Performance-Team, Patch-For-Review: Reduce parser cache retention temporarily for DiscussionTools - https://phabricator.wikimedia.org/T280605 (Marostegui)
[04:51:42] DBA, Data-Persistence-Backup, Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (Marostegui)
[04:53:08] DBA, Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (Marostegui) db1178 is clean
[04:58:27] DBA, DiscussionTools, Editing-team, Performance-Team, Patch-For-Review: Reduce parser cache retention temporarily for DiscussionTools - https://phabricator.wikimedia.org/T280605 (Marostegui)
[05:11:20] DBA, SRE, ops-codfw: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (Marostegui) @jcrespo can coordinate better the dbprov downtimes, I am swapping names there :)
[05:11:34] DBA, SRE, ops-codfw: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (Marostegui)
[05:15:53] DBA, Cognate, ContentTranslation, Growth-Team, and 9 others: Restart x1 database master (db1103) - https://phabricator.wikimedia.org/T281212 (Marostegui) All hosts silenced. Master binary's upgraded, waiting now to perform the restart at 06:00 AM UTC
[05:51:54] DBA, DiscussionTools, Editing-team, Performance-Team (Radar): Post-deployment: (partly) ramp parser cache retention back up - https://phabricator.wikimedia.org/T280604 (Marostegui)
[05:52:20] DBA, DiscussionTools, Editing-team, Performance-Team (Radar): Post-deployment: (partly) ramp parser cache retention back up - https://phabricator.wikimedia.org/T280604 (Marostegui) a:Marostegui→None Not assigning it to me specifically, as anyone could pick this up after the mitigation
[06:02:18] DBA: Upgrade 10.4.13 hosts to a higher version - https://phabricator.wikimedia.org/T279281 (Marostegui)
[06:02:38] DBA: Upgrade 10.4.13 hosts to a higher version - https://phabricator.wikimedia.org/T279281 (Marostegui)
[06:03:08] DBA, SRE, Wikimedia-Mailing-lists: db2135 crashed - https://phabricator.wikimedia.org/T278408 (Marostegui)
[06:03:14] DBA: Upgrade 10.4.13 hosts to a higher version - https://phabricator.wikimedia.org/T279281 (Marostegui) Open→Resolved All hosts have been upgraded
[06:03:51] DBA, Cognate, ContentTranslation, Growth-Team, and 9 others: Restart x1 database master (db1103) - https://phabricator.wikimedia.org/T281212 (Marostegui) Open→Resolved This was done. RO starts: 06:00:15 RO stops: 06:00:46 Total RO time: 31 seconds
[06:18:05] DBA, AbuseFilter, mariadb-optimizer-bug: Check whether `FORCE INDEX page_timestamp` is still needed in LazyVariableComputer.php - https://phabricator.wikimedia.org/T281579 (Marostegui) Open→Stalled This query is still filesorting on 10.1 and takes around 30 seconds to complete. ` root@PRODUCT...
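For context on the T281579 check above: whether a FORCE INDEX hint is still needed is usually decided by comparing the plan with and without it on the MariaDB version being tested. A minimal sketch, assuming a hypothetical query shape against the revision table and a placeholder rev_page value (the real query lives in AbuseFilter's LazyVariableComputer.php and is truncated in the comment above):

EXPLAIN SELECT rev_id FROM revision FORCE INDEX (page_timestamp)
  WHERE rev_page = 12345 ORDER BY rev_timestamp DESC LIMIT 100;  -- plan with the hint
EXPLAIN SELECT rev_id FROM revision
  WHERE rev_page = 12345 ORDER BY rev_timestamp DESC LIMIT 100;  -- plan without it

If the second plan already picks page_timestamp and shows no "Using filesort" on the newer version, the hint can likely be dropped; per the comment above, 10.1 still filesorts.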
[06:38:08] DBA, Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (Marostegui) db1178 is slowly being pooled into s8
[06:38:17] DBA, Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (Marostegui)
[06:49:24] DBA, Data-Persistence-Backup, Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (Marostegui) s2 sanitarium master db1074 has been replaced by db1156
[06:50:24] DBA, Data-Persistence-Backup, Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (Marostegui)
[06:53:49] DBA, decommission-hardware: decommission db1074.eqiad.wmnet - https://phabricator.wikimedia.org/T281959 (Marostegui)
[06:54:07] DBA, decommission-hardware: decommission db1074.eqiad.wmnet - https://phabricator.wikimedia.org/T281959 (Marostegui) Wait a few days to make sure its replacement (db1156) works fine.
[06:54:43] DBA, decommission-hardware: decommission db1074.eqiad.wmnet - https://phabricator.wikimedia.org/T281959 (Marostegui)
[06:54:46] DBA, Data-Persistence-Backup, Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (Marostegui)
[06:54:48] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[06:55:18] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[07:13:32] DBA, decommission-hardware, Patch-For-Review: decommission db1082.eqiad.wmnet - https://phabricator.wikimedia.org/T281794 (Marostegui)
[07:13:42] DBA, decommission-hardware, Patch-For-Review: decommission db1082.eqiad.wmnet - https://phabricator.wikimedia.org/T281794 (Marostegui) I have depooled this host
[08:05:30] I'm installing a new cumin host with buster and the role includes profile::mariadb::packages_client, which selects mariadb variants based on the distro
[08:05:53] what about bullseye, should I rebuild wmf-mariadb104-client for it?
[08:06:41] moritzm: yeah, let's do that
[08:06:52] moritzm: I can try to do it too, but definitely not this week
[08:07:08] not sure if jaime has done it already (i believe he was trying bullseye before)
[08:10:39] I'll check if anything is up on deneb, otherwise I'll build/import it
[08:11:24] moritzm: sure, I will try to get my bullseye environment ready and try it next week
[08:11:38] I will also ping jaime once he gets online, to see if he maybe already got a package
[08:25:28] jynus: I was chatting with moritzm earlier, and I was wondering if you ever built the wmf104 client package for bullseye, or was it 10.5?
[08:27:58] I don't think I ended up building any package, just downloading and testing it locally
[08:28:07] ah ok, good
[08:28:09] thanks :)
[08:28:22] do you want me to?
[08:29:01] no no, no worries
[08:37:27] DBA, SRE, ops-codfw: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (jcrespo) @Papaul dbprov2002 should be shut down carefully to make sure data is kept intact (I'd prefer to do so). Otherwise, it can be down for e.g. 1 day. Will it need IP changes done beforehand?...
[08:38:27] DBA, SRE, ops-codfw: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (jcrespo)
[08:46:44] were there any es restarts recently?
[08:47:29] from what I see from tendril, no
[08:48:13] nope
[08:49:04] all es backups failed yesterday
[08:49:13] both eqiad and codfw
[08:49:40] wow
[08:49:46] MySQL server has gone away
[08:49:47] definitely no restarts in either dc
[08:50:05] I wonder if we have found why transfer fails?
[08:50:13] unstable network
[08:50:42] mysql uptime also looks fine, so no mysql restarts either (ie: no crash)
[08:51:09] yeah, it would be very weird if it happened on all 4 servers
[08:51:56] let me give you the timestamps to rule out maintenance (which would be a good thing)
[08:52:13] if it fails for a good reason, I am not worried
[08:52:23] There was no maintenance on es servers that I know of
[08:52:45] codfw es4: 2021-05-04 07:58:21
[08:53:00] codfw es5: 2021-05-04 07:58:21
[08:53:10] the fact it is the same time there would point to network
[08:53:40] yeah
[08:53:52] jynus: do backups stop trying after one failure?
[08:53:53] can you check if there was any weirdness on the es hosts at that time? I am thinking it is possibly network on the backup side, but I want to rule out the mysql side
[08:54:51] kormat, yes and no
[08:55:09] yes because we don't have the space to store multiple temporary backups
[08:55:21] temporary as in "in generation"
[08:55:40] no because the next schedule will attempt backups again
[08:56:29] I am going to check networking on the generating hosts
[08:59:39] jynus: any specific host you want me to check?
[09:00:10] es2022 and es2025
[09:00:35] ok, going to take a look
[09:01:20] at 2021-05-04 07:58:21
[09:02:02] DBA, SRE, ops-codfw: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (elukey)
[09:04:13] I don't see anything weird, no tcp errors, no network at 0 bytes, it just stopped backing up
[09:04:24] other resources were not saturated
[09:04:37] jynus: they both look clean
[09:04:41] mysql-wise
[09:05:28] no kills?
[09:05:37] nope
[09:05:44] last log entry is from 13 april
[09:05:46] for kills
[09:07:24] there was a spike of lag at 8:04?
[09:08:11] and a spike of aborted clients - which means "the client stopped responding"
[09:08:15] I don't see that on es2022 or es2025
[09:08:59] please recheck: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es2022&var-port=9104&from=1620114490073&to=1620115805893
[09:09:07] maybe I am looking at it wrong
[09:09:56] the thing is, "aborted clients" doesn't tell us more than the backups log told us
[09:10:17] "Lost connection to MySQL server during query", which we already knew
[09:10:22] it doesn't tell us why
[09:10:31] are we sure that lag is real?
[09:10:42] even the graph itself says 0 seconds
[09:10:45] so maybe a graphing issue?
[09:11:02] real is likely, but it is a bit far from the timestamp we are interested in
[09:12:16] what do you mean? It says "2 seconds" for probably a single collection
[09:13:09] it just happens >5 minutes after the issue, so probably not related
[09:13:19] DBA, SRE, ops-codfw, serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (jijiki)
[09:13:44] the fact that it happened on 2 servers at the same time indicates, to me, network or the app at the client, not the server
[09:14:12] es2025 has no errors on its network iface
[09:14:19] so yeah, the servers themselves look ok
[09:14:50] which would be ok, but it happened on both datacenters - too frequently
[09:15:34] maybe we can check the switch graphs to see if the port got saturated or something?
[09:16:52] let me check eqiad, to see if there is something else there
[09:18:33] backup1002 failed at 2021-05-04 00:01:49
[09:18:39] again both dumps at the same time
[09:20:21] I will continue investigating on my own, will report here if I find something
[09:20:34] ok!
[09:20:52] thanks for checking mysql, I am now sure it is not mysql
[09:21:30] yeah, both hosts at the same time is very unlikely, it must be something on the client
[09:21:44] I see rsyslog failing at the same time
[09:22:35] "omkafka: action will suspended due to kafka error -195: Local: Broker transport failure"
[09:22:53] are both hosts connected to the same switch?
[09:23:11] this is on the same host
[09:26:08] I found nothing, but I came up with a theory for the random transfer.py failures
[09:26:54] there is a service that cleans up temporary files on buster - maybe that is interacting badly with the temporary files (locks and md5sum) created by transfer
[09:27:01] something to check at a later time
[09:27:46] (but only on long-running transfers)
[09:29:06] does it use /tmp?
[09:29:09] I mean transfer
[09:29:27] it uses some path, let me check
[09:30:38] yeah, it uses /tmp
[09:30:58] and maybe the behaviour changed from "delete on every reboot"
[09:31:07] to delete with the service every X hours
[09:31:10] maybe move it to /var/run ?
[09:31:15] to test, I mean
[09:32:44] yeah, that would be super easy
[09:33:07] I will add a note to the ticket as a potential trigger of the issue
[09:33:29] as I just happened to run into an unrelated log entry saying "running clean up of /tmp, etc."
[09:34:55] regarding this issue, I don't see ethernet, kernel or other relevant system logs
[09:35:07] other than rsyslog failures
[09:36:51] and that happens very frequently, so not really a clue
[09:47:19] yeah, all the logs look clean
[09:47:21] on both codfw hosts
[10:18:58] DBA, Cognate, ContentTranslation, Growth-Team, and 9 others: Restart x1 database master (db1103) - https://phabricator.wikimedia.org/T281212 (Trizek-WMF)
[10:19:08] DBA: transfer.py fails when copying data between es hosts - https://phabricator.wikimedia.org/T262388 (jcrespo) Another potential reason for errors is the service that cleans up temporary files (systemd-tmpfiles-clean.timer). transfer.py uses /tmp for a couple of reasons (locking, checksumming, xtrabackup te...
[10:51:01] DBA, Orchestrator: Cleanup heartbeat.heartbeat on s2 - https://phabricator.wikimedia.org/T281826 (Marostegui) Open→Resolved a:Marostegui This is all clean. Of course, once we switch the master we'll need to remove the old server_id for db1122 (171978786) before adding s2 to orchestrator
[10:51:03] DBA, Orchestrator: Cleanup heartbeat.heartbeat on all production instances - https://phabricator.wikimedia.org/T268336 (Marostegui)
[10:51:16] DBA, Orchestrator: Cleanup heartbeat.heartbeat on all production instances - https://phabricator.wikimedia.org/T268336 (Marostegui)
[11:08:23] DBA, wikitech.wikimedia.org: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 (Marostegui) I am working on the migration document, making it a lot more detailed and with actual commands. Once that looks good, I will try the procedure on our testi...
[11:09:27] DBA, wikitech.wikimedia.org: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 (Marostegui) a:Marostegui
[11:18:41] I have a transferpy package ready for bullseye, but it will depend on python3-wmfmariadbpy-remote
[11:24:39] same for the wmfbackups ones - I have uploaded them for now to apt1001:/home/jynus/bullseye CC moritzm
[12:13:44] jynus: hey. i'm looking at reimaging the candidate master for s6 in codfw to buster. i see that there is both a stretch and a buster backup source there. what coordination is required before i proceed? does https://gerrit.wikimedia.org/r/c/operations/puppet/+/681621 get merged before, or after?
[12:14:12] mm. well. it can't be before, i think. my guess is that it gets merged when the s6 _master_ (not candidate) master gets reimaged
[12:14:12] at any time, really
[12:14:45] ah, ok, nevermind me then :)
[12:14:46] it will mean that backups start to be taken from the buster hosts
[12:14:54] there is no hard dependency
[12:15:02] more like, whenever you think that's adequate
[12:15:16] that's the good thing about having the choice :-)
[12:15:56] but probably "around the same time of the switchover"
[12:16:21] from s6 master to s6 candidate master?
[12:16:25] or do you mean the dc switchover?
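For context on the es backup triage above (no restarts, no kills, a spike of aborted clients): those server-side checks map to standard status queries. A minimal sketch, assuming direct access to the generating es hosts:

SHOW GLOBAL STATUS LIKE 'Uptime';            -- a large value confirms no recent restart or crash
SHOW GLOBAL STATUS LIKE 'Aborted_clients';   -- increments when a client stops responding mid-connection
SHOW GLOBAL STATUS LIKE 'Aborted_connects';  -- failed connection attempts, for completeness
SHOW FULL PROCESSLIST;                       -- any dump connection still open would show up here

As noted in the conversation, Aborted_clients only restates what the backup log already said ("Lost connection to MySQL server during query") without explaining why the connection dropped.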
[12:16:44] sorry I wasn't clear, the master upgrade, I meant
[12:16:50] gotcha
[12:16:59] but again, it is not a hard dependency
[12:17:02] ok, in that case i'll go ahead with the candidate master upgrade today
[12:17:10] the idea is that, from the moment it is deployed
[12:17:20] we will generate primarily buster backups
[12:17:44] (the logical ones will still work for any os/version)
[12:20:58] Data-Persistence-Backup, Goal: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (jcrespo)
[12:22:44] there will be one further step on my side (in case you are documenting the list) which is getting rid of the -then- unused stretch instance
[12:23:11] and that can be at the end of all other steps - when we are 100% sure we will not revert anything (cleanup)
[12:24:06] I will add it to T280751
[12:24:08] T280751: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751
[12:28:48] DBA, Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (jcrespo)
[12:29:07] ^ feel free to improve on that, that is my best take for now
[12:29:16] DBA, Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (Kormat)
[12:29:36] DBA, Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (Kormat) Turns out the candidate master for s6/codfw (db2114) is already running buster/10.4.
[12:29:44] lol
[12:30:11] you are welcome!
[12:30:23] DBA, Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (jcrespo)
[13:15:41] DBA, Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin2001.codfw.wmnet for hosts: ` ['db2129.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20210505131...
[13:57:09] Blocked-on-schema-change, DBA: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486 (Marostegui) s6 is done, pending the master. It will be finished once we've completed the migration to 10.4 on T280751
[13:57:19] Blocked-on-schema-change, DBA: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360 (Marostegui) s6 is done, pending the master. It will be finished once we've completed the migration to 10.4 on T280751
[13:57:23] Blocked-on-schema-change, DBA: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392 (Marostegui) s6 is done, pending the master. It will be finished once we've completed the migration to 10.4 on T280751
[13:57:31] Blocked-on-schema-change, DBA: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486 (Marostegui)
[13:57:38] Blocked-on-schema-change, DBA: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360 (Marostegui)
[13:57:42] Blocked-on-schema-change, DBA: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392 (Marostegui)
[14:05:44] Blocked-on-schema-change, DBA: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392 (Marostegui)
[14:05:47] Blocked-on-schema-change, DBA: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360 (Marostegui)
[14:05:50] Blocked-on-schema-change, DBA: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486 (Marostegui)
[14:06:35] Blocked-on-schema-change, DBA: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392 (Marostegui)
[14:06:37] Blocked-on-schema-change, DBA: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360 (Marostegui)
[14:06:41] Blocked-on-schema-change, DBA: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486 (Marostegui)
[14:19:01] DBA, Orchestrator, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui) s8 is fully done apart from the master (db1104)
[14:35:58] DBA, Orchestrator: Cleanup heartbeat.heartbeat on s5 - https://phabricator.wikimedia.org/T281828 (Marostegui) Open→Resolved a:Marostegui This is all clean. Of course, once we switch the master we'll need to remove the old server_id for db1100 (171974853) before adding s5 to orchestrator
[14:36:00] DBA, Orchestrator: Cleanup heartbeat.heartbeat on all production instances - https://phabricator.wikimedia.org/T268336 (Marostegui)
[14:36:07] DBA, Orchestrator: Cleanup heartbeat.heartbeat on all production instances - https://phabricator.wikimedia.org/T268336 (Marostegui)
[14:49:31] DBA, Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2129.codfw.wmnet'] ` and were **ALL** successful.
[14:57:26] DBA, Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (Kormat)
[14:58:05] jynus: i think https://gerrit.wikimedia.org/r/c/operations/puppet/+/681621 can be merged now. the s6 master in codfw is now buster.
[14:58:22] cool, then, thanks for your work!
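For context on the heartbeat.heartbeat cleanups above (T281826, T281828): once a host is no longer a section master, its stale pt-heartbeat row has to be removed so orchestrator does not pick up the old entry. A minimal sketch, assuming the standard pt-heartbeat table layout and using the old server_id quoted in the s5 comment above; the DELETE would only run after the master switch:

SELECT server_id, ts, file, position FROM heartbeat.heartbeat ORDER BY ts;  -- inspect which rows are stale first
DELETE FROM heartbeat.heartbeat WHERE server_id = 171974853;                -- old db1100 row, per T281828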
[14:58:30] doing now
[14:59:36] marostegui: db2095:s6 is broken
[14:59:43] yep
[14:59:47] it is me cleaning up heartbeat
[14:59:55] ah ok, grand :)
[15:03:02] it is now fixed
[15:03:55] it is such a pain to clean up the table
[15:04:00] sooo easy to screw it up
[15:04:26] and it's hard to blame me when you do
[15:04:29] inconvenient, i'm sure
[15:08:16] so by reviewing s5 on codfw, you can see that technically there is still a 10.1 instance (db2097), but that is just pending for me to remove (it is passive)
[15:08:19] sorry
[15:08:21] I meant s6
[15:09:29] One thing I could probably improve at some point is not having to manually indicate where to take backups from, but make the software discover/decide smartly
[15:10:56] some advanced machine learning algorithm such as "while (stretch master) {backup from stretch} else {backup from buster}"
[15:12:56] DBA, Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (jcrespo)
[15:19:38] DBA, Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (jcrespo)
[15:20:04] DBA, Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (jcrespo) ^I've prepared the backup failover for eqiad :-)
[15:37:44] I tried to build a wmf-mariadb104-client package for Bullseye
[15:38:56] it seems to work, but the base path is a little different
[15:39:42] it's /opt/mariadb-10.4.18-linux-systemd-x86_64 for me, while the postinst seems to expect the alternative without the systemd part
[15:40:04] or should I have downloaded the sysvinit variant?
[15:40:12] mmm
[15:41:08] it's simple to fix, just wondering if I'm on the wrong path
[15:41:10] I don't remember there being 2 versions before, just 1 with 2 compilation options
[15:42:20] the current https://mariadb.org/download/ lets me pick the init system and select between systemd and sysvinit
[15:43:03] that's new
[15:43:20] or it is the mariadb.com stuff, not the foundation one, not sure
[15:43:37] oh, I see
[15:43:49] I think you are bundling a compiled version?
[15:44:03] we use the source one
[15:44:47] DBA, Orchestrator: Cleanup heartbeat.heartbeat on all production instances - https://phabricator.wikimedia.org/T268336 (Marostegui)
[15:44:49] DBA, Orchestrator: Cleanup heartbeat.heartbeat on s6 - https://phabricator.wikimedia.org/T281829 (Marostegui) Open→Resolved a:Marostegui This is all clean. Of course, once we switch the master we'll need to remove the old server_id for db1131 (171974662) before adding s6 to orchestrator
[15:45:06] DBA, Orchestrator: Cleanup heartbeat.heartbeat on all production instances - https://phabricator.wikimedia.org/T268336 (Marostegui)
[15:45:17] we don't really patch much, but we disable a lot of stuff (extra engines)
[15:45:46] for the client I don't think it matters much
[15:47:11] btw, ruwikinews watchlist is now 1% of its original size, it needs shrinking if you want to, it'll free up a couple of gigabytes on every host (the table was among the largest watchlist tables across the fleet https://phabricator.wikimedia.org/P15523)
[15:59:32] on the not-so-bright side: with abstracting the user table we now have +4K drifts reported. With revision it'll go higher
[16:10:02] I hope we can clean them up during the dc switchover
[16:35:29] moritzm, I've left some completely untested 10.5 packages on apt1001:/home/jynus/bullseye
[16:46:02] jynus: <3
[16:46:56] will send patches, but those are not intended for deploy or to replace your work, just some ongoing testing that didn't take much time away from the backup testing I was already doing
[16:48:48] there are some changes I don't know about on the server - column store, etc.
[16:55:24] ah I see, 10.5 for bullseye, not 10.4 as moritzm was testing
[16:56:04] yeah, I was working with 10.5 as you said it was ok
[16:56:26] sure sure :)
[16:56:38] but as I said you can install what you want on cumin
[16:57:08] we should probably go for 10.4 in bullseye for now I think, as we haven't even finished migrating to it :(
[16:57:13] sure
[16:57:49] although now, without multisource, migrations should be less painful
[16:57:59] as multisource was a source of unknowns
[17:14:53] also the reason was that I thought the ongoing work on the bullseye cumin host was mostly for testing reasons
[19:14:59] PROBLEM - MariaDB sustained replica lag on pc2009 is CRITICAL: 4.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[19:17:25] RECOVERY - MariaDB sustained replica lag on pc2009 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
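For context on the ruwikinews watchlist shrink mentioned at 15:47: deleting 99% of the rows does not return the space to the filesystem until the InnoDB table is rebuilt. A minimal sketch of that rebuild, assuming file-per-table tablespaces and a host-by-host approach on depooled replicas rather than replicating the rebuild (the method and timing are an assumption, not something stated in the conversation):

USE ruwikinews;
SET SESSION sql_log_bin = 0;     -- keep the rebuild local to this replica
SELECT COUNT(*) FROM watchlist;  -- sanity check after the purge
OPTIMIZE TABLE watchlist;        -- InnoDB maps this to a recreate + analyze, shrinking the .ibd file
-- equivalent form: ALTER TABLE watchlist ENGINE=InnoDB, FORCE;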