[00:28:27] 10DBA, 10SRE, 10Wikimedia-Mailing-lists, 10Schema-change: Mailman3 schema change: change utf8 columns to utf8mb4 - https://phabricator.wikimedia.org/T282621 (10Legoktm) [04:35:03] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10Marostegui) db1121 is clean, replication restarted. [04:35:15] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10Marostegui) [04:41:03] 10DBA, 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Delete lists-next.wikimedia.org - https://phabricator.wikimedia.org/T281548 (10Marostegui) Databases and users dropped. The DB side of things complete. [04:45:12] 10DBA, 10Patch-For-Review: Move db2108 from s2 to s7 - https://phabricator.wikimedia.org/T282535 (10Marostegui) transfer from db2121 to db2108 started. [04:56:52] 10DBA, 10SRE, 10Wikimedia-Mailing-lists, 10Schema-change, 10User-notice: Mailman3 schema change: change utf8 columns to utf8mb4 - https://phabricator.wikimedia.org/T282621 (10Ladsgroup) Wouldn't a message in tech news be enough? 
[05:03:24] 10DBA, 10Data-Services, 10decommission-hardware, 10cloud-services-team (Kanban): decommission labsdb1011.eqiad.wmnet - https://phabricator.wikimedia.org/T282524 (10Marostegui) mysql stopped [05:03:27] 10DBA, 10Data-Services, 10decommission-hardware, 10cloud-services-team (Kanban): decommission labsdb1009.eqiad.wmnet - https://phabricator.wikimedia.org/T282522 (10Marostegui) mysql stopped [05:03:31] 10DBA, 10Data-Services, 10decommission-hardware, 10cloud-services-team (Kanban): decommission labsdb1010.eqiad.wmnet - https://phabricator.wikimedia.org/T282523 (10Marostegui) mysql stopped [05:03:50] 10DBA, 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Delete lists-next.wikimedia.org - https://phabricator.wikimedia.org/T281548 (10Legoktm) 05Open→03Resolved a:03Legoktm All done then! [05:25:06] 10Blocked-on-schema-change, 10DBA: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486 (10Marostegui) [05:25:17] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360 (10Marostegui) [05:25:43] 10Blocked-on-schema-change, 10DBA: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392 (10Marostegui) [05:30:27] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1074.eqiad.wmnet - https://phabricator.wikimedia.org/T281959 (10Marostegui) [05:36:34] Hi, we have some index hints from 2005 that I feel are not needed anymore. Is there a way to test it besides removing the hint and hoping for the best? 
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/684292/1/api/ApiQueryMessageTranslations.php [05:38:31] testing the query itself [05:44:22] 10DBA, 10DiscussionTools, 10Editing-team, 10Performance-Team, and 2 others: Reduce parser cache retention temporarily for DiscussionTools - https://phabricator.wikimedia.org/T280605 (10Marostegui) Progress: 22.68% [05:46:54] 10DBA, 10Analytics, 10Event-Platform, 10WMF-Architecture-Team, 10Services (later): Consistent MediaWiki state change events | MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (10Marostegui) >>! In T120242#7069904, @Ottomata wrote: > Interesting thanks! So brainstorming how... [06:20:56] 10DBA: Move db2108 from s2 to s7 - https://phabricator.wikimedia.org/T282535 (10Marostegui) The transfer is done and the host is now in s7. Leaving this task open until the check tables is finished on both db2121 and db2108 [06:25:43] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10Marostegui) db1121 is being automatically repooled [06:33:53] 10Data-Persistence-Backup: Setup backup1003 and backup2003 as the storage location for es bacula backups - https://phabricator.wikimedia.org/T282249 (10jcrespo) backup1002 and backup2002 backups are still running, with, as expected, backup1002 being quite behind. Nicely this extra work doesn't affect other back... 
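A minimal sketch for the index-hint question at 05:36:34, which was answered with "testing the query itself": one way to do that is to run EXPLAIN on the same query with and without the hint on a replica and compare the plans. The helper below only builds the two statements. The function names, table and index are hypothetical and are not taken from the Translate patch, and the regex is naive (it does not understand string literals or comments).

```python
import re
from typing import Tuple

# Matches MySQL/MariaDB index hints such as "FORCE INDEX (idx)" or
# "USE INDEX FOR ORDER BY (a, b)". Naive: it does not skip string literals.
_HINT_RE = re.compile(
    r"\s+(?:USE|FORCE|IGNORE)\s+(?:INDEX|KEY)"
    r"(?:\s+FOR\s+(?:JOIN|ORDER\s+BY|GROUP\s+BY))?"
    r"\s*\([^)]*\)",
    re.IGNORECASE,
)


def strip_index_hints(sql: str) -> str:
    """Return the query with every index hint clause removed."""
    return _HINT_RE.sub("", sql)


def explain_pair(sql: str) -> Tuple[str, str]:
    """Build EXPLAIN statements for the hinted and the unhinted query."""
    return "EXPLAIN " + sql, "EXPLAIN " + strip_index_hints(sql)


if __name__ == "__main__":
    # Hypothetical query; replace with the real one from the code under review.
    query = (
        "SELECT ex_id FROM example_table "
        "FORCE INDEX (example_idx) WHERE ex_page = 1"
    )
    for statement in explain_pair(query):
        print(statement)
```

If both plans pick the same index and estimate similar row counts, the hint is probably redundant. Running the unhinted query against production-sized data is still the stronger check.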
[06:59:24] 10DBA, 10decommission-hardware: decommission db1079.eqiad.wmnet - https://phabricator.wikimedia.org/T282079 (10Marostegui) Host depooled [07:01:34] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1079.eqiad.wmnet - https://phabricator.wikimedia.org/T282079 (10Marostegui) [07:06:10] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1079.eqiad.wmnet - https://phabricator.wikimedia.org/T282079 (10Marostegui) [07:06:34] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1079.eqiad.wmnet - https://phabricator.wikimedia.org/T282079 (10Marostegui) [07:29:45] how different is mysql 5.6/5.7 from the MariaDB that we run (10.4 I gather)? would it be beneficial for me to ask Mailman3 upstream to enable CI against some MariaDB version? [07:46:22] none of the issues you asked about were related to mariadb vs mysql [07:46:37] it was due to relying on very old versions of the database [07:49:33] e.g. current stable mysql version has utf8mb4_0900_ai_ci as the default collation [07:50:41] that is, since 2018 [08:37:23] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1074.eqiad.wmnet - https://phabricator.wikimedia.org/T281959 (10Marostegui) [08:53:20] 10DBA, 10Patch-For-Review: Switchover s6 from db1131 to db1173 - https://phabricator.wikimedia.org/T282124 (10Kormat) [08:54:25] 10DBA, 10Patch-For-Review: Switchover s6 from db1131 to db1173 - https://phabricator.wikimedia.org/T282124 (10Kormat) [08:57:19] 10DBA, 10Patch-For-Review: Switchover s6 from db1131 to db1173 - https://phabricator.wikimedia.org/T282124 (10Kormat) [09:15:24] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1074.eqiad.wmnet - https://phabricator.wikimedia.org/T281959 (10Marostegui) [09:17:43] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1074.eqiad.wmnet - https://phabricator.wikimedia.org/T281959 (10Marostegui) [09:20:06] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission 
db1074.eqiad.wmnet - https://phabricator.wikimedia.org/T281959 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db1074.eqiad.wmnet` - db1074.eqiad.wmnet (**PASS**) - Downtimed host on Icinga... [09:20:51] 10DBA, 10decommission-hardware, 10ops-eqiad: decommission db1074.eqiad.wmnet - https://phabricator.wikimedia.org/T281959 (10Marostegui) [09:20:56] 10DBA, 10decommission-hardware, 10ops-eqiad: decommission db1074.eqiad.wmnet - https://phabricator.wikimedia.org/T281959 (10Marostegui) This is ready for #dc-ops [09:21:50] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [10:00:26] 10DBA, 10Data-Services, 10decommission-hardware, 10cloud-services-team (Kanban): decommission labsdb1011.eqiad.wmnet - https://phabricator.wikimedia.org/T282524 (10Marostegui) [10:00:29] 10DBA, 10Data-Services, 10decommission-hardware, 10cloud-services-team (Kanban): decommission labsdb1010.eqiad.wmnet - https://phabricator.wikimedia.org/T282523 (10Marostegui) [11:28:22] moritzm: +1ed https://gerrit.wikimedia.org/r/c/operations/puppet/+/689761 I don't expect issues, but just fyi, I am out tomorrow and friday, so I won't be monitoring this cc jynus (in case you notice something on tendril) [11:38:51] one thing before you go [11:39:19] I saw some issues on cumin2002 with the packages [11:40:27] mysql.py returns "ModuleNotFoundError: No module named 'wmfmariadbpy.dbutil'" [11:42:02] and wmf-mariadb104-client had some problems, not sure if solved already [11:43:44] not asking for a fix as it is unlikely it will be needed this week, but is there something we should do about it? [11:44:12] which issues had the 10.4 package? [11:44:23] it seemed to work fine on my env [11:44:25] I think it couldn't be installed [11:44:41] really? 
that's odd [11:44:48] The following packages have unmet dependencies: [11:44:53] wmf-mariadb104-client : Depends: libreadline5 but it is not installable [11:45:09] mmmm I wonder why it worked on my env [11:45:18] is it an upgraded buster? [11:45:24] maybe it retained old packages [11:45:25] no, it is a bullseye [11:45:27] from scratch [11:45:53] let me check if I changed that for 10.5 or stayed the same [11:46:03] https://packages.debian.org/bullseye/libreadline8 [11:46:06] marostegui: bear in mind that bullseye is still debian _testing_. so it's likely to be in a state of flux. [11:46:08] maybe it needs to be libreadline8? [11:46:22] kormat: that too, indeed [11:46:26] I did a netinstall [11:46:41] I linked it to that: https://gerrit.wikimedia.org/r/c/operations/software/+/685524/2/dbtools/control-mariadb-client-10.5 [11:46:41] (does make me wonder why we're testing an unreleased distro) [11:46:56] kormat, not my choice [11:47:09] but apparently they are going to deploy that in weeks [11:47:14] and I don't want backups to stop [11:47:46] jynus: will change that next week then and recompile the package [11:48:08] and remove existing cumin2001, I was told [12:04:27] jynus: I am compiling the new 10.4 client for bullseye [12:05:02] as I said, I didn't need a patch, just wanted your opinion before you went away [12:05:14] I can wait until monday [12:06:15] I am still not leaving for the day :) [12:06:21] ok :-) [12:06:51] and as I said before, the 10.5 was before I knew this was soon to be in production [12:07:21] maybe we should wait before merging: https://gerrit.wikimedia.org/r/c/operations/puppet/+/665324 [12:19:02] I left the new client package at apt1001:/home/marostegui/bullseye/10.4 [12:26:02] jynus: the CR you linked is mine? 
it's waiting for the wmfmariadbpy 0.7 release next week [12:26:19] (hence my -2 on it) [12:26:20] sorry [12:26:23] wrong link [12:26:37] I meant to paste: https://gerrit.wikimedia.org/r/c/operations/puppet/+/686393 [12:26:54] the one upgrading 10.5 client, I don't think we should do that yet [12:27:25] jynus: I think we can deploy that for now for testing and later revert and install to 10.4 once we are happy about 10.4 package [12:27:32] I don't want to block moritzm on that sense [12:27:47] that is basically what I wanted to ask you before you went away [12:28:20] we could either upload the new 10.4 package or upgrade to 10.5 [12:28:20] see my last comment on that patch, I think we can move forward with that if that sounds good [12:28:48] up to you, :-) I really want you to decide - we can always revert [12:28:58] or change mind [12:29:00] then let's stick to the +1 :) [12:29:17] once we are happy with the 10.4 package we can change that (or if we see issues when connecting from 10.5 client to 10.4 server) [12:29:31] the thing is, I haven't tested 10.5 either [12:29:45] for all I know, it could have the same issues [12:29:55] but we can merge and see! 
[12:30:16] if you want to try the last package I uploaded maybe that'll solve the issue you saw before [12:30:18] I will ask moritz, puppet may be disabled now [12:30:39] ok, so at the very least we will have both options in case one fails [12:32:09] if it were up to me, I would delay the whole cumin2002 thing, seems rushed to me [12:32:23] +1 yes [12:32:44] it is a lot of untested packages on my side, and probably on yours too [12:32:59] all of them are untested: ) [12:33:30] and my packages are not great to start [12:34:48] I'd prefer to get rid of stretch first [12:34:57] before thinking about bullseye [12:36:01] There is also the situation of bullseye + 10.4 or bullseye + 10.5, and I'd rather go first for bullseye + 10.4 cause it is one less thing to test [12:36:08] +1 [12:36:26] again, I was playing with 10.5 only because it was for the far future [12:37:18] (and technically what you told me to test with, but thinking in a year's time or so, not now) [12:37:35] and for xtrabackup, not for the server [12:38:20] yeah, I asked you to test 10.5 for the future, as we have 10.4 pretty much everywhere :) [12:38:26] so better to get some 10.5 testing ahead :) [12:38:46] and that is when I stressed about cumin2002 0:-) [12:39:07] maybe we can push back a little? [12:39:47] e.g. maybe cumin2002 could be setup but keep backups on cumin2001 [12:39:54] and the mysql stuff [12:40:00] +1 yes [12:40:07] It is too early to remove cumin2001 yes [12:40:40] that way we will have time to test, and actually wait for a proper distro release [12:41:22] (this is exactly what I wanted to discuss with you, not making you update packages in a rush) [12:41:28] I just saw moritzm email :) [12:41:51] Buf, I think replacing also cumin1001 "soon" is definitely way too fast for us [12:42:12] I wouldn't like cumin2001 to be replaced before the switchover to be honest [12:42:18] if it is hw related, maybe we can setup a temporary cumin2003 for us? 
[12:42:31] that's another host to maintain :( [12:42:52] don't know, I think this needs some broader conversation with moritzm to see what are the expectations and timelines [12:42:58] This is way too fast for us [12:43:15] "I am going to build a new cumin host, and foundations is going to pay for it" [12:43:29] aka maintain? [12:44:01] I think we need to involve foundations first and see what their thoughts are [12:44:06] sure [12:44:12] maybe they are ok delaying this, especially for 1001 [12:44:14] I am just thinking of alternatives [12:44:24] going to add this point to our weekly for monday [12:44:28] so we can discuss as a team [12:44:36] thank you [12:55:35] marostegui: I can relay the question to our team meeting that is later today. I don't think there is any particular rush for cumin1001, and that it will wait the official release of bullseye anyway. Cumin2001 too can live a little longer if necessary IMHO. [12:56:21] 10DBA, 10SRE, 10Traffic: dbtree.wm.o stopped working after enforcing Puppet CA issued certs for ATS backend origin servers - https://phabricator.wikimedia.org/T282531 (10jbond) 05Open→03Resolved a:05Vgutierrez→03jbond This is working again [12:57:41] added question to the pad, will let you know later [12:58:48] volans: thanks, I added it to our weekly on monday. But definitely we need more time before retiring cumin1001 (and I would like to wait for cumin2001 so it is done after the switchover, as we move to cumin2XXX once we are in codfw as active) [12:59:10] ack [13:00:00] 10DBA, 10SRE, 10Traffic: dbtree.wm.o stopped working after enforcing Puppet CA issued certs for ATS backend origin servers - https://phabricator.wikimedia.org/T282531 (10Marostegui) Thank you both! [13:00:00] for me it helps to have both, to test properly backups under the new env [13:00:07] jynus: i see the problem with the python-wmfmariadbpy package on bullseye. i'll need to figure out what the right solution is there. 
[13:00:35] thanks, kormat but hopefully we get the flexibility we need [13:00:48] that's orthogonal :) [13:00:49] to not having to rush it [13:00:53] yep [13:01:05] but the last thing I wanted was to stress you over it [13:01:15] (because I got stressed) [13:01:35] there's no way i'm getting stressed about trying to support an unreleased version of debian. :) [13:01:41] he he [13:02:04] volans: dbtree was fixed by john and valentin :) [13:02:53] marostegui: thanks! great [13:19:44] 10DBA, 10Analytics, 10Event-Platform, 10WMF-Architecture-Team, 10Services (later): Consistent MediaWiki state change events | MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (10Ottomata) Right, I understand that the extra maintenance this would cause could be too onerous t... [13:45:22] m5 traffic here is adorable compared to core dbs https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=misc&var-shard=All&var-role=All&from=now-30d&to=now [13:45:37] but you can see the effect of mailman3 launch [13:46:13] (it'll stabilize once we upgrade all mailing lists, it's mostly done though) [13:50:21] Amir1: it made the m5 hosts wake up and have to work: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=now-30d&to=now&var-site=eqiad&var-group=misc&var-shard=m5&var-role=All [13:51:01] lazy m5 [13:53:39] it actually made the backups alarm because of unusual data growth. I guess that will happen until the end of the migration. [13:55:04] mysql.py now works on cumin2002. [13:59:17] jynus: are there any size numbers so far? 
[14:04:03] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104 [14:09:39] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104 [14:17:11] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104 [14:17:45] thanks kormat, although probably not at 100%, because of the last email I sent [14:18:11] Amir1: Last dump for m5 at codfw (db2078.codfw.wmnet:3325) taken on 2021-05-11 04:18:31 is 25 GB, but previous one was 17 GB, a change of 52.2% [14:19:45] so 8GB for mailman3, my estimate is around 30GB in total [14:20:23] I can check [14:20:39] yeah, let's wait for upgrades to finish [14:20:42] some left [14:21:00] no issue with the size btw [14:22:59] 24GB on disk right now, mostly the web db [14:23:22] out of 164G [14:23:40] but size on disk is very misleading [14:28:35] web has the actual emails and attachments [14:29:08] oh, really? 
I thought it would only have like a "rendered" version of them [14:29:44] it makes sense why it is easier to manage [14:50:51] yeah, I think it's better because then we can have HA and such (multiple mailman instances for example) [14:51:08] the search index is in the VM though [14:51:20] but that's secondary data [14:55:28] marostegui: sure, cumin2001 can stay around until the switchover is complete, the hardware is still working fine after all [14:58:17] and by then bullseye will be a full release anyway [15:16:34] moritzm, he is gone for the week [15:16:43] 10DBA, 10SRE, 10Wikimedia-Mailing-lists, 10Schema-change, 10User-notice: Mailman3 schema change: change utf8 columns to utf8mb4 - https://phabricator.wikimedia.org/T282621 (10LSobanski) a:03Marostegui Assigning to Manuel for review, can be moved to Blocked afterwards until the announcement goes out. [15:17:09] I will be leaving early today, I am blocked on a backup to finish running so I will return when it is done [15:23:01] ack [16:25:38] 10DBA, 10SRE, 10Wikimedia-Mailing-lists, 10Schema-change, 10User-notice: Mailman3 schema change: change utf8 columns to utf8mb4 - https://phabricator.wikimedia.org/T282621 (10Legoktm) >>! In T282621#7080773, @Ladsgroup wrote: > Wouldn't a message in tech news be enough? tbh unless we expect it to be lik... [16:27:45] jynus: ack, thanks [23:13:52] 10DBA, 10DiscussionTools, 10OWC2020, 10Editing-team (FY2020-21 Kanban Board), 10Patch-For-Review: DBA review: conversation subscriptions - https://phabricator.wikimedia.org/T263817 (10ppelberg) @Marostegui: with https://gerrit.wikimedia.org/r/683067 being merged, would it be accurate for us to consider t...
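A rough sketch of the arithmetic behind the backup size alert quoted at 14:18:11 ("is 25 GB, but previous one was 17 GB, a change of 52.2%"). The function names and the 15% threshold are hypothetical and are not taken from the actual backup checks. With the rounded figures from the message the growth comes out near 47.1%, so the reported 52.2% was presumably computed from unrounded sizes (an assumption).

```python
def percent_change(previous: float, current: float) -> float:
    """Relative size change from previous to current, in percent."""
    if previous <= 0:
        raise ValueError("previous size must be positive")
    return (current - previous) / previous * 100.0


def exceeds_threshold(previous: float, current: float,
                      threshold_pct: float = 15.0) -> bool:
    """True if the size grew or shrank by more than threshold_pct percent.

    The 15% default is a made-up value for illustration only.
    """
    return abs(percent_change(previous, current)) > threshold_pct


if __name__ == "__main__":
    # Rounded sizes in GB as quoted in the channel.
    previous_gb, current_gb = 17.0, 25.0
    change = percent_change(previous_gb, current_gb)
    print(f"change: {change:.1f}%")
    print("alert" if exceeds_threshold(previous_gb, current_gb) else "ok")
```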