[02:33:26] 10DBA, 10DiscussionTools, 10Editing-team, 10Performance-Team, and 2 others: Reduce parser cache retention temporarily for DiscussionTools - https://phabricator.wikimedia.org/T280605 (10Krinkle) >>! In T280605#7060441, @gerritbot wrote: > %%%[mediawiki/extensions/DiscussionTools@master] Allow talk pages to...
[02:42:24] 10DBA, 10MediaWiki-Parser, 10Parsoid, 10Performance-Team: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761 (10Krinkle)
[03:18:17] 10DBA, 10MediaWiki-Parser, 10Parsoid, 10Performance-Team, 10Patch-For-Review: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761 (10Krinkle)
[03:38:33] 10DBA, 10SRE, 10Wikimedia-Mailing-lists, 10Schema-change, 10User-notice: Mailman3 schema change: change utf8 columns to utf8mb4 - https://phabricator.wikimedia.org/T282621 (10Ladsgroup) I agree, it's just better safe than sorry. Maybe it errors out and DBAs need some time (=complications). Definitely not...
[03:40:27] 10DBA, 10MediaWiki-Parser, 10Parsoid, 10Performance-Team, 10Patch-For-Review: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761 (10Krinkle) @Marostegui @Kormat @jcrespo I need from you a simple explanation or best-guess about what the story...
[06:37:06] 10DBA, 10MediaWiki-Parser, 10Parsoid, 10Performance-Team, 10Patch-For-Review: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761 (10jcrespo) I heard you saying for the second time: > it does not use transactions But **every query is a trans...
[07:20:05] 10DBA, 10Analytics, 10Event-Platform, 10WMF-Architecture-Team, 10Services (later): Consistent MediaWiki state change events | MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (10Joe) >>! In T120242#6945584, @Ottomata wrote: > Another idea that may not be feasible: Would it...
[08:23:52] kormat, jynus: FYI I've updated cumin on cumin2002 (there wasn't anything running there). To be 100% sure it's all good before upgrading cumin1001 too, is there a quick way to check that your side of the automation and backups works fine? The library API has no backward incompatibilities; if there is any, it's a bug ;)
[08:24:51] just run transfer.py
[08:27:39] ack, any suggested params to use for a safe test run? I've never run it :)
[08:28:10] and looking at wikitech and doc.w.o I don't see an example of a safe/noop command
[08:29:16] there is really no noop command, other than --help; just transfer some random file from your home somewhere
[08:29:41] that should use the remoteExecution api that uses cumin
[08:30:30] ack
[08:33:25] looks to work fine
[08:37:48] cool
[08:38:47] what we can do next is migrate 1 backup job to the new host, so we can check the full process
[08:39:29] I tested on cumin2001 (buster, new cumin), so that I can upgrade cumin on cumin1001 too
[08:40:05] the bullseye host AFAIK still has apt issues with the mariadb client
[08:40:09] (cumin2002)
[08:40:23] ah, true
[08:40:41] I guess we can still do a manual run of a backup
[08:40:57] let me do that in a screen on cumin2001
[08:41:28] ack, thx
[08:41:57] I think there is stuff that uses remoteexecution in wmfmariadbpy, but you will have to ask kormat
[08:42:15] I'm 99.9% sure all works fine
[08:42:50] but you know, there's always that one case...
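
A note on jcrespo's point above (06:37) that "every query is a transaction": under MySQL/MariaDB's default autocommit mode, every statement runs as its own implicit transaction, so a purge script that "does not use transactions" still pays per-statement transaction overhead. A minimal sketch of the distinction, assuming a pymysql connection and an objectcache-style table (the connection details and table name are placeholders, not the real purge script):

    # With autocommit on, each DELETE is an implicit single-statement
    # transaction, committed as soon as it finishes.
    import pymysql

    conn = pymysql.connect(host="localhost", user="test",
                           database="test", autocommit=True)
    with conn.cursor() as cur:
        cur.execute("DELETE FROM objectcache WHERE exptime < NOW() LIMIT 1000")
        # already committed here; no explicit COMMIT needed

    # With autocommit off, the same DELETE only becomes durable at commit().
    conn.autocommit(False)
    with conn.cursor() as cur:
        cur.execute("DELETE FROM objectcache WHERE exptime < NOW() LIMIT 1000")
    conn.commit()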
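For the transfer.py smoke test jynus suggests at 08:29 (transfer some random file from your home directory), something like the following would do. The host names, paths, and the host:path argument syntax here are assumptions for illustration, not taken from transferpy's documentation:

    # Hypothetical transfer.py smoke test after a cumin upgrade: copy a
    # throwaway file between two hosts, exercising the remoteExecution API
    # that wraps cumin. Hosts, paths, and CLI syntax are illustrative only.
    import subprocess

    src = "cumin2001.codfw.wmnet:/home/volans/transfer-test.txt"  # hypothetical
    dst = "dbprov2001.codfw.wmnet:/home/volans/"                  # hypothetical
    result = subprocess.run(["transfer.py", src, dst])
    print("transfer.py exited with", result.returncode)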
[08:43:23] I trust your code, I don't trust mine :-)
[08:44:34] I don't trust any code, mine in particular ;)
[08:45:08] the backup will also work to test a 0.4->0.5 upgrade of wmfbackups, which should be a noop
[08:45:21] ack
[08:46:44] it is currently running
[08:46:54] I have a question for you about cumin
[08:46:59] if you have 1 minute
[08:47:33] sure
[08:48:24] it does not accept an array as remote execution content, only a string, right?
[08:48:40] as in e.g. ['ls', '-l']
[08:49:21] no, just a string, as it will be passed to ssh by clustershell
[08:49:48] but a ' '.join(['ls', '-l']) should be enough to convert it
[08:50:00] yes, that is what I do
[08:50:49] btw, with the new cumin you should be able to get rid of the /dev/null trick
[08:51:27] it is now possible to tell cumin independently whether to print the command output and the progress bars
[08:51:38] but, and this is a bug on my side, would you have any suggestion on how to go about this: https://phabricator.wikimedia.org/T256749
[08:52:47] I should check the code, but you could quote the paths
[08:52:51] with single quotes
[08:53:16] so that the final ssh command will have '/home/jynus$(rm /home/jynus/vuln)' as the path
[08:53:30] including the quotes
[08:56:50] I will try, but I had issues with always being able to get to a shell (e.g. you can add single quotes there)
[08:58:31] I will create a ticket to track the revert of the output code
[09:00:02] sorry, meeting, bbl
[09:10:37] 10DBA, 10Data-Persistence-Backup, 10SRE-tools: Revert workaround for cumin output verbosity on RemoteExecution (CuminExecution) abstraction - https://phabricator.wikimedia.org/T282775 (10jcrespo)
[09:20:03] arturo, labsdb1009/10/11 are failing to collect metrics. Should those be removed from monitoring (tendril/zarcillo)?
[09:26:54] mmmm jynus I'm not sure. Are they wiki replicas? Better to open a phab task to help clarify the situation
[09:27:22] do you know if there is one for labsdb decommission or update?
[09:27:56] I see, there is T282522
[09:27:57] T282522: decommission labsdb1009.eqiad.wmnet - https://phabricator.wikimedia.org/T282522
[09:28:33] I will let him handle next week
[09:28:39] *it
[09:54:17] 👍
[10:10:19] 10DBA, 10Data-Persistence-Backup, 10SRE-tools: Revert workaround for cumin output verbosity on RemoteExecution (CuminExecution) abstraction - https://phabricator.wikimedia.org/T282775 (10LSobanski) p:05Triage→03Low
[10:51:17] volans: [09:43:39]: INFO - Backup finished correctly / Last snapshot for x1 at codfw (db2101.codfw.wmnet:3320) taken on 2021-05-13 09:13:05 (280 GB)
[11:00:24] great!
[11:03:00] 10DBA, 10Data-Persistence-Backup, 10SRE-tools: Revert workaround for cumin output verbosity on RemoteExecution (CuminExecution) abstraction - https://phabricator.wikimedia.org/T282775 (10Volans) Version is 4.1.0, currently latest on apt.w.o and PyPI and deployed to all cumin hosts in production. See `worker....
[11:03:06] thanks for testing!
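
On the quoting discussion above (08:48-08:56): when an argument list has to be flattened into a single string for cumin/clustershell, quoting each element before joining keeps shell metacharacters in paths from being executed on the remote side. A sketch using only the Python standard library; whether this survives every layer of the ssh pipeline is exactly the open question in T256749:

    # Join an argument list into a single, safely quoted command string.
    # shlex.quote() wraps each argument in single quotes and escapes any
    # embedded single quotes, which addresses the "you can add single
    # quotes there" trick at least at this layer.
    import shlex

    def to_remote_command(args):
        return " ".join(shlex.quote(a) for a in args)

    path = "/home/jynus$(rm /home/jynus/vuln)"
    print(" ".join(["ls", "-l", path]))           # unsafe: $(...) would run remotely
    print(to_remote_command(["ls", "-l", path]))  # ls -l '/home/jynus$(rm /home/jynus/vuln)'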
[11:40:40] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[11:47:58] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[12:57:09] 10DBA, 10Analytics, 10Event-Platform, 10WMF-Architecture-Team, 10Services (later): Consistent MediaWiki state change events | MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (10Ottomata) > it would make the database availability depend on the availability of eventgate, and...
[13:22:24] 10DBA, 10MediaWiki-Parser, 10Parsoid, 10Performance-Team, 10Patch-For-Review: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761 (10LSobanski) p:05Triage→03High
[13:27:58] 10DBA, 10DiscussionTools, 10Performance-Team, 10Editing-team (FY2020-21 Kanban Board), and 2 others: Reduce parser cache retention temporarily for DiscussionTools - https://phabricator.wikimedia.org/T280605 (10Esanders)
[21:21:48] 10DBA, 10MediaWiki-Parser, 10Parsoid, 10Performance-Team, 10Patch-For-Review: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761 (10dpifke) Have we considered using table partitioning to make the purge less expensive? If we partitioned based...
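
dpifke's comment at 21:21 is truncated, but the technique it names is standard: if the cache table is range-partitioned on the expiry time, the daily purge can drop whole partitions (a metadata operation) instead of deleting rows in batches for 24+ hours. A sketch of the idea against an objectcache-style table; the table name, column names, and partition layout are illustrative, and note that MariaDB requires the partitioning column to be part of every unique key, so this would not be a drop-in change for the real schema:

    # Illustrative only: partition an objectcache-style table by expiry day,
    # then purge by dropping expired partitions instead of row-by-row DELETEs.
    import pymysql

    PARTITION_DDL = """
        ALTER TABLE objectcache
        PARTITION BY RANGE (TO_DAYS(exptime)) (
            PARTITION p20210513 VALUES LESS THAN (TO_DAYS('2021-05-14')),
            PARTITION p20210514 VALUES LESS THAN (TO_DAYS('2021-05-15')),
            PARTITION pmax VALUES LESS THAN MAXVALUE
        )
    """

    def purge_partition(conn, name):
        # Dropping a partition discards all of its rows at once, which is
        # why the purge would no longer need to scan and delete row by row.
        with conn.cursor() as cur:
            cur.execute("ALTER TABLE objectcache DROP PARTITION " + name)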