[01:23:31] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[01:30:35] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[01:37:04] 10DBA, 10SRE, 10Wikimedia-Mailing-lists, 10Schema-change, 10User-notice: Mailman3 schema change: change utf8 columns to utf8mb4 - https://phabricator.wikimedia.org/T282621 (10Legoktm)
[01:37:43] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[04:04:45] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[04:09:41] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[05:06:20] I am going to restart tendril's mysql, due to high memory consumption
[05:07:43] 10DBA, 10SRE, 10Wikimedia-Mailing-lists, 10Schema-change, 10User-notice: Mailman3 schema change: change utf8 columns to utf8mb4 - https://phabricator.wikimedia.org/T282621 (10Marostegui) >>! In T282621#7094350, @Legoktm wrote: > How about 2021-05-19 06:00 UTC? Or any other day at that time > Works for me!
[05:08:43] 10DBA, 10SRE, 10Wikimedia-Mailing-lists, 10Schema-change, 10User-notice: Mailman3 schema change: change utf8 columns to utf8mb4 - https://phabricator.wikimedia.org/T282621 (10Marostegui) p:05Triage→03Medium
[05:09:15] done
[05:12:40] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1106.eqiad.wmnet'] ` The log can be found in...
[05:28:01] 10DBA, 10Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10Marostegui)
[05:28:06] 10DBA, 10Patch-For-Review: Move db2108 from s2 to s7 - https://phabricator.wikimedia.org/T282535 (10Marostegui) 05Open→03Resolved This is all done.
[05:32:26] 10DBA, 10SRE, 10Wikimedia-Mailing-lists, 10Schema-change, 10User-notice: Mailman3 schema change: change utf8 columns to utf8mb4 - https://phabricator.wikimedia.org/T282621 (10Legoktm) Announced: https://lists.wikimedia.org/hyperkitty/list/listadmins-announce@lists.wikimedia.org/thread/THQY2OJYW5NZIBFS3OK...
[05:33:36] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1106.eqiad.wmnet'] ` and were **ALL** successful.
[05:36:46] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10Marostegui) db1106 reimaged. Checking its tables.
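As background for the Mailman3 schema change tracked in T282621 above (changing utf8 columns to utf8mb4), the general shape of such a conversion is sketched here; the table and column names are placeholders, not the actual Mailman3 schema.

  -- Illustrative sketch only; the table/column names are placeholders,
  -- not Mailman3's real schema.
  ALTER TABLE example_table
      MODIFY COLUMN example_column VARCHAR(255)
      CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL;
  -- Or, to convert a whole table at once:
  ALTER TABLE example_table
      CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;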
[05:38:22] 10DBA, 10Data-Services, 10decommission-hardware, 10cloud-services-team (Kanban): decommission labsdb1009.eqiad.wmnet - https://phabricator.wikimedia.org/T282522 (10Marostegui) 05Stalled→03Open
[05:52:56] 10DBA, 10Data-Services, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission labsdb1009.eqiad.wmnet - https://phabricator.wikimedia.org/T282522 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `labsdb1009.eqiad.wmnet`...
[06:28:24] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[06:29:59] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1083.eqiad.wmnet - https://phabricator.wikimedia.org/T281445 (10Marostegui)
[06:32:02] 10DBA: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (10Marostegui)
[06:32:05] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[06:32:29] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1083.eqiad.wmnet - https://phabricator.wikimedia.org/T281445 (10Marostegui) 05Stalled→03Open
[06:37:16] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[06:41:51] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1083.eqiad.wmnet - https://phabricator.wikimedia.org/T281445 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db1083.eqiad.wmnet` - db1083.eqiad.wmnet (**PASS**) - Downtimed host on Icinga...
[06:42:32] 10DBA, 10DC-Ops, 10decommission-hardware, 10ops-eqiad: decommission db1083.eqiad.wmnet - https://phabricator.wikimedia.org/T281445 (10Marostegui) a:05Marostegui→03wiki_willy
[06:43:10] 10DBA, 10DC-Ops, 10decommission-hardware, 10ops-eqiad: decommission db1083.eqiad.wmnet - https://phabricator.wikimedia.org/T281445 (10Marostegui) Ready for #dc-ops
[06:43:18] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[07:11:14] Amir1: altering image table on commons is going to be fun...
[07:11:47] is it even possible?
[07:11:56] we'll see...
[07:12:05] ALTER TABLE image MODIFY COLUMN img_timestamp BINARY(14) NOT NULL is now running
[07:12:42] I am curious to see how many days it takes per host XD
[07:13:48] look at the bright side, it'll save 6MB by dropping two bits from each field
[07:14:13] XDDDD
[07:15:13] one weird thing, the query of db size gives me 80GB for image, I don't know how it ended up as 360GB on disk
[07:15:39] (with indexes ofc)
[07:15:43] which query?
[07:16:16] https://stackoverflow.com/questions/9620198/how-to-get-the-sizes-of-the-tables-of-a-mysql-database#9620273
[07:16:20] information schema
[07:17:08] yeah, that is approximate, and the larger the table, the more approximate it is
[07:17:18] as it refreshes rarely
[07:28:32] noted. Thanks. If we had some schema change shrinking it, I would suspect otherwise but I can't remember any
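The table-size query referenced above (the Stack Overflow / information_schema approach) looks roughly like the sketch below; as jynus notes, these statistics are estimates that refresh rarely, so they can be far off the on-disk size for very large tables like image. The schema name is only an example.

  -- Approximate table sizes from InnoDB statistics; these are estimates and
  -- can lag far behind the real file size on very large tables (e.g. image).
  SELECT table_schema,
         table_name,
         ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS total_gb,
         ROUND(data_length / 1024 / 1024 / 1024, 2) AS data_gb,
         ROUND(index_length / 1024 / 1024 / 1024, 2) AS index_gb
  FROM information_schema.tables
  WHERE table_schema = 'commonswiki'   -- example schema name
  ORDER BY data_length + index_length DESC
  LIMIT 10;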
[07:28:54] Amir1: we'll see once I finish this one, as it is rebuilding the table
[07:47:08] Hi all. I'm a new GSoC student. Just saying hi and trying to find my mentors Jaime (jynus) and Manuel (marostegui) :D
[07:47:33] hello hkrishna, congratulations
[07:50:15] hkrishna, same!
[07:50:39] Yay, good to meet you both and thank you
[08:33:14] reminder that I will be unavailable in an hour or so for a couple of hours
[08:38:36] Amir1, FYI mailman's db (m5) grew another 33% - not worrying at all - I am going to ack the alert
[08:39:47] Thanks!
[08:49:24] marostegui: ah, thanks for removing db1131 from the stretch futex task. i had only just thought of doing that
[08:49:33] :)
[09:03:38] 10DBA, 10Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (10Kormat)
[09:11:40] 10DBA, 10Data-Persistence-Backup, 10SRE-tools, 10User-Kormat: Revert workaround for cumin output verbosity on RemoteExecution (CuminExecution) abstraction - https://phabricator.wikimedia.org/T282775 (10Kormat)
[09:12:01] 10DBA, 10User-Kormat: wmfmariadbpy: Close connections gracefully - https://phabricator.wikimedia.org/T282989 (10Kormat)
[09:34:10] 10DBA, 10Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (10Kormat)
[09:34:36] 10DBA, 10Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (10Kormat) db1131 completed the databases check successfully, and is now being repooled.
[09:36:08] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1087.eqiad.wmnet - https://phabricator.wikimedia.org/T282093 (10Marostegui)
[10:16:22] <_joe_> hello database friends, I have a question for you
[10:16:44] <_joe_> I need to tell kubernetes all the IP/ports it should open for mediawiki towards databases
[10:17:32] <_joe_> now, for the metadata databases I assume I should just use what's in conftool-data as a basis
[10:17:49] <_joe_> but for pc/es, how can I find a similar list of IPs?
[10:18:10] _joe_: db-eqiad.php / db-codfw.php are the canonical places
[10:18:42] <_joe_> marostegui: sorry I should've added, "in puppet"
[10:18:44] We have only 3 active per DC + one spare (which can rotate depending on the needs)
[10:18:45] ah
[10:18:58] actually, es: they are on conftool
[10:19:04] <_joe_> ack
[10:19:25] For parsercache, they all have the role mariadb::parsercache
[10:19:35] not sure if that is what you are looking after
[10:19:41] <_joe_> btw, this will mean you'll have to make a deployment of mw every time you have to add a database from scratch
[10:19:42] _joe_: query the service catalog™
[10:19:46] * volans runs
[10:20:06] <_joe_> volans: that's what I am doing atm, the service catalog is puppet for this stuff :P
[10:20:10] _joe_: what do you mean a deployment?
[10:20:44] <_joe_> marostegui: in k8s, the difference between a puppet run and a deployment of the code practically disappears
[10:21:09] <_joe_> so when you add a new db, you'll need to deploy a new kubernetes manifest to allow connecting to it
[10:21:23] <_joe_> tbh I'm considering allowing connecting to any IP on the db ports
[10:21:59] _joe_: that's painful :(
[10:22:05] also for moving DBs around?
[10:22:12] or just for new DBs (ie: new IPs)
[10:22:21] <_joe_> marostegui: just adding new IPs/ports
[10:22:30] even if we use dbctl?
[10:22:33] <_joe_> and it won't actually redeploy everything, just the networkpolicy
[10:22:49] <_joe_> dbctl comes after you added a new db to conftool-data, right?
[10:23:16] yes to: conftool-data/dbconfig-instance/instances.yaml
[10:23:24] <_joe_> so right now you add to conftool-data, then enable in dbctl
[10:23:29] yep
[10:23:52] <_joe_> in my hypothesis, you'd have to add to conftool-data, run "deploy-to-kubernetes", enable in dbctl
[10:24:09] <_joe_> the other alternative is
[10:24:15] <_joe_> you give me a range of ports
[10:24:27] <_joe_> and I give a free pass to connect to all those ports in the local DC
[10:24:40] if that is for MW databases only, that's easier
[10:25:04] it would be: 3306 and 3311-3320
[10:25:08] <_joe_> easier != better, but I think we can get there
[10:26:36] <_joe_> this would all be simpler to manage if we had dedicated IP ranges for the various functions, something we never did really
[10:28:28] yeah, we don't have that for DBs
[10:28:45] I mean, we just have 10.64.% and 10.192.%
[10:30:36] <_joe_> yeah :P
[10:31:16] <_joe_> marostegui: so the port range is 3306-3320?
[10:31:31] so we have 3306 for the hosts that only have 1 process
[10:32:03] and then for the multi-instance hosts, we have 331X, where X is the section, ie: s1: 3311, s2: 3312....
[10:32:16] so I would suggest we open till 3320 in case we create 2 more sections (unlikely)
[10:35:25] <_joe_> so 3306 and then 3311-3320
[10:35:30] yeah
[10:35:33] <_joe_> and if we create more sections, we'll add more
[10:35:37] <_joe_> ack
[10:35:42] yeah, for now we only use till 3318
[10:35:44] <_joe_> a patch will come your way today
[10:35:51] <_joe_> wait what about wikitech
[10:35:52] so we have room for s9 and s0 (we always talk about them)
[10:36:16] good one..._joe_ so wikitech is supposed to go to s6 "soon"
[10:36:21] But i have no idea if that is going to happen
[10:36:45] it is now on s10, and I have no idea how it is handled on dbctl; that's pending some discussion between andrew and cdanis XD
[10:36:56] it uses 3306 though
[10:37:04] and it will use 3306+3316 once moved
[10:38:00] <_joe_> yeah I'm always confused about the status of wikitech
[10:38:29] <_joe_> cdanis probably has more context, yes
[13:46:58] marostegui: _joe_: so it looks like we un-did the hardcoding that had originally been in place for labswiki / labstestwiki, and now they're s10 and s11 which are 'standard' sections, managed by dbctl and with the usual other supporting dblist scaffolding in mediawiki-config
[13:47:10] it *used* to be special but looks like that was fixed in Nov 2019
[13:47:38] <_joe_> so we have s11 for what?
[13:47:47] <_joe_> oh test wikitech
[13:47:56] <_joe_> which is in production, I see
[13:48:53] s11 might still be special actually
[13:48:55] but s10 isn't
[13:51:31] * _joe_ more confused
[13:51:46] <_joe_> anyways! I decided to go with the broad cannon for now
[13:51:59] yes ok
[13:52:22] wikitech aka labswiki aka s10 is defined in etcd as normal
[13:53:30] however labtestwiki aka s11 is special and hardcoded: https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/f95bd0535ae49fb35466d8c7d90d969cead99c22/wmf-config/CommonSettings.php#315
[13:55:29] <_joe_> I guess we're keeping stuff interesting!
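A small aside on the port convention described above (3306 for single-instance hosts, 331X per section on multi-instance hosts, reserved up to 3320): once connected, the instance actually reached can be confirmed with a query like this sketch.

  -- Confirm which instance you are connected to; under the convention above,
  -- @@port 3311 means s1, 3312 means s2, ..., 3318 means s8 (3306 = single instance).
  SELECT @@hostname, @@port;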
[13:56:17] yeah
[13:57:43] I'm not sure what the end state 'should' be here, but that is the status qup
[13:57:45] quo
[13:58:37] <_joe_> well anyways, I think a lot of this stuff will have to be in sync with the rest of mediawiki once we move to kubernetes anyways
[13:59:03] IIRC there was a desire to keep labstestwikitechlabs special, because I think it's hosted in codfw?
[13:59:05] <_joe_> and/or when we decide to be stricter with ACLs
[13:59:15] <_joe_> yeah that's gonna go too
[13:59:41] <_joe_> I mean, unless we prefer to stop following the train again with wikitech
[13:59:49] <_joe_> things will stop looking like this
[14:07:49] welp, this is the first i've ever heard of s11
[14:09:11] cdanis: But what will happen with s10 in dbctl once we move wikitech out from it?
[14:09:27] As far as I remember we still need to update db-eqiad.php and db-codfw.php and add labswiki pointing to s6
[14:09:33] No?
[14:10:00] yes, we'll have to do that
[14:10:19] cdanis: I am offline now, but would you have time to comment on the wikitech move doc regarding the steps needed from dbctl's side?
[14:10:24] will do
[14:10:27] Thanks
[14:31:32] 10DBA, 10decommission-hardware: decommission db1085.eqiad.wmnet - https://phabricator.wikimedia.org/T282096 (10Kormat)
[14:51:26] 10DBA, 10decommission-hardware: decommission db1085.eqiad.wmnet - https://phabricator.wikimedia.org/T282096 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by kormat@cumin1001 for hosts: `db1085.eqiad.wmnet` - db1085.eqiad.wmnet (**PASS**) - Downtimed host on Icinga - Found physical host...
[14:52:08] 10DBA, 10decommission-hardware: decommission db1085.eqiad.wmnet - https://phabricator.wikimedia.org/T282096 (10Kormat)
[14:59:45] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 3.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[15:04:51] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[15:19:27] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1085.eqiad.wmnet - https://phabricator.wikimedia.org/T282096 (10Kormat)
[16:15:57] when do you think we are going to get this level of automation? https://twitter.com/mschoening/status/1394675707695878148
[17:15:14] 10Blocked-on-schema-change, 10DBA: Schema change for making cuc_id in cu_changes unsigned - https://phabricator.wikimedia.org/T283093 (10Ladsgroup)
[18:45:39] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[18:50:41] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[20:53:45] Toolforge replicas are ~15 hours lagged. Is this a known issue? https://replag.toolforge.org/
[20:53:51] s1, that is
[22:47:59] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[22:57:43] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[23:35:07] musikanimal: very likely known, yes. This is a common thing to see when migrations are running in the prod cluster. The problem is not that the wiki replicas are not getting updated; it is instead that the s1 instance feeding into the sanitarium instances (which do redaction before pushing updates to the wiki replicas) is behind.
[23:35:59] It would be nice to know if there is a way to see in tendril or elsewhere what is slowing the pipeline down.
[23:36:47] ah, I see. Thanks for the explanation. Someone did create a task so I tagged Data-Services just in case https://phabricator.wikimedia.org/T283112
[23:37:45] https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#Identifying_lag kind of explains things, but maybe it could be clearer
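To supplement the explanation above and the linked wikitech page, two ways of looking at replica lag directly are sketched below: the generic MariaDB replication status, and the heartbeat-based view the help page describes for the wiki replicas. The heartbeat_p names are taken as documented there; treat them as an assumption if the page has changed.

  -- Generic MariaDB view of replication lag on any replica
  -- (Seconds_Behind_Master is the headline number):
  SHOW SLAVE STATUS\G

  -- On the wiki replicas, the heartbeat-based view documented on the wikitech
  -- page linked above gives per-section lag (s1 is the enwiki section):
  SELECT lag FROM heartbeat_p.heartbeat WHERE shard = 's1';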