[06:06:30] 10Data-Persistence-Backup, 10SRE, 10SRE-swift-storage, 10Traffic, 10netops: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (10Krinkle)
[06:06:46] 10Data-Persistence-Backup, 10SRE, 10SRE-swift-storage, 10Traffic, and 2 others: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (10Krinkle)
[08:29:43] marostegui, jynus: cumin1001 reboot good to go from a DB/backup perspective?
[08:30:35] ok for me, there is 1 attached screen, tough 30077.T273359
[08:30:36] T273359: Schema change for renaming name_title_timestamp on archive table - https://phabricator.wikimedia.org/T273359
[08:30:46] *but
[08:31:15] that is probably either marostegui or kormat
[08:32:11] ack
[08:32:17] probably manuel, I think kormat uses tmux?
[08:36:27] oh, a tmux also seems to have some running stuff
[08:38:10] but they are just idle mysql connections, so that should be ok
[08:38:50] and the attached connection is idle too
[08:40:31] moritzm, manuel is on vacation today, so I am 99% we are ok to proceed
[08:40:33] *sure
[08:40:51] I don't see any activity, only people with open idle terminals
[08:41:31] 👍 to reboot
[08:41:44] moritzm, so go ahead
[08:43:06] thanks! going ahead in ~5m
[09:16:02] 10Data-Persistence-Backup, 10Analytics: Matomo database backup size doubled, we should check this is normal operation - https://phabricator.wikimedia.org/T272344 (10jcrespo) FYI: ` [dbbackups]> select section, start_date, total_size, REPEAT('▄', total_size/20000000) as graph FROM backups where section='matomo'...
[09:16:45] ^do you like my ascii-art prowess?
[09:17:45] amazing :D
[09:18:10] 10Data-Persistence-Backup, 10Analytics: Matomo database backup size doubled, we should check this is normal operation - https://phabricator.wikimedia.org/T272344 (10elukey) @razzi can you check? :)
[09:18:33] jynus: 🧑‍🎨
[09:20:43] not as nice as shlomi's pie charts ( http://code.openark.org/blog/mysql/sql-pie-chart ), but they get the job done :-)
[09:21:27] whoa
[09:35:47] kormat: hi, I originally thought you could deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/666216/, but fine if you don't want to :)
[09:36:17] Urbanecm: ahh, i wasn't sure of the workflow/not very awake yet. i'm happy to deploy it.
[09:36:59] kormat: thanks! We want to create the tables soon, and I believe we should wait for OK from DBAs before non-public tables are created :)
[09:37:13] that sounds plausible :)
[09:38:49] :)
[09:38:52] thanks for your help kormat
[09:45:19] Urbanecm: you're all set for table creation now
[09:45:29] thank you kormat, appreciate it!
[09:45:36] my pleasure :)
[10:06:24] 10DBA, 10SRE: Decom dbmonitor2001 - https://phabricator.wikimedia.org/T274496 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by kormat@cumin1001 for hosts: `dbmonitor2001.wikimedia.org` - dbmonitor2001.wikimedia.org (**PASS**) - Downtimed host on Icinga - Found Ganeti VM - VM shutdown...
[10:20:42] 10DBA, 10SRE, 10Patch-For-Review: Decom dbmonitor2001 - https://phabricator.wikimedia.org/T274496 (10Kormat) 05Open→03Resolved dbmonitor2001 was indeed unused, and is now decommissioned.
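The REPEAT('▄', …) query in the 09:16 update is truncated, but the technique is generic: scale each backup's total_size down to a count of block characters so a size change (such as the doubling in T272344) is visible straight from the terminal. Below is a minimal Python sketch of the same idea, assuming a backups table with section/start_date/total_size columns as shown above; the pymysql driver, the placeholder credentials and the ORDER BY are illustrative assumptions, not the exact query or client that was used.

import pymysql  # assumed client library; any DB-API driver would work the same way

# Same idea as the truncated query above: one '▄' per ~20 MB of backup.
QUERY = """
SELECT section, start_date, total_size,
       REPEAT('▄', FLOOR(total_size / 20000000)) AS graph
FROM backups
WHERE section = %s
ORDER BY start_date
"""

def print_backup_size_graph(section):
    # Host and credentials are placeholders, not the real dbbackups setup.
    conn = pymysql.connect(host="db.example.org", user="reader", password="...",
                           database="dbbackups", charset="utf8mb4")
    try:
        with conn.cursor() as cur:
            cur.execute(QUERY, (section,))
            for sec, start_date, total_size, graph in cur.fetchall():
                print(f"{sec:10} {start_date} {total_size:>14} {graph}")
    finally:
        conn.close()

if __name__ == "__main__":
    print_backup_size_graph("matomo")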
[13:29:21] 10DBA: mariadb: Replication lag monitoring does not support circular replication - https://phabricator.wikimedia.org/T275497 (10Kormat)
[13:31:49] 10DBA: mariadb: Replication lag monitoring does not support circular replication - https://phabricator.wikimedia.org/T275497 (10Kormat)
[13:40:03] 10DBA: mariadb: Replication lag monitoring does not support circular replication - https://phabricator.wikimedia.org/T275497 (10Kormat) [[ https://github.com/wikimedia/puppet/blob/2ae5c9682b0eaa22f3ad17f3df27f3dedd6f8a50/modules/profile/manifests/mariadb/replication_lag.pp#L6-L13 | profile::mariadb::replication_...
[13:41:20] if y'all are interested in naming things: https://phabricator.wikimedia.org/T275175#6852471
[13:42:57] godog: if you can frame it in terms of wording of OKRs, sobanski is your man. ;)
[13:45:33] godog: Let's call it "misc storage", ms for short
[13:45:36] Oh, wait...
[13:45:39] ;)
[13:45:56] lol kormat sobanski
[13:46:24] time to test those utf8 hostnames after all (?)
[13:47:12] I like ssoss so far but can't type it, sounds almost like 'sorry' so 'soz' for short
[13:47:29] Does it have to be short? Can't we just do swift-misc?
[13:50:49] moss - misc object storage service
[13:51:08] yeah I'm a big advocate of short hostnames, we still type them a lot
[13:51:30] godog: 👍
[13:52:20] yeah moss we can do I think, ties nicely with mr Ayoade
[13:52:29] :>
[13:52:36] +1
[13:52:44] And we can always turn it off and on again
[13:52:59] godog: :D
[13:53:05] hehe indeed
[13:53:47] kormat: I think you should update the task and take the due credit :D
[13:54:22] haha, done
[13:55:32] other simple options: file1001 (pro: no other server starts with 'fi'), store1001
[13:56:03] oh wait, those are the frontends, my bad
[13:56:06] I thought the backend
[13:56:49] godog: nfs (not a filesystem), to reduce confusion
[13:57:08] :-P
[13:57:27] haahha yeah
[13:58:02] one of those costly decisions that don't and can't show up on balance sheets
[13:58:22] but yeah there will be the -fe / -be suffixes appended
[13:58:59] going for 'moss' unless there are objections
[13:59:12] missed opportunity there on object-ions
[13:59:18] ah, then I restore my proposals: file-fe1001, store-fe1001 and I add object-fe1001
[14:00:11] but not, object-fe1001 is not an objection to moss :)
[14:00:15] *but no
[14:00:38] hehe yeah so far I think moss is the winner
[14:03:12] ok going for 'moss'
[14:39:26] godog, I had some plans to start the stress testing today, but I ran into some unexpected issues-
[14:40:19] I found more "lost" files on the db, such as archived https://commons.wikimedia.org/wiki/File:%D0%AF%D1%83%D0%B7%D1%81%D0%BA%D0%B0%D1%8F_%D0%B1%D0%BE%D0%BB%D1%8C%D0%BD%D0%B8%D1%86%D0%B0._%D0%97%D0%B0%D0%BF%D0%B0%D0%B4%D0%BD%D1%8B%D0%B9_%D1%84%D0%BB%D0%B8%D0%B3%D0%B5%D0%BB%D1%8C..JPG
[14:41:41] poor files :(
[14:41:43] jynus: ack
[14:42:29] so i am making sure I don't end up with null pointers by checking every way in which the db can be incomplete
[14:44:12] I was expecting a certain amount of errors, and to just log/register those on the db, but others are new for commons
[15:23:29] 10DBA, 10SRE, 10ops-eqiad: Degraded RAID on db1103 - https://phabricator.wikimedia.org/T275266 (10Cmjohnson) 05Open→03Resolved This appears to have been done by @Jclark-ctr
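To make the 14:42 point concrete — checking every way a db row can be incomplete before dereferencing it, and registering the problem rather than crashing on a null pointer — here is a hedged Python sketch. The FileRow shape, field names and error list are hypothetical illustrations, not the actual commonswiki metadata or backup code.

from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class FileRow:
    # Hypothetical subset of the per-file metadata a backup run might read.
    wiki: str
    title: Optional[str]
    sha1: Optional[str]
    size: Optional[int]

def is_complete(row: FileRow, errors: List[Tuple[str, Optional[str], str]]) -> bool:
    """Return True if the row can be safely queued; otherwise register why not."""
    problems = []
    if not row.title:
        problems.append("missing title")
    if not row.sha1:
        problems.append("missing sha1")
    if row.size is None or row.size < 0:
        problems.append("missing or invalid size")
    if problems:
        errors.append((row.wiki, row.title, "; ".join(problems)))
        return False
    return True

# Usage sketch: filter a batch before downloading, keep the error log for review.
errors: List[Tuple[str, Optional[str], str]] = []
rows = [FileRow("commonswiki", "Example.jpg", "7000b23c36", 12345),
        FileRow("commonswiki", None, None, None)]
queue = [r for r in rows if is_complete(r, errors)]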
[15:47:09] 10Data-Persistence-Backup, 10SRE, 10SRE-swift-storage, 10Traffic, and 2 others: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (10jcrespo) I have now started 10 threads reading and retrieving commonswiki files to its temporary backup location at dbprov2003. dbprov2003 only has 8TB a...
[15:48:03] ^godog I don't expect to cause any issues with just 10 threads, but just in case
[15:48:53] I think I will hit cpu limits before swift io ones
[15:49:11] jynus: ack, thank you for the heads up, LMK how it goes
[15:49:44] initial throughput is reasonable, 300MBytes/s, but we will see if it sustains
[15:55:10] yeah pretty sure you can crank things up on the swift side at least
[15:55:57] the cluster has noticed the start of operations a bit: https://grafana.wikimedia.org/d/000000607/cluster-overview?viewPanel=84&orgId=1&var-site=codfw&var-cluster=swift&var-instance=All&var-datasource=thanos&from=1614052553853&to=1614095753853
[15:56:00] but not by much
[16:01:19] I think, however, that we are getting a much better latency for reads
[16:01:55] which means we may have to either depool swift when doing the real backup, or eat the slowdown due to user reads
[16:07:14] I was very conservative in terms of memory- I am "only" using 3GB with 15 threads
[16:07:41] although filesystem cache usage is very large
[16:13:56] and errors so far are reported nicely: e.g. [2021-02-23 16:12:51,583] ERROR:backup Download of "commonswiki "Варила_я_рибку",_"Красоля,"_Полонне,_Україна.webm 7000b23c3602e1ca2893876b4e88ef7484251911" failed
[16:14:19] ^this is because from the time it was detected until it was downloaded, it had been renamed
[16:16:21] 10DBA, 10SRE, 10ops-eqiad: db1162 crashed - https://phabricator.wikimedia.org/T275309 (10Cmjohnson) I pulled the power and drained flea power, plugged back in and the server will not even power up.
[16:21:53] :-(
[16:22:51] 10DBA, 10SRE, 10ops-eqiad: db1162 crashed - https://phabricator.wikimedia.org/T275309 (10Cmjohnson) Created a dispatch with Dell SR1052419298
[16:23:04] I am going to restart the backups, now using the SSDs
[16:36:25] 10DBA, 10netbox: Grants not working with DB hosts with to ipv6 - https://phabricator.wikimedia.org/T270101 (10crusnov)
[16:36:28] 10DBA, 10SRE-tools, 10IPv6: Some Data Persistence DB clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271140 (10crusnov)
[16:52:22] 10DBA, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Papaul)
[16:52:52] so the HDs show themselves to be a bottleneck for random writes- moving to the SSD, the bottleneck moves to the network
[16:53:44] I will leave the backup running for a while to see if network usage is sustained over a longer period of time
[16:56:27] if it works well, tomorrow I will set up ms-be2016 to check backups will scale pseudo-linearly with the number of machines
[17:32:24] 10DBA, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Papaul)
[19:16:31] 10DBA, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Papaul)
[19:44:40] 10DBA, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Papaul)
[19:46:14] 10DBA, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Papaul) >>! In T273568#6809673, @Marostegui wrote: > These hosts have been added to puppet with: `insetup` role and also assigned a partman recipe for the installation. > The on...
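As a rough sketch of the kind of multithreaded retrieval described at 15:47 — a small number of reader threads pulling commonswiki originals out of Swift, with failures reported in the '[timestamp] ERROR:backup Download of "…" failed' shape shown at 16:13 — something like the following would do. The python-swiftclient usage, auth endpoint, task tuple layout and target path are assumptions, not the actual backup tool.

import logging
from concurrent.futures import ThreadPoolExecutor

from swiftclient.client import Connection  # python-swiftclient is an assumed dependency

# Reproduces the "[2021-02-23 16:12:51,583] ERROR:backup ..." log shape shown above.
logging.basicConfig(format="[%(asctime)s] %(levelname)s:%(name)s %(message)s",
                    level=logging.INFO)
logger = logging.getLogger("backup")

# Auth endpoint, account and target path are placeholders, not real infrastructure.
AUTH = dict(authurl="https://swift.example.org/auth/v1.0", user="backup:ro", key="secret")

def download(task):
    """Fetch one object; failures (e.g. files renamed mid-run) are logged, not fatal."""
    wiki, container, name, sha1 = task
    try:
        # One connection per task keeps the sketch thread-safe, at the cost of re-auth.
        conn = Connection(**AUTH)
        _headers, body = conn.get_object(container, name)
        with open(f"/srv/backups/{sha1}", "wb") as out:
            out.write(body)
        return True
    except Exception:
        logger.error('Download of "%s %s %s" failed', wiki, name, sha1)
        return False

def run(tasks, threads=10):
    # 10 reader threads, matching the scale of the test described above.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(download, tasks))
    logger.info("%d downloads ok, %d failed", sum(results), len(results) - sum(results))

Logging failures instead of raising matches the behaviour described above: renamed or missing files are expected, get recorded, and the run continues.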
[20:28:25] PROBLEM - MariaDB sustained replica lag on db1147 is CRITICAL: 20.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1147&var-port=9104
[20:29:41] RECOVERY - MariaDB sustained replica lag on db1147 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1147&var-port=9104
[23:42:29] 10DBA, 10DC-Ops, 10SRE, 10ops-codfw, 10Patch-For-Review: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Papaul)
[23:46:57] 10DBA, 10DC-Ops, 10SRE, 10ops-codfw, 10Patch-For-Review: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` db2145.codfw.wmnet ` The log can be found in `/var...