[06:06:30] 10Data-Persistence-Backup, 10SRE, 10SRE-swift-storage, 10Traffic, 10netops: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (10Krinkle)
[06:06:46] 10Data-Persistence-Backup, 10SRE, 10SRE-swift-storage, 10Traffic, and 2 others: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (10Krinkle)
[08:29:43] marostegui, jynus: cumin1001 reboot good to go from a DB/backup perspective?
[08:30:35] ok for me, there is 1 attached screen, tough 30077.T273359
[08:30:36] T273359: Schema change for renaming name_title_timestamp on archive table - https://phabricator.wikimedia.org/T273359
[08:30:46] *but
[08:31:15] that is probably either marostegui or kormat
[08:32:11] ack
[08:32:17] probably manuel, I think kormat uses tmux?
[08:36:27] oh, a tmux also seems to have some running stuff
[08:38:10] but they are just idle mysql connections, so that should be ok
[08:38:50] and the attached connection is idle too
[08:40:31] moritzm, manuel is on vacation today, so I am 99% we are ok to proceed
[08:40:33] *sure
[08:40:51] I don't see any activity, only people with open idle terminals
[08:41:31] 👍 to reboot
[08:41:44] moritzm, so go ahead
[08:43:06] thanks! going ahead in ~5m
[09:16:02] 10Data-Persistence-Backup, 10Analytics: Matomo database backup size doubled, we should check this is normal operation - https://phabricator.wikimedia.org/T272344 (10jcrespo) FYI: ` [dbbackups]> select section, start_date, total_size, REPEAT('▄', total_size/20000000) as graph FROM backups where section='matomo'...
[09:16:45] ^do you like my ascii-art prowess?
[09:17:45] amazing :D
[09:18:10] 10Data-Persistence-Backup, 10Analytics: Matomo database backup size doubled, we should check this is normal operation - https://phabricator.wikimedia.org/T272344 (10elukey) @razzi can you check? :)
[09:18:33] jynus: 🧑‍🎨
[09:20:43] not as nice as shlomi's pie charts ( http://code.openark.org/blog/mysql/sql-pie-chart ), but they get the job done :-)
[09:21:27] whoa
[09:35:47] kormat: hi, I originally thought you could deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/666216/, but fine if you don't want to :)
[09:36:17] Urbanecm: ahh, i wasn't sure of the workflow/not very awake yet. i'm happy to deploy it.
[09:36:59] kormat: thanks! We want to create the tables soon, and I believe we should wait for OK from DBAs before non-public tables are created :)
[09:37:13] that sounds plausible :)
[09:38:49] :)
[09:38:52] thanks for your help kormat
[09:45:19] Urbanecm: you're all set for table creation now
[09:45:29] thank you kormat, appreciate it!
[09:45:36] my pleasure :)
[10:06:24] 10DBA, 10SRE: Decom dbmonitor2001 - https://phabricator.wikimedia.org/T274496 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by kormat@cumin1001 for hosts: `dbmonitor2001.wikimedia.org` - dbmonitor2001.wikimedia.org (**PASS**) - Downtimed host on Icinga - Found Ganeti VM - VM shutdown...
[10:20:42] 10DBA, 10SRE, 10Patch-For-Review: Decom dbmonitor2001 - https://phabricator.wikimedia.org/T274496 (10Kormat) 05Open→03Resolved dbmonitor2001 was indeed unused, and is now decommissioned.
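The REPEAT('▄', …) query in the 09:16 update is truncated, but the technique is generic: scale each backup's total_size down to a count of block characters so a size change (such as the doubling in T272344) is visible straight from the terminal. Below is a minimal Python sketch of the same idea, assuming a backups table with section/start_date/total_size columns as shown above; the pymysql driver, the placeholder credentials and the ORDER BY are illustrative assumptions, not the exact query or client that was used.

import pymysql  # assumed client library; any DB-API driver would work the same way

# Same idea as the truncated query above: one '▄' per ~20 MB of backup.
QUERY = """
SELECT section, start_date, total_size,
       REPEAT('▄', FLOOR(total_size / 20000000)) AS graph
FROM backups
WHERE section = %s
ORDER BY start_date
"""

def print_backup_size_graph(section):
    # Host and credentials are placeholders, not the real dbbackups setup.
    conn = pymysql.connect(host="db.example.org", user="reader", password="...",
                           database="dbbackups", charset="utf8mb4")
    try:
        with conn.cursor() as cur:
            cur.execute(QUERY, (section,))
            for sec, start_date, total_size, graph in cur.fetchall():
                print(f"{sec:10} {start_date} {total_size:>14} {graph}")
    finally:
        conn.close()

if __name__ == "__main__":
    print_backup_size_graph("matomo")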
[13:29:21] 10DBA: mariadb: Replication lag monitoring does not support circular replication - https://phabricator.wikimedia.org/T275497 (10Kormat)
[13:31:49] 10DBA: mariadb: Replication lag monitoring does not support circular replication - https://phabricator.wikimedia.org/T275497 (10Kormat)
[13:40:03] 10DBA: mariadb: Replication lag monitoring does not support circular replication - https://phabricator.wikimedia.org/T275497 (10Kormat) [[ https://github.com/wikimedia/puppet/blob/2ae5c9682b0eaa22f3ad17f3df27f3dedd6f8a50/modules/profile/manifests/mariadb/replication_lag.pp#L6-L13 | profile::mariadb::replication_...
[13:41:20] if y'all are interested in naming things: https://phabricator.wikimedia.org/T275175#6852471
[13:42:57] godog: if you can frame it in terms of wording of OKRs, sobanski is your man. ;)
[13:45:33] godog: Let's call it "misc storage", ms for short
[13:45:36] Oh, wait...
[13:45:39] ;)
[13:45:56] lol kormat sobanski
[13:46:24] time to test those utf8 hostnames after all (?)
[13:47:12] I like ssoss so far but can't type it, sounds almost like 'sorry' so 'soz' for short
[13:47:29] Does it have to be short? Can't we just do swift-misc?
[13:50:49] moss - misc object storage service
[13:51:08] yeah I'm a big advocate of short hostnames, we still type them a lot
[13:51:30] godog: 👍
[13:52:20] yeah moss we can do I think, ties nicely with mr Ayoade
[13:52:29] :>
[13:52:36] +1
[13:52:44] And we can always turn it off and on again
[13:52:59] godog: :D
[13:53:05] hehe indeed
[13:53:47] kormat: I think you should update the task and take the due credit :D
[13:54:22] haha, done
[13:55:32] other simple options: file1001 (pro: no other server starts with 'fi'), store1001
[13:56:03] oh wait, those are the frontends, my bad
[13:56:06] I thought the backend
[13:56:49] godog: nfs (not a filesystem), to reduce confusion
[13:57:08] :-P
[13:57:27] haahha yeah
[13:58:02] one of those costly decisions that don't and can't show up on balance sheets
[13:58:22] but yeah there will be the -fe / -be suffixes appended
[13:58:59] going for 'moss' unless there are objections
[13:59:12] missed opportunity there on object-ions
[13:59:18] ah, then I restore my proposals: file-fe1001, store-fe1001 and I add object-fe1001
[14:00:11] but not, object-fe1001 is not an objection to moss :)
[14:00:15] *but no
[14:00:38] hehe yeah so far I think moss is the winner
[14:03:12] ok going for 'moss'
[14:39:26] godog, I had some plans to start the stress testing today, but I ran into some unexpected issues-
[14:40:19] I found more "lost" files on the db, such as archived https://commons.wikimedia.org/wiki/File:%D0%AF%D1%83%D0%B7%D1%81%D0%BA%D0%B0%D1%8F_%D0%B1%D0%BE%D0%BB%D1%8C%D0%BD%D0%B8%D1%86%D0%B0._%D0%97%D0%B0%D0%BF%D0%B0%D0%B4%D0%BD%D1%8B%D0%B9_%D1%84%D0%BB%D0%B8%D0%B3%D0%B5%D0%BB%D1%8C..JPG
[14:41:41] poor files :(
[14:41:43] jynus: ack
[14:42:29] so i am making sure I don't end up with null pointers by checking every way in which the db can be incomplete
[14:44:12] I was expecting a certain amount of errors, and to just log/register those on the db, but others are new for commons
[15:23:29] 10DBA, 10SRE, 10ops-eqiad: Degraded RAID on db1103 - https://phabricator.wikimedia.org/T275266 (10Cmjohnson) 05Open→03Resolved This appears to have been done by @Jclark-ctr
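To make the 14:42 point concrete — checking every way a db row can be incomplete before dereferencing it, and registering the problem rather than crashing on a null pointer — here is a hedged Python sketch. The FileRow shape, field names and error list are hypothetical illustrations, not the actual commonswiki metadata or backup code.

from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class FileRow:
    # Hypothetical subset of the per-file metadata a backup run might read.
    wiki: str
    title: Optional[str]
    sha1: Optional[str]
    size: Optional[int]

def is_complete(row: FileRow, errors: List[Tuple[str, Optional[str], str]]) -> bool:
    """Return True if the row can be safely queued; otherwise register why not."""
    problems = []
    if not row.title:
        problems.append("missing title")
    if not row.sha1:
        problems.append("missing sha1")
    if row.size is None or row.size < 0:
        problems.append("missing or invalid size")
    if problems:
        errors.append((row.wiki, row.title, "; ".join(problems)))
        return False
    return True

# Usage sketch: filter a batch before downloading, keep the error log for review.
errors: List[Tuple[str, Optional[str], str]] = []
rows = [FileRow("commonswiki", "Example.jpg", "7000b23c36", 12345),
        FileRow("commonswiki", None, None, None)]
queue = [r for r in rows if is_complete(r, errors)]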
[15:47:09] 10Data-Persistence-Backup, 10SRE, 10SRE-swift-storage, 10Traffic, and 2 others: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (10jcrespo) I have now started 10 threads reading and retrieving commonswiki files to its temporary backup location at dbprov2003. dbprov2003 only has 8TB a...
[15:48:03] ^godog I don't expect to cause any issues with just 10 threads, but just in case
[15:48:53] I think I will hit cpu limits before swift io ones
[15:49:11] jynus: ack, thank you for the heads up, LMK how it goes
[15:49:44] initial throughput is reasonable, 300MBytes/s, but we will see if it sustains
[15:55:10] yeah pretty sure you can crank things up on the swift side at least
[15:55:57] the cluster has noticed the start of operations a bit: https://grafana.wikimedia.org/d/000000607/cluster-overview?viewPanel=84&orgId=1&var-site=codfw&var-cluster=swift&var-instance=All&var-datasource=thanos&from=1614052553853&to=1614095753853
[15:56:00] but not by much
[16:01:19] I think, however, that we are getting a much better latency for reads
[16:01:55] which means we may have to either depool swift when doing the real backup, or eat the slowdown due to user reads
[16:07:14] I was very conservative in terms of memory- I am "only" using 3GB with 15 threads
[16:07:41] although filesystem cache usage is very large
[16:13:56] and errors so far are reported nicely: e.g. [2021-02-23 16:12:51,583] ERROR:backup Download of "commonswiki "Варила_я_рибку",_"Красоля,"_Полонне,_Україна.webm 7000b23c3602e1ca2893876b4e88ef7484251911" failed
[16:14:19] ^this is because from the time it was detected until it was downloaded, it had been renamed
[16:16:21] 10DBA, 10SRE, 10ops-eqiad: db1162 crashed - https://phabricator.wikimedia.org/T275309 (10Cmjohnson) I pulled the power and drained flea power, plugged back in and the server will not even power up.
[16:21:53] :-(
[16:22:51] 10DBA, 10SRE, 10ops-eqiad: db1162 crashed - https://phabricator.wikimedia.org/T275309 (10Cmjohnson) Created a dispatch with Dell SR1052419298
[16:23:04] I am going to restart the backups, now using the SSDs
[16:36:25] 10DBA, 10netbox: Grants not working with DB hosts with to ipv6 - https://phabricator.wikimedia.org/T270101 (10crusnov)
[16:36:28] 10DBA, 10SRE-tools, 10IPv6: Some Data Persistence DB clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271140 (10crusnov)
[16:52:22] 10DBA, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Papaul)
[16:52:52] so the HDs show themselves to be a bottleneck for random writes- moving to the SSD, the bottleneck moves to the network
[16:53:44] I will leave the backup running for a while to see if network usage is sustained over a longer period of time
[16:56:27] if it works well, tomorrow I will set up ms-be2016 to check backups will scale pseudo-linearly with the number of machines
[17:32:24] 10DBA, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Papaul)
[19:16:31] 10DBA, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Papaul)
[19:44:40] 10DBA, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Papaul)
[19:46:14] 10DBA, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Papaul) >>! In T273568#6809673, @Marostegui wrote: > These hosts have been added to puppet with: `insetup` role and also assigned a partman recipe for the installation. > The on...
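As a rough sketch of the kind of multithreaded retrieval described at 15:47 — a small number of reader threads pulling commonswiki originals out of Swift, with failures reported in the '[timestamp] ERROR:backup Download of "…" failed' shape shown at 16:13 — something like the following would do. The python-swiftclient usage, auth endpoint, task tuple layout and target path are assumptions, not the actual backup tool.

import logging
from concurrent.futures import ThreadPoolExecutor

from swiftclient.client import Connection  # python-swiftclient is an assumed dependency

# Reproduces the "[2021-02-23 16:12:51,583] ERROR:backup ..." log shape shown above.
logging.basicConfig(format="[%(asctime)s] %(levelname)s:%(name)s %(message)s",
                    level=logging.INFO)
logger = logging.getLogger("backup")

# Auth endpoint, account and target path are placeholders, not real infrastructure.
AUTH = dict(authurl="https://swift.example.org/auth/v1.0", user="backup:ro", key="secret")

def download(task):
    """Fetch one object; failures (e.g. files renamed mid-run) are logged, not fatal."""
    wiki, container, name, sha1 = task
    try:
        # One connection per task keeps the sketch thread-safe, at the cost of re-auth.
        conn = Connection(**AUTH)
        _headers, body = conn.get_object(container, name)
        with open(f"/srv/backups/{sha1}", "wb") as out:
            out.write(body)
        return True
    except Exception:
        logger.error('Download of "%s %s %s" failed', wiki, name, sha1)
        return False

def run(tasks, threads=10):
    # 10 reader threads, matching the scale of the test described above.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(download, tasks))
    logger.info("%d downloads ok, %d failed", sum(results), len(results) - sum(results))

Logging failures instead of raising matches the behaviour described above: renamed or missing files are expected, get recorded, and the run continues.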
[20:28:25] PROBLEM - MariaDB sustained replica lag on db1147 is CRITICAL: 20.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1147&var-port=9104
[20:29:41] RECOVERY - MariaDB sustained replica lag on db1147 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1147&var-port=9104
[23:42:29] 10DBA, 10DC-Ops, 10SRE, 10ops-codfw, 10Patch-For-Review: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Papaul)
[23:46:57] 10DBA, 10DC-Ops, 10SRE, 10ops-codfw, 10Patch-For-Review: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` db2145.codfw.wmnet ` The log can be found in `/var...