[05:11:51] 10DBA, 10Commons, 10MediaWiki-Database, 10MediaWiki-Special-pages, and 3 others: Special:Log/Fanghong results in fatal exception of type "Wikimedia\Rdbms\DBQueryTimeoutError" - https://phabricator.wikimedia.org/T199790 (10Marostegui) 05Open>03Resolved a:03Marostegui The partitioning has finished and...
[08:10:33] we have issues with es1019
[08:10:56] can I help?
[08:11:20] /dev/sda1 was not cleanly unmounted, check forced.
[08:11:33] /dev/sda1: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
[08:11:52] gah!
[08:12:11] at least I unmounted /srv before rebooting
[08:12:16] but the reimage failed
[08:12:42] broken disk? do the hw logs say anything? I can check if you didn't already
[08:13:57] the disks used to be fine
[08:14:25] are you running fsck to see what it reports?
[08:14:30] I already did
[08:14:33] also badblocks -v would be a good one to run
[08:14:36] it is only for /
[08:14:47] which I don't care about, because I am going to format it
[08:15:05] yeah, I am just worried in case it could be a sign of the storage misbehaving
[08:15:34] it is certainly the same RAID storage
[08:29:31] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 (10Marostegui)
[08:29:44] 10DBA, 10Patch-For-Review, 10Schema-change: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 (10Marostegui)
[08:29:47] 10Blocked-on-schema-change, 10DBA, 10Wikidata, 10Patch-For-Review, 10Schema-change: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 (10Marostegui)
[08:33:48] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 (10Marostegui) s2 eqiad progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [] dbstore1002 [] db1125 [] db1122 [] db1105 [] db1103 [] db1090 [] db1076 [] db1074 [] db1066
[08:33:50] 10DBA, 10Patch-For-Review, 10Schema-change: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 (10Marostegui) s2 eqiad progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [] dbstore1002 [] db1125 [] db1122 [] db1105 [] db1103 [] db1090 [] db1076 [] db1074 [] db1066
[08:33:53] 10Blocked-on-schema-change, 10DBA, 10Wikidata, 10Patch-For-Review, 10Schema-change: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 (10Marostegui) s2 eqiad progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [] dbstore1002 [] db1125 [] db1122 [] db1105 [] db1103 [] db1090 [] db10...
[10:26:29] [cross-posting] as a heads-up, we're planning to upgrade cumin on sarin shortly, so use neodymium instead if you need to run anything cumin-related. We'll advise when we're about to upgrade neodymium too
[13:27:30] I am now testing transferring 102GB from es2001 to es2002
[13:28:11] cool
[13:29:08] it is not fast; there are ways to improve the speed, I think
[13:29:23] (it does an md5sum before and after)
[13:30:05] that is pretty useful
[13:30:19] yes, but it doesn't start transferring right away
[13:30:44] yeah, but that is not a big deal, especially for automatic provisioning, where 1-2 hours more doesn't really matter
[13:30:47] but if it is fully unattended
[13:30:53] exactly
[13:30:57] if we think of a process where we can provision new nodes overnight
[13:30:58] we can optimize later
[13:31:00] or stuff like that
[13:31:01] yeah
[13:31:16] also it may always compress
[13:31:21] even .gzs
[13:33:27] It spent >7 minutes calculating the md5sum of 100GB
[13:34:52] now transferring
[13:36:33] so for a normal dataset of 800-900GB… it will not be a lot
[13:36:36] I think it is worth having it
[13:36:41] To ensure a healthy transfer
[13:40:13] yes, but the initial one could happen at the same time as the transfer
[13:40:23] there is room for improvement
[13:40:46] yeah, step by step
[13:43:53] I would recommend you get familiarized with it
[13:44:08] it is not properly deployed yet, but I have it in my home on sarin
[13:44:12] sure
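The improvement suggested at 13:40:13, computing the initial checksum at the same time as the transfer instead of as a separate >7-minute pass, could look roughly like this. This is a hypothetical sketch, not the actual transfer.py code; `stream_with_md5` and `sink` are invented names for illustration.

```python
import hashlib

def stream_with_md5(path, sink, chunk_size=1 << 20):
    """Read `path` in chunks, handing each chunk to `sink` (e.g. a
    socket or pipe write) while updating an md5 digest on the fly,
    so checksumming and transferring need only one read of the data.
    Hypothetical helper, not part of the real transfer.py."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            sink(chunk)
    return digest.hexdigest()
```

The receiving side would compute its own digest the same way and compare the two afterwards, keeping the "md5sum before and after" guarantee without the upfront wait.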
[13:44:15] happy to use it!
[13:44:47] and I just ran
[13:44:54] sudo ./transfer.py es2001.codfw.wmnet:/srv/backups/latest$ dump.s1.2018-07-17--17-51-56 es2002.codfw.wmnet:/srv/backups
[13:45:01] Yeah, I was just checking the usage
[13:45:15] I meant
[13:45:17] sudo ./transfer.py es2001.codfw.wmnet:/srv/backups/latest/dump.s1.2018-07-17--17-51-56 es2002.codfw.wmnet:/srv/backups
[13:45:27] * marostegui copying that to his notes
[13:45:32] if it is a file, it transfers it
[13:45:44] if it is a dir, it transfers the dir recursively
[13:47:49] I didn't time the start, but it took around 7m30s to do the md5sums, and then started transferring around 13:33
[13:48:26] Nice
[13:58:05] you mentioned 4 hosts currently available on eqiad, which ones? the 4 old sanitariums (95, 1102, 1116, 1120?)
[13:58:22] I think some of those should go to x1 at least
[13:58:31] and codfw-misc
[13:58:50] were you thinking of some others?
[13:59:18] for codfw we don't have available servers
[13:59:27] oh, I see
[13:59:33] they are all eqiad, true
[13:59:52] And yes, at least one eqiad one should go to x1
[13:59:57] Agree with that
[14:01:23] I will take the 2 oldest, test the transfer script, and set them up with test copies of production, also for testing automatic master promotion
[14:01:40] that should be 95 and 1102 I think
[14:01:45] yes
[14:02:00] Should I take 1116 and move it to x1?
[14:02:31] if it is something you would enjoy, for a change, yes
[14:02:41] sure
[14:02:42] but I don't consider it high priority right now
[14:02:45] yeah
[14:02:53] not going to do it right now
[14:03:13] as in, do it if you want to do it, if not, it can wait
[14:03:21] I will "reserve" it on the task so we don't step on each other's toes
[14:03:31] I can do that
[14:03:35] ah ok :)
[14:03:40] I will be productionizing the others
[14:03:42] sure
[14:05:58] [cross-posting] so far all good on sarin, we're planning to upgrade cumin/reimage/debdeploy/etc. on neodymium too, any blockers in progress?
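The verification behavior described at 13:45:32-13:45:44 (a file is checksummed as-is, a directory is walked recursively, with md5sums compared before and after the copy) can be sketched like this. This is a hypothetical reimplementation for illustration; `md5_manifest` is an invented name, and the real transfer.py may do this differently.

```python
import hashlib
import os

def md5_manifest(root):
    """Return {relative_path: md5} for everything under `root`.
    For a plain file, the manifest has one entry keyed by its
    basename; for a directory, the tree is walked recursively.
    Comparing the source and destination manifests after a copy
    gives the 'healthy transfer' check discussed in the channel.
    Hypothetical sketch, not the actual transfer.py logic."""
    def file_md5(path):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    if os.path.isfile(root):
        return {os.path.basename(root): file_md5(root)}
    manifest = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_md5(full)
    return manifest
```

A transfer would then be considered healthy when `md5_manifest(src) == md5_manifest(dst)`, which also explains the observed cost: the full dataset is read once per manifest, hence the ~7m30s for 100GB before the copy even starts.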
[14:06:14] volans: ok for me!
[14:06:18] not for me
[14:06:35] manuel may have some screens in progress, but I guess nothing related to cumin
[14:06:43] I am testing cumin a lot on sarin myself
[14:06:56] yeah, nothing affected by cumin
[14:07:04] I saw the screens but they seem to be just alters
[14:07:08] yeh :)
[14:07:10] I finished the reimage this morning
[14:07:21] perfect, thanks
[14:07:29] volans: good luck!
[14:07:55] (I am not going to reimage those spares, as they will not be used for production)
[14:12:42] 10DBA, 10Core-Platform-Team, 10Patch-For-Review, 10Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 (10Marostegui)
[14:25:29] [cross-posting] cumin/debdeploy/reimage/other-tools upgrade completed, let me know if you encounter any issues
[14:30:56] marostegui: good enough? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/446827/2/manifests/site.pp
[14:31:57] checking
[14:35:12] I just remembered one thing we still have to do after the s1 switch, the candidate switch
[14:36:27] Nope
[14:36:28] It is tracked :)
[14:36:30] check this
[14:36:50] https://phabricator.wikimedia.org/T199861
[14:36:54] 3rd line :)
[14:37:59] cool, thanks
[14:38:12] I thought it died on the switchover task
[14:38:43] that task was already too messy; I thought we'd better track it once we are ready to decom db1052 :)
[14:38:51] yes, thank you
[15:58:02] 10DBA, 10Core-Platform-Team, 10Patch-For-Review, 10Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 (10Jdforrester-WMF)
[16:20:16] 10DBA, 10Data-Services: Make Dispenser's principle_links table accessible in new Wiki replica cluster - https://phabricator.wikimedia.org/T180636 (10Framawiki) Ping @Dispenser :)
[16:47:01] jynus: do you remember which bug you were referring to in https://phabricator.wikimedia.org/T192875#4322491 and if there is already a task?
[17:00:08] myloader / recover_section.py doesn't recover empty databases
[17:00:29] aka our current backup/recovery system
[17:01:16] https://github.com/maxbube/mydumper/issues/110
[17:01:29] https://bugs.launchpad.net/mydumper/+bug/1558164
[17:11:15] ah, only that one, ok. do you want a task on phab too?
[17:59:26] 10DBA: DB backup restore skip empty databases - https://phabricator.wikimedia.org/T200035 (10Volans)
[17:59:35] done ^^^
[17:59:59] close the original one, then
[18:00:51] yeah, I was about to
[18:00:54] 10DBA, 10Operations-Software-Development, 10Patch-For-Review: Debmonitor: request for misc DB allocation - https://phabricator.wikimedia.org/T192875 (10Volans) Resolving as debmonitor is in production and the restore issue is tracked in T200035. Thanks a lot for the help!
[18:01:09] 10DBA, 10Operations-Software-Development, 10Patch-For-Review: Debmonitor: request for misc DB allocation - https://phabricator.wikimedia.org/T192875 (10Volans) 05Open>03Resolved
[18:01:11] 10DBA, 10Upstream: DB backup restore skip empty databases - https://phabricator.wikimedia.org/T200035 (10jcrespo)
[18:01:37] 10DBA, 10Upstream: DB backup restore skip empty databases - https://phabricator.wikimedia.org/T200035 (10jcrespo)
[18:02:53] 10DBA, 10Upstream: DB backup restore skip empty databases - https://phabricator.wikimedia.org/T200035 (10jcrespo) While the software is generically known as mydumper, mydumper actually dumps the databases; it is the myloader command that skips them, as it iterates only over existing lower-level objects.
[18:03:34] and thanks for the clarification
[18:04:12] it was written for my future self
[18:51:30] 10DBA: db1067 /srv usage is at 82% - https://phabricator.wikimedia.org/T200039 (10jcrespo)
[18:56:31] 10DBA: db1067 /srv usage is at 82% - https://phabricator.wikimedia.org/T200039 (10Marostegui) Those two files can go away, they are leftovers. However it will only free up 12GB. The main issue is: ``` root@db1067:/srv# du -sh * 1.5T sqldata 1.2T sqldata.s2.bak 12G tmp ``` That sqldata.s2.bak is probably from t...
[18:58:01] 10DBA: db1067 /srv usage is at 82% - https://phabricator.wikimedia.org/T200039 (10jcrespo)
[18:58:05] 10DBA: db1067 /srv usage is at 82% - https://phabricator.wikimedia.org/T200039 (10Marostegui) Actually it is totally fine to delete it, it has not been touched for years, ie: ``` -rw-rw---- 1 998 prometheus-node-exporter 1001M Mar 15 2017 db1067-bin.003293 ``` And it contains wikis that are not even from s1:...
[18:58:07] 10DBA: db1067 /srv usage is at 82% - https://phabricator.wikimedia.org/T200039 (10jcrespo)
[20:25:30] 10DBA, 10decommission: Decommission db1053 - https://phabricator.wikimedia.org/T194634 (10RobH)
[20:36:56] 10DBA, 10decommission: Decommission db1053 - https://phabricator.wikimedia.org/T194634 (10RobH) a:05RobH>03Cmjohnson
[20:37:22] 10DBA, 10Operations, 10decommission, 10ops-eqiad: Decommission db1053 - https://phabricator.wikimedia.org/T194634 (10RobH)
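The restore gap tracked in T200035 (mydumper dumps empty databases, but myloader skips them because it iterates only over table-level objects) could be detected after a restore with a check along these lines. This is a hypothetical helper, not part of recover_section.py, and it assumes mydumper's `<db>-schema-create.sql` / `<db>.<table>.sql` file-naming convention, which should be verified against the mydumper version in use.

```python
import glob
import os

def empty_databases(dump_dir):
    """Return the databases in a mydumper dump directory that have a
    schema-create file but no table files, i.e. the ones myloader
    would silently skip on restore. Hypothetical sketch based on the
    assumed <db>-schema-create.sql naming convention."""
    suffix = "-schema-create.sql"
    empty = []
    for path in glob.glob(os.path.join(dump_dir, "*" + suffix)):
        db = os.path.basename(path)[: -len(suffix)]
        # Table dumps are assumed to be named <db>.<table>*.sql.
        if not glob.glob(os.path.join(dump_dir, db + ".*")):
            empty.append(db)
    return empty
```

A recovery wrapper could then issue `CREATE DATABASE IF NOT EXISTS` for each name returned, restoring the table-less databases that myloader left out.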