[09:29:06] jynus: i hear you're having fun with labs grants? :) are we in trouble, or is it fixable
[09:29:32] talk to me in private
[10:25:31] so, as you can see, nothing unfixable is happening
[10:26:24] I will attend today if something happened to be broken, but I doubt it
[10:28:38] we also need to fix a number of things for jessie, I commented details in the right ticket
[10:55:47] which ticket?
[10:56:44] T101516
[10:57:22] actually, there are no details there
[10:57:38] phew.. i thought i was blind :)
[10:57:39] but there will be
[10:58:47] I think T101516 can be unsecured
[10:59:06] I stopped working on it because of the grants issues
[10:59:36] ongoing discussion with analytics about eventlogging purge on m4. it's going to take some time
[10:59:51] on gerrit?
[11:00:11] ah
[11:00:17] very important
[11:00:21] right, you saw the changeset https://gerrit.wikimedia.org/r/#/c/221561/. but there are new requirements
[11:00:41] before I change subject
[11:00:53] regarding log db and other analytics
[11:01:18] in the long run, maybe thinking about column stores instead of tokudb?
[11:01:28] featuring, for example, eventlogging sanitization. which may spill over into our sanitarium discussion.
[11:01:51] yeah, column store was always on the list of options
[11:02:05] so, now regarding something more important
[11:02:20] tokudb's secondary clustered indexes are also worth more trial
[11:02:34] codfw datastores (the ones with all data)
[11:02:49] datastores == dbstore200x?
[11:02:59] yes, I think
[11:03:40] I am being vague because I do not know how many dbs are affected
[11:04:01] there seem to be replication issues on s7
[11:04:16] I created a table yesterday on several masters
[11:04:52] it is not on dbstore2001 or dbstore2002 (or either)
[11:05:16] checksums do not work there either
[11:05:38] I have not investigated further, but wanted to ask if there is anything I may be missing
[11:06:01] like replication delay or filtering
[11:06:01] nothing to be aware of. that seems like a bug
[11:06:18] dbstore1001 is the only delayed slave so far
[11:06:39] ok, then I'll file a bug and keep you updated
[11:06:42] dbstore1002 dbstore2001 dbstore2002 should all replicate normally
[11:07:11] seems to be happening for only one shard- maybe some tokudb issues
[11:07:16] s7 is also centralauth, as we discussed. maybe something to do with replication rules?
[11:07:34] yeah, maybe. but odd that an entire new table would not appear
[11:07:37] very odd
[11:07:51] I did like a first quick check and saw nothing- it is also not centralauth related, they are wikis
[11:08:19] will now stop replication and do a proper check, check exec_master, etc.
[11:08:27] s/now/today/
[11:09:37] double check all the shards
[11:09:59] springle, I found because I was actually doing that
[11:10:15] codfw dbstore* replicate from codfw "masters" too. maybe something to do with that
[11:10:22] or server_id
[11:10:31] and I intend to do all shards, all servers, all wikis, all tables eventually
[11:10:36] cool
[11:10:49] yes, the primary master at codfw
[11:10:52] is ok
[11:11:01] stranger and stranger
[11:11:02] that I am sure
[11:11:17] yeah, that is why I wanted to double check with you
[11:12:02] I will put that as a priority
[11:13:31] so something I mentioned yesterday is that I am being a pain in the neck for not trusting anyone
[11:13:50] on the good side, I am finding issues :-)
[11:14:05] (and fixing them myself)
[11:14:35] gone for lunch
[11:15:04] :D
[11:19:10] it is as if that server lacked "heartbeat" <--- i am going to hell for this
[11:22:51] if you find a replication bug i'll switch back to being a heartbeat fan, instantly
[12:55:39] i've documented issues found on T101516 on the very same ticket
[12:57:31] to clarify, issues mentioned on T101516 have been fixed manually, and are things I would like to fix for future installations
[16:17:33] I've created T104459 (table checks). Will add code to software/dbtools- but it is for now just a couple of 5-line bash scripts
[17:54:15] restarting slave on dbstore2002
[18:16:03] ERROR 1044 (42000) at line 2: Access denied for user 's52299'@'%' to database 's52299_p'
[18:16:15] (labsdb)
[18:16:38] could be related to grant changes
[18:17:10] yeah, I was thinking the same
[18:17:14] although s52299_p is a bad name of a database unless you want it world writtable
[18:17:35] sorry world writtable
[18:17:40] yes, I can tell it
[18:17:58] I'll just use another name
[18:18:12] you do not own databases starting with 's52299_'
[18:18:25] only the ones starting with s52299__
[18:18:43] the fact that you could create the previous one was a bug
[18:19:03] interesting
[18:19:10] okay, using two underscores then. thanks
[18:19:10] I can rename the database to 's52299__something'
[18:19:21] is that ok?
[18:19:49] which name do you prefer? I would recommend against __p
[18:20:41] It's a temporary thing so I'll just drop the __p database later
[18:20:45] and give it a sane name next time
[18:20:51] thanks anyway
[18:21:19] well, I will rename to s52299__something, you can delete later
[18:21:38] ok..
[18:21:41] let me find where it is
[18:21:43] not on 1
[18:22:03] not on 2
[18:22:40] not on 3
[19:22:40] springle: it is getting late, as it is not affecting production, but will probably involve hours of fixing, please try to work on T104471
[20:30:03] ^I've just reread this and it sounds bad. I am asking you, if you have the time, to restart dbstore200[12]'s marias. I will continue with it tomorrow no matter what.
[23:08:43] jynus: gotcha! will do restarts
[23:39:36] restart on dbstore2002 had no effect. s7 repl still running Yes/Yes, yet no changes applied
[23:40:28] s7 sql thread caught up almost instantly after restart, while other shards took time
[23:40:57] suspiciously fast. as though replication rules are just skipping everything /guess
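
Note on the labsdb naming rule from the 18:18 exchange: a user owns only the databases whose names start with the username followed by two underscores, so 's52299_p' is outside the namespace of 's52299'. A minimal sketch of that check in Python, assuming nothing beyond what the log states. The function name `owns_database` is hypothetical and is not part of any labsdb tooling; the real enforcement happens through database grants, which this does not reproduce.

```python
def owns_database(user: str, database: str) -> bool:
    """Return True if `database` is in the user's own namespace.

    Per the log, the namespace is the username followed by TWO
    underscores (e.g. 's52299__'); a single underscore is not enough.
    This helper is illustrative only.
    """
    prefix = user + "__"
    return database.startswith(prefix)


if __name__ == "__main__":
    # The name from the ERROR 1044 message: single underscore, not owned.
    print(owns_database("s52299", "s52299_p"))
    # The name proposed for the rename: double underscore, owned.
    print(owns_database("s52299", "s52299__something"))
```

By this rule 's52299__p' would also be owned by the user, which matches the later remark about dropping "the __p database"; the advice against that name is about naming, not ownership.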