[09:15:41] <_joe_> jynus: In a few minutes, I should be able to start the updateCollation.php scripts
[09:15:59] <_joe_> It'll run sequentially within each shard
[09:16:02] which sites?
[09:16:08] I have a schema change pending
[09:16:29] do you have a list of s3 wikis?
[09:16:36] <_joe_> you can find the lists in terbium:~oblivian/icu/
[09:16:48] thanks
[09:16:50] <_joe_> but if you tell me the wikis, I can check
[09:16:59] I can grep too :-)
[09:17:06] <_joe_> eheh fair enough
[09:28:34] <_joe_> jynus: can I start the scripts? let's start with s3 and turn on the others in batches?
[09:35:39] <_joe_> Ok, I am taking a brief break before starting the scripts, let me know when it's ok for me to proceed
[09:51:48] yes, start them all
[09:51:57] I will work around it
[09:58:12] <_joe_> jynus: ack, doing it
[09:59:00] can you run it on a screen or make it log so I can check the state when you are not around?
[10:00:15] <_joe_> jynus: it's running on a screen in terbium
[10:00:29] <_joe_> under my user, so sudo -u oblivian screen -r
[10:00:39] <_joe_> 4 windows, one for each shard
[10:00:45] great
[10:01:05] <_joe_> would you prefer logfiles instead of stdout?
[10:01:41] <_joe_> it's early enough that I can stop the process and log to logfiles instead
[10:01:46] it is ok
[10:02:14] I just need to know the current db so I don't do things in parallel
[10:04:02] <_joe_> and then this output is not good
[10:04:08] <_joe_> let me stop and tweak that
[10:05:57] what I usually do is: echo "`date`: $host $db $table" to know the current state and how long it took
[10:10:33] <_joe_> jynus: ok, so now the output of the script is in ~oblivian/icu/$wiki.log
[10:10:43] <_joe_> while in the screen you see the currently running wikis
[10:10:59] <_joe_> but you can get the same list across all shards using ls -lart in that directory :)
[10:11:57] great - logging is actually important because many of our scripts don't have correct exception-like handling
[10:16:58] I think in the worst-case scenario I can check the database
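
The per-wiki logging pattern discussed above can be sketched roughly like this; it is a minimal illustration rather than the script that was actually run, and it assumes per-shard dblist files (e.g. s3.dblist) and the standard mwscript wrapper available on terbium:

    # Run updateCollation.php for each wiki of one shard, sequentially,
    # with timestamped start/end markers and the script output logged per wiki.
    for wiki in $(cat s3.dblist); do
        echo "$(date): starting updateCollation on $wiki" >> ~/icu/"$wiki".log
        mwscript updateCollation.php --wiki="$wiki" >> ~/icu/"$wiki".log 2>&1
        echo "$(date): finished $wiki (exit $?)" >> ~/icu/"$wiki".log
    done
    # ls -lart ~/icu/ then shows which logs were touched most recently,
    # i.e. which wiki each shard is currently processing.
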
[10:29:28] <_joe_> it runs much faster with output sent to a file, damn php output
[10:31:18] that is more or less normal for anything - sending to a console (or screen) probably has more overhead than the pure filesystem cache; there is also the observer effect - sending it through ssh "looks" slower
[10:31:34] <_joe_> uhm, I am noticing
[10:31:46] <_joe_> mwscript uses php5
[10:31:52] oh
[10:32:06] how is that?
[10:32:44] <_joe_> that is because there was an issue with hhvm that no one has fixed since then, I'd say
[10:34:41] <_joe_> but it's ok as php5 is linked against libicu52 too
[10:48:24] <_joe_> sadly frwiki is slower than any other wiki
[10:48:44] well, that is expected
[10:48:51] not only do things not scale linearly
[10:49:00] it will also be the busiest server
[10:49:07] <_joe_> fair enough
[10:49:46] e.g. enwiki may have 1/3 of the data, but maybe 1/2 of the activity
[10:50:32] a small wiki may not have activity for hours if its community is very localized to a single country
[10:50:37] <_joe_> it is doing ~1.4-1.5 M rows per hour, compared to cswiki (s2) and fawiki (s7), which are doing 3 M/hour
[10:52:07] also, s7 is a bit overprovisioned because it is also a SPOF for some things
[10:52:44] s2 gets high activity until our afternoon/evening
[10:53:27] at some point we may reshard some wikis, but it is so much work for so little reward that things would have to get worse before we consider it seriously
[10:54:41] <_joe_> :)
[10:56:30] wikidata may be the #1 candidate
[10:56:43] if it keeps growing at the same pace
[17:38:48] looks like it crashed while flushing pages
[17:39:13] it is still booting
[17:40:10] initializing 40000 tables is not fast
[17:40:47] yeah, do you want me to open a ticket with the stacktrace in the meantime?
[17:41:00] https://phabricator.wikimedia.org/T136333
[17:41:08] it is the standard OOM thingy
[17:41:12] not worth it
[17:41:26] only the aftermath will be the issue
[17:42:09] why do you think it was OOM?
[17:42:38] did you restart it because of the upgrade or did you powercycle it?
[17:42:48] I restarted it
[17:43:02] only mysql crashed, not the host
[17:43:47] why? generic assert + ongoing imports from this host + discovering large queries happening just yesterday
[17:44:04] plus the graphs show a slow increase in memory until that time
[17:44:13] ok, but it was not the oom_killer
[17:44:24] mysqld got signal 11
[17:44:59] yes
[17:45:22] the trace goes down to libc.so.6(clone+0x6d)
[17:46:28] I mean, am I 100% certain? no, but probably 99%
[17:49:42] :)
[17:50:05] Occam's razor
[17:53:37] toku's recovery is quite slow
[17:53:45] you tell me :-)
[17:54:11] but I do not think I have to explain my love for that engine any further
[17:54:26] to be fair, it is a lot of objects
[17:54:37] yeah, I was trying to estimate, to decide whether to take the bus home now or wait until it starts
[17:54:55] nah, why? the fun will be later!
[17:55:38] I will update the ticket with the state when it starts
[17:55:41] yeah, I'll probably be more helpful later from home... going to get the bus, feel free to ping me on hangout anytime ;)
[18:03:44] If I enabled GTID on dbstore1002, I will want to kiss myself
[18:47:10] <_joe_> I am trying to imagine jynus kissing himself
[19:02:38] rotfl... back online (finally)
[19:02:43] how can I help?
[19:06:10] fwiw, there is a disk in predictive failure on dbstore1002
[19:16:23] * volans getting some dinner, I see everything is already working, only s1 is behind
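
For reference, the "enabled GTID on dbstore1002" remark refers to GTID-based replication, which on a MariaDB 10.x replica is normally switched on with a couple of statements. A minimal sketch, assuming a single replication connection and root access via sudo on the host (a multi-source replica like dbstore1002 would need the connection name, e.g. STOP SLAVE 's1', in each statement); this illustrates the mechanism and is not what was actually executed:

    # Switch the replica from binlog coordinates to GTID positioning,
    # then confirm it took effect and check the lag (cf. "only s1 is behind").
    sudo mysql --execute "STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID = slave_pos; START SLAVE;"
    sudo mysql --execute "SHOW SLAVE STATUS\G" | grep -E 'Using_Gtid|Seconds_Behind_Master'
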