[00:18:07] PROBLEM - Hue Server on thorium is CRITICAL: PROCS CRITICAL: 0 processes with command name python2.7, args /usr/lib/hue/build/env/bin/hue
[00:27:26] PROBLEM - Hue Server on thorium is CRITICAL: PROCS CRITICAL: 0 processes with command name python2.7, args /usr/lib/hue/build/env/bin/hue
[01:18:57] PROBLEM - Hue Server on thorium is CRITICAL: PROCS CRITICAL: 0 processes with command name python2.7, args /usr/lib/hue/build/env/bin/hue
[01:19:34] hmm
[01:29:07] RECOVERY - Hue Server on thorium is OK: PROCS OK: 1 process with command name python2.7, args /usr/lib/hue/build/env/bin/hue
[01:31:37] !log restarted hue (a few minutes ago), not totally sure why it died
[01:31:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[02:34:17] PROBLEM - Hue Server on thorium is CRITICAL: PROCS CRITICAL: 0 processes with command name python2.7, args /usr/lib/hue/build/env/bin/hue
[02:45:00] 10Analytics, 10Analytics-Cluster: rsync-published-datasets cron should not launch multiple rsync processes - https://phabricator.wikimedia.org/T174756#3572217 (10Ottomata)
[07:12:34] Hue down again?
[07:15:36] RECOVERY - Hue Server on thorium is OK: PROCS OK: 1 process with command name python2.7, args /usr/lib/hue/build/env/bin/hue
[07:19:19] ah all the analytics websites are down
[07:21:55] ok apache was down, just restarted
[07:27:59] so something weird started yesterday at around 20:30 UTC - https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=thorium&refresh=1m&panelId=19&fullscreen&orgId=1&from=now-2d&to=now
[07:28:42] I can see "rebuilt wikiversions.php and synchronized wikiversions files: (no justification provided)" in the SAL
[07:28:55] can this trigger a huge amount of rsyncs?
[07:31:20] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Set up LVS and VirtualHost for RunSingleJob.php - https://phabricator.wikimedia.org/T174599#3572471 (10Joe)
[07:47:20] yep it seems like a lot of rsyncs
[07:52:34] there is some iowait still ongoing but the situation seems stable for the moment
[07:52:39] let's see how it goes
[08:41:30] Hi elukey
[08:41:39] Thanks for investigating that thing
[08:41:47] I wouldn't even have known where to look
[08:43:48] joal: Hi!
[08:43:53] first draft of https://wikitech.wikimedia.org/wiki/Incident_documentation/20170831-Zookeeper out :(
[08:46:35] elukey: Just read it - Very clear, precise and concise - Good job !
[08:52:51] thanks!
[09:17:51] the thing that scares me is that if the outage had happened in eqiad we'd have seen a bigger impact in Hadoop etc..
[09:18:25] elukey: Hadoop doesn't rely on Zookeeper IIRC
[09:20:21] the master nodes do
[09:20:51] Hm, Namenodes do for HA only I think, no ?
[09:22:30] yep yep
[09:22:43] not sure if they also store other things
[09:22:52] and I have no idea what happens if zk is down
[09:22:54] so, it probably would have caused trouble for syncing nodes, but service would have continued?
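A minimal sketch of how one might probe whether ZooKeeper itself is answering, which is essentially what breaks commands like `kafka topics --describe` during an outage like the one in the incident report above. The ensemble hostname is a placeholder, and the kazoo client is just one convenient way to do this, not anything deployed on the cluster:

```
# Probe a ZooKeeper ensemble: connect and list the top-level znodes.
# The host below is a placeholder; requires the `kazoo` Python client.
from kazoo.client import KazooClient
from kazoo.handlers.threading import KazooTimeoutError

zk = KazooClient(hosts="zookeeper.example.org:2181")
try:
    zk.start(timeout=5)  # raises KazooTimeoutError if no ensemble member answers
    print("znodes at /:", zk.get_children("/"))
except KazooTimeoutError:
    print("ZooKeeper did not answer within 5s")
finally:
    zk.stop()
```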
[09:23:02] I hope so :D
[09:23:02] That's an interesting question :)
[09:23:14] Kafka impressed me since eventbus didn't register any weirdness
[09:23:52] but the last time I accidentally restarted the journalnodes without waiting long enough between them, the master node decided to stop
[09:23:55] kafka doesn't need ZK anymore IIRC, only consumers (for syncing group index)
[09:24:05] so I don't assume anything anymore :D
[09:24:11] huhu
[09:24:28] so consumers/producers do not need zookeeper now, but Kafka does for topic metadata
[09:25:09] and other things that I don't remember
[09:25:20] for example I was unable to execute kafka topics --describe
[09:25:25] when zk was down
[09:25:39] aaaand joal, I know that you will not like it but, all the Kafka ACLs will be stored in Zookeeper
[09:25:51] elukey: I don't mind it :)
[09:25:56] this is what I discovered while playing with it
[09:26:07] so if we enable them and zk is down, kafka probably kaput
[09:26:14] elukey: it just means there'll be tighter deps between kafka and zk :)
[09:26:19] hahahaha
[09:26:22] no prob for that :)
[09:26:45] * elukey reads between Joseph's lines that he is too paranoid
[09:27:14] 10Analytics, 10Operations, 10Traffic, 10Patch-For-Review: Implement Varnish-level rough ratelimiting - https://phabricator.wikimedia.org/T163233#3572696 (10ema) 05Open>03Resolved a:03ema We've been using `vsthrottle` in prod for a while now, closing.
[09:27:35] elukey: I have this idea of drawing a high-level architecture plan of our systems, showing dependencies and their types (functional, internal, critical or not)
[09:28:21] I like the idea, might be tough to keep it in sync but definitely useful
[09:51:02] joal: atm we have some issues on thorium, I think that some rsyncs are causing trouble for various services
[09:51:31] elukey: I've seen the alerts and your message earlier - it's really rsync that causes the problem?
[09:53:44] elukey: How can I try to help?
[09:54:37] not sure, do you know if something like "rebuilt wikiversions.php and synchronized wikiversions files: (no justification provided)" could cause a large number of files to get re-created?
[09:55:03] elukey: I don't even know what those files do :(
[09:56:12] all right, I'll dig into the rsyncs :)
[09:57:31] wow IO wait spikes are massive https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=thorium&refresh=1m&orgId=1
[09:58:44] doesn't make any sense why the oom is killing
[10:03:22] so I killed the rsyncs, and I have a theory
[10:03:43] root runs every 15 mins: /usr/local/bin/hardsync -t /srv /srv/published-datasets-rsynced/* /srv/analytics.wikimedia.org/datasets 2>&1 > /dev/null
[10:04:22] and I see spikes every 15 mins in io load
[10:04:53] so maybe when the second rsync kicks in after 15 mins, the first one is still running because of huge files
[10:05:05] adding itself to the mess
[10:05:51] files are in disk cache and the oom decides to kill something else when it needs moar space for new files
[10:05:57] like pivot, hue, etc..
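T174756 (filed above at 02:45) asks for exactly this: the cron should not launch overlapping rsync processes. Below is a minimal sketch of a flock-style guard that matches the pile-up theory just described; the lock path, source/destination and rsync options are illustrative placeholders, not the actual hardsync or published-datasets-sync implementation:

```
#!/usr/bin/env python3
# Single-instance guard for a cron'd rsync: if the previous run is still
# copying (e.g. a huge file), skip this tick instead of piling up.
# All paths and the destination host are illustrative placeholders.
import fcntl
import subprocess
import sys

LOCK_FILE = "/var/lock/published-datasets-rsync.lock"  # hypothetical path

def main():
    lock = open(LOCK_FILE, "w")
    try:
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # A previous run is still going; exit instead of adding another rsync.
        print("previous rsync still running, exiting")
        return 0
    return subprocess.call([
        "rsync", "-a", "--delete",
        "--timeout=3600",  # idle-I/O timeout, larger than the 1000s discussed later
        "/srv/published-datasets/",                 # illustrative source
        "thorium.example.org::published-datasets/", # illustrative destination
    ])

if __name__ == "__main__":
    sys.exit(main())
```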
[10:06:04] not sure if it makes sense
[10:06:17] !log killed root rsyncs on thorium, disabled puppet
[10:06:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:06:35] disk usage magically recovered https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=thorium&refresh=1m&panelId=19&fullscreen&orgId=1&from=now-1h&to=now
[10:06:45] also iowait
[10:08:25] interesting, now /usr/local/bin/hardsync -t /srv /srv/published-datasets-rsynced/* /srv/analytics.wikimedia.org/datasets completes in no time
[10:09:01] elukey: your theory goes with ottomata's one --> T174756
[10:09:01] T174756: rsync-published-datasets cron should not launch multiple rsync processes - https://phabricator.wikimedia.org/T174756
[10:09:22] elukey: that is very very weird
[10:09:58] ahhh I didn't see that one!
[10:10:17] elukey: Maybe if 2 rsyncs of the same folders happen at the same time, they share resources and mess with each other, leading to both an invalid result and potential disk issues?
[10:11:30] the main issue is that they eat a ton of disk cache and the oom killer prefers to kill live processes to free anonymous memory
[10:13:43] 10Analytics, 10Analytics-Cluster: rsync-published-datasets cron should not launch multiple rsync processes - https://phabricator.wikimedia.org/T174756#3572775 (10elukey) Happened again several times today, I killed the rsync processes and now everything looks good. It seems that when a lot of rsyncs use dis...
[10:13:55] 10Analytics, 10Analytics-Cluster: rsync-published-datasets cron should not launch multiple rsync processes - https://phabricator.wikimedia.org/T174756#3572776 (10elukey) p:05Triage>03High
[10:21:16] disabled puppet + regular rsync on thorium for the moment
[10:25:24] k elukey
[10:25:53] I think ottomata is off today - we'll probably need to wait until next Monday I assume
[10:30:12] I checked in /proc etc.. and it seems that the rsync is currently getting a ton of json stuff
[10:30:24] so I'd leave it running today and worst case this weekend
[10:30:42] but with new rsyncs disabled in the crontab
[10:30:48] so the host will not die (hopefully)
[10:30:49] when you say a ton of json stuff, can you tell me where I can look at that?
[10:31:42] with lsof I can see
[10:31:42] elukey: --^
[10:31:43] /srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/enwiki_20170420_reverts.json
[10:32:02] /srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/.enwiki_20170420_reverts.json
[10:32:14] ah sorry the second is the temp file
[10:32:35] file is now 24G
[10:33:09] ah and now there is another rsync for the same file
[10:33:12] lemme kill it
[10:34:01] elukey: those files are recent - created yesterday, around the time the rsync issues started
[10:34:19] bingo then
[10:34:22] elukey: One of those files (the one you pointed at) is actually 82G
[10:35:05] Most files are okish (1+G, compressed), but that specific one is huge
[10:39:31] 10Analytics, 10Analytics-Cluster: rsync-published-datasets cron should not launch multiple rsync processes - https://phabricator.wikimedia.org/T174756#3572830 (10elukey) There is currently one huge rsync job running, I'll leave it going until it finishes before restarting puppet: ``` elukey@thorium:~$ sudo ls...
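The `lsof` check above (seeing which file an rsync is currently writing) can also be approximated by walking /proc directly; a rough sketch, Linux-only and usually requiring root to inspect other users' processes:

```
#!/usr/bin/env python3
# List regular files currently held open by rsync processes, roughly what
# `lsof -c rsync` shows, by reading /proc (Linux only, usually needs root).
import os

def open_files_of(cmd_name="rsync"):
    results = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/comm" % pid) as f:
                if f.read().strip() != cmd_name:
                    continue
            fd_dir = "/proc/%s/fd" % pid
            for fd in os.listdir(fd_dir):
                target = os.readlink(os.path.join(fd_dir, fd))
                if target.startswith("/") and os.path.isfile(target):
                    results.append((pid, target, os.path.getsize(target)))
        except OSError:
            continue  # process exited mid-scan, or permission denied
    return results

if __name__ == "__main__":
    for pid, path, size in open_files_of():
        print("%s  %6.1f GB  %s" % (pid, size / 1e9, path))
```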
[10:40:44] updated the task --^
[10:40:44] ping halfak when you come online :)
[11:07:17] taking an early break today - will be back for meetings
[11:21:03] * elukey lunch + errand
[12:13:20] so for some reason I can still see multiple rsyncs
[12:20:50] stopped rsync, then manually executed the hardsync cron
[12:28:02] ok so that enwiki_20170420_reverts.json seems to be the issue
[12:51:02] ahhhhhhh but the issue is the /usr/local/bin/published-datasets-sync on stat1006
[12:51:05] sigh
[12:51:30] ok now I got it
[12:54:13] 10Analytics, 10Analytics-Cluster: rsync-published-datasets cron should not launch multiple rsync processes - https://phabricator.wikimedia.org/T174756#3573011 (10elukey) My understanding was wrong: I thought that thorium was rsyncing from stat, while it is the other way around (stat1006 -> thorium). Maybe dis...
[13:09:50] next step is to try to increase the timeout, which afaict is only 1000s
[13:09:54] that is ~17 min
[13:10:17] current theory is that the huge 84G file does not get copied in time
[13:35:37] elukey: I'll meet with halfak in half an hour - Will ask him if that huge file is needed (I suspect the small bz2 ones are the same)
[13:36:00] it is weird since it almost reaches 100G and then dies
[13:36:02] (the rsync)
[13:36:10] so I am trying different timeouts
[13:50:05] o/ joal
[13:51:29] I see some stuff I'd (stupidly) temporarily put in published-datasets causes a problem.
[13:51:33] I've removed it.
[13:51:44] Thanks halfak :)
[13:51:53] You've been faster than me at explaining
[13:52:06] as usual should I say ;)
[13:53:16] elukey: I have bizarre behaviour from pivot - Do you mind restarting it?
[13:53:59] halfak: o/ - what did you remove?
[13:54:07] enwiki_20170420_reverts.json ?
[13:54:17] yup
[13:54:21] 82 GB is gone.
[13:54:29] ah I was doing tests to rsync it
[13:54:37] ok so I guess I can stop :)
[13:54:40] it was likely hitting the timeout
[13:54:41] The biggest file in that dir is now 1.7GB
[13:55:37] Thanks a lot halfak :)
[13:55:50] thanks :)
[13:56:16] 10Analytics, 10Analytics-Cluster: rsync-published-datasets cron should not launch multiple rsync processes - https://phabricator.wikimedia.org/T174756#3573139 (10elukey) The file has been removed by @Halfak so everything should be back to normal.
[13:57:12] pivot restarted
[13:58:53] hm elukey - There is a weird thing
[13:59:16] elukey: Druid contains 2 metrics, but pivot shows only one :(
[13:59:42] mwarf
[14:02:06] joal: --verbose
[14:02:16] elukey: after meeting
[14:02:18] sorry
[14:22:50] 10Analytics-Kanban, 10Research: productionize ClickStream dataset - https://phabricator.wikimedia.org/T158972#3573233 (10JAllemandou)
[14:23:25] 10Analytics-Kanban, 10Research: productionize ClickStream dataset - https://phabricator.wikimedia.org/T158972#3053334 (10JAllemandou) a:03JAllemandou
[14:23:28] 10Analytics: Productionize navigation vectors - https://phabricator.wikimedia.org/T174796#3573235 (10Halfak)
[14:23:58] 10Analytics: Productionize navigation vectors - https://phabricator.wikimedia.org/T174796#3573248 (10Halfak)
[15:43:10] 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, 10ops-eqiad, and 2 others: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3573471 (10elukey) Assuming that the 90% maximum limit of guided_size is not applied in the expert recipe (to be verified), I...
[16:12:13] * elukey afk! byeeeee
[16:12:17] have a good weekend people
[17:02:18] Hi all!
[17:03:35] Funny, stat1005 is incredibly slow for me... even shell usage is tough. Or maybe some network issue? My connectivity to elsewhere on the net seems fine
[17:09:18] Hmmm also, why is it a lake?
[17:42:44] AndyRussG: hi! I think that something is going on with the disk usage https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=stat1005&refresh=1m&panelId=19&fullscreen&orgId=1&from=now-24h&to=now
[17:42:57] elukey: ah hmm K
[17:43:33] iiiinteresting
[17:43:59] In fact it seems to have improved somewhat since I commented ^
[17:44:15] Pls lmk if I should do anything differently as a result!! thx :)
[17:49:36] AndyRussG: there is a python process that is maybe a bit aggressive in reading from the disk
[17:49:47] but atm shell usage seems fine fro me
[17:49:50] *for me
[17:49:54] I'll notify the user
[17:55:26] reniced the process and emailed the user, but atm it should be ok now!
[17:55:32] going afk, but will check laterz :)
[18:30:34] Rerun Workflow webrequest-load-wf-misc-2017-9-1-16 after very weird failure
[18:30:38] !log Rerun Workflow webrequest-load-wf-misc-2017-9-1-16 after very weird failure
[18:30:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:38:30] 10Analytics, 10Analytics-EventLogging: Schema:Popups suddenly stopped logging events in MariaDB, but they are still being sent according to Grafana - https://phabricator.wikimedia.org/T174815#3573902 (10Tbayer)
[18:39:32] 10Analytics, 10Analytics-EventLogging: Schema:Popups suddenly stopped logging events in MariaDB, but they are still being sent according to Grafana - https://phabricator.wikimedia.org/T174815#3573916 (10Tbayer) (Not tagging this with #readers-web-backlog for now because it doesn't seem to be an issue with...
[18:44:26] a-team: ^ this looks quite concerning, any quick thoughts? should we alert the DBAs too?
[19:11:02] Hi HaeB
[19:11:32] HaeB: it stopped being inserted two days ago ?
[19:26:22] joal: yes, that's what seems to have happened
[19:26:42] HaeB: Currently checking logs - everything seems normal, will continue to investigate a bit
[19:34:20] HaeB: Problem is Mysql-Replication (as usual) - Data is inserted correctly on master (I just double-checked that)
[19:34:42] i see - i also just checked on s1-analytics-slave..
[19:35:00] ...it dropped there too, but 23 minutes later (6:59pm)
[19:35:03] HaeB: Since no data is lost and I'm no expert in that dedicated-replication-process, I suggest we wait for ottomata on monday
[19:35:13] If that's ok for you
[19:35:24] I'll document that in the task
[19:35:31] yes, that's fine - it's supposed to run for two weeks anyway
[19:35:41] k
[19:37:05] 10Analytics, 10Analytics-EventLogging: Schema:Popups suddenly stopped logging events in MariaDB, but they are still being sent according to Grafana - https://phabricator.wikimedia.org/T174815#3574066 (10JAllemandou) Investigated that tonight: - No error on grafana nor in the logs - Data is inserted on master...
[19:37:09] HaeB: --^
[19:37:23] cool, thanks for checking !
[19:37:28] No prob :)
[19:38:03] 10Analytics, 10Analytics-EventLogging: Schema:Popups suddenly stopped logging events in MariaDB, but they are still being sent according to Grafana - https://phabricator.wikimedia.org/T174815#3574068 (10Tbayer) Thanks! PS, BTW: The above queries were run on analytics-store. The copy on s1-analytics-slave repli...
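A sketch of the kind of master-versus-replica comparison described above (data present on the EventLogging master but not reaching the analytics replicas): compare the latest event timestamp for the schema's table on each host. The hostnames, credentials file and the `Popups_<revision>` table name are placeholders, not the actual production values or the exact queries that were run:

```
#!/usr/bin/env python3
# Compare the most recent event timestamp for an EventLogging schema table
# on the master and on a replica, to spot a replication gap.
# Hostnames and the table name below are placeholders.
import os
import pymysql

HOSTS = {
    "master": "eventlogging-master.example.org",
    "replica": "analytics-store.example.org",
}
TABLE = "Popups_12345678"  # hypothetical schema revision

def latest_timestamp(host):
    conn = pymysql.connect(host=host, db="log",
                           read_default_file=os.path.expanduser("~/.my.cnf"))
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT MAX(timestamp) FROM {}".format(TABLE))
            return cur.fetchone()[0]
    finally:
        conn.close()

if __name__ == "__main__":
    for role, host in HOSTS.items():
        print(role, latest_timestamp(host))
```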
[21:43:48] 10Analytics, 10Contributors-Analysis: Bring the Editor Engagement Dashboard back - https://phabricator.wikimedia.org/T166877#3574358 (10Neil_P._Quinn_WMF) @Halibutt, let me see if I understand this request. You're involved in discussions about the visual editor and the source editor, and which editors should...
[22:06:21] 10Analytics-Kanban, 10Research: productionize ClickStream dataset - https://phabricator.wikimedia.org/T158972#3574439 (10JAllemandou) Update on this to bring @Shilad up to speed: - Original code from Ellery; Mostly in https://github.com/ewulczyn/wiki-clickstream, some sqooping and hive cleaning in https://git...