[00:18:07] <icinga-wm>	 PROBLEM - Hue Server on thorium is CRITICAL: PROCS CRITICAL: 0 processes with command name python2.7, args /usr/lib/hue/build/env/bin/hue
[00:27:26] <icinga-wm>	 PROBLEM - Hue Server on thorium is CRITICAL: PROCS CRITICAL: 0 processes with command name python2.7, args /usr/lib/hue/build/env/bin/hue
[01:18:57] <icinga-wm>	 PROBLEM - Hue Server on thorium is CRITICAL: PROCS CRITICAL: 0 processes with command name python2.7, args /usr/lib/hue/build/env/bin/hue
[01:19:34] <ottomata>	 hmm
[01:29:07] <icinga-wm>	 RECOVERY - Hue Server on thorium is OK: PROCS OK: 1 process with command name python2.7, args /usr/lib/hue/build/env/bin/hue
[01:31:37] <ottomata>	 !log restarted hue (a few minutes ago) not totally sure why it died
[01:31:39] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[02:34:17] <icinga-wm>	 PROBLEM - Hue Server on thorium is CRITICAL: PROCS CRITICAL: 0 processes with command name python2.7, args /usr/lib/hue/build/env/bin/hue
[02:45:00] <wikibugs_>	 10Analytics, 10Analytics-Cluster: rsync-published-datasets cron should not launch multiple rsync processes - https://phabricator.wikimedia.org/T174756#3572217 (10Ottomata)
[07:12:34] <elukey>	 Hue down again?
[07:15:36] <icinga-wm>	 RECOVERY - Hue Server on thorium is OK: PROCS OK: 1 process with command name python2.7, args /usr/lib/hue/build/env/bin/hue
[07:19:19] <elukey>	 ah all the analytics websites are down
[07:21:55] <elukey>	 ok apache was down, just restarted
[07:27:59] <elukey>	 so something weird started yesterday at around 20:30 UTC - https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=thorium&refresh=1m&panelId=19&fullscreen&orgId=1&from=now-2d&to=now
[07:28:42] <elukey>	 I can see <demon@tin>rebuilt wikiversions.php and synchronized wikiversions files: (no justification provided) from the SAL
[07:28:55] <elukey>	 can this trigger a huge amount of rsyncs?
[07:31:20] <wikibugs_>	 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Set up LVS and VirtualHost for RunSingleJob.php - https://phabricator.wikimedia.org/T174599#3572471 (10Joe)
[07:47:20] <elukey>	 yep it seems a lot of rsyncs
[07:52:34] <elukey>	 there is some iowait still ongoing but the situation seems stable for the moment
[07:52:39] <elukey>	 let's see how it goes
[08:41:30] <joal>	 Hi elukey 
[08:41:39] <joal>	 Thanks for investigating that thing
[08:41:47] <joal>	 I wouldn't even have known where to look
[08:43:48] <elukey>	 joal: Hi!
[08:43:53] <elukey>	 first draft of https://wikitech.wikimedia.org/wiki/Incident_documentation/20170831-Zookeeper out :(
[08:46:35] <joal>	 elukey: Just read it - Very clear, precise and concise - Good job !
[08:52:51] <elukey>	 thanks!
[09:17:51] <elukey>	 the thing that scares me is that if the outage happened in eqiad we'd have seen a bigger impact in Hadoop etc..
[09:18:25] <joal>	 elukey: Hadoop doesn't rely on Zookeeper IIRC
[09:20:21] <elukey>	 the master nodes do
[09:20:51] <joal>	 Hm, Namenodes do for HA only I think, no ?
[09:22:30] <elukey>	 yep yep
[09:22:43] <elukey>	 not sure if they also store other things
[09:22:52] <elukey>	 and I have no idea what happens if zk is down
[09:22:54] <joal>	 so, i probably would have caused trouble for syncing nodes, but service would have continued?
[09:23:02] <elukey>	 I hope so :D
[09:23:02] <joal>	 That's an interesting question :)
[09:23:14] <elukey>	 Kafka impressed me since eventbus didn't register any weirdness
[09:23:52] <elukey>	 but the last time that I accidentally restarted the journalnodes without waiting too much between each other the master node decided to stop
[09:23:55] <joal>	 kafka don't need ZK anymore IIRC, only consumers (for syncing group index)
[09:24:05] <elukey>	 so I don't assume anything anymore :D
[09:24:11] <joal>	 huhu
[09:24:28] <elukey>	 so consumers/producers do not need zookeeper now, but Kafka does for topic metadata
[09:25:09] <elukey>	 and other things that I don't remember
[09:25:20] <elukey>	 for example I was unable to execute kafka topics --describe
[09:25:25] <elukey>	 when zk was down
[09:25:39] <elukey>	 aaaand joal, I know that you will not like it but, all the Kafka ACLs will be store in Zookeeper
[09:25:51] <joal>	 elukey: I don't mind it :)
[09:25:56] <elukey>	 this is what I discovered while playing with it
[09:26:07] <elukey>	 so if we enable them and zk is down, kafka probably kaput
[09:26:14] <joal>	 elukey: it just means there'll be tighter deps between kafka and zk :)
[09:26:19] <elukey>	 hahahaha
[09:26:22] <joal>	 no prob for that :)
[09:26:45] * elukey reads among Joseph lines that he is too paranoid
[09:27:14] <wikibugs_>	 10Analytics, 10Operations, 10Traffic, 10Patch-For-Review: Implement Varnish-level rough ratelimiting - https://phabricator.wikimedia.org/T163233#3572696 (10ema) 05Open>03Resolved a:03ema We've been using `vsthrottle` in prod for a while now, closing.
[09:27:35] <joal>	 elukey: I have this idea of drawing a high-level architecture plan of our systems, showing dependencies and there types (functional, internal, critical or not)
[09:28:21] <elukey>	 like the idea, might be tough to keep it in sync but definitely useful
[09:51:02] <elukey>	 joal: atm we have some issues on thorium, I think that some rsyncs are causing troubles to various services
[09:51:31] <joal>	 elukey: I've seen the alerts and your message earlier - it's really rsync that causes the problem?
[09:53:44] <joal>	 elukey: How can I try to help?
[09:54:37] <elukey>	 not sure, do you know if something like "rebuilt wikiversions.php and synchronized wikiversions files: (no justification provided)" could cause a big amount of files to get re-created?
[09:55:03] <joal>	 elukey: I don't even know what those files do :(
[09:56:12] <elukey>	 all right, I'll dig into the rsyncs :)
[09:57:31] <elukey>	 wow IO wait spikes are massive https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=thorium&refresh=1m&orgId=1
[09:58:44] <elukey>	 doesn't make any sense, why the oom is killing
[10:03:22] <elukey>	 so I killed the rsyncs, and I have a theory
[10:03:43] <elukey>	 root do every 15 mins /usr/local/bin/hardsync -t /srv /srv/published-datasets-rsynced/* /srv/analytics.wikimedia.org/datasets 2>&1 > /dev/nul
[10:04:22] <elukey>	 and I see spikes every 15 mins in io load
[10:04:53] <elukey>	 so maybe when the second rsyncs kicks in after 15 mins, the first one is still running because of huge files
[10:05:05] <elukey>	 adding itself to the mess
[10:05:51] <elukey>	 files are in disk cache and the oom decides to kill something else when it needs moar space for new files
[10:05:57] <elukey>	 like pivot, hue, etc..
[10:06:04] <elukey>	 not sure if it makes sense
[10:06:17] <elukey>	 !log killed root rsyncs on thorium, disabled puppet
[10:06:19] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:06:35] <elukey>	 disk usage magically recovered https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=thorium&refresh=1m&panelId=19&fullscreen&orgId=1&from=now-1h&to=now
[10:06:45] <elukey>	 also iowait
[10:08:25] <elukey>	 interesting, now /usr/local/bin/hardsync -t /srv /srv/published-datasets-rsynced/* /srv/analytics.wikimedia.org/datasets completes in no time
[10:09:01] <joal>	 elukey: you theory goes with ottomata's one --> T174756
[10:09:01] <stashbot>	 T174756: rsync-published-datasets cron should not launch multiple rsync processes - https://phabricator.wikimedia.org/T174756
[10:09:22] <joal>	 elukey: that is very very weird
[10:09:58] <elukey>	 ahhh I didn't see that one!
[10:10:17] <joal>	 elukey: Maybe if 2 rsyncs of the same folders happen at the same time, they share resources and mess up with each other, leading to a both an invalid result and potential disk issues?
[10:11:30] <elukey>	 the main issue is that they eat a ton of disk cache and the oom killer prefers to kill live processes to free anonymous memory
[10:13:43] <wikibugs_>	 10Analytics, 10Analytics-Cluster: rsync-published-datasets cron should not launch multiple rsync processes - https://phabricator.wikimedia.org/T174756#3572775 (10elukey) Re-happened again several times today, I killed the rsync processes and now everything looks good. It seems that when a lot of rsyncs use dis...
[10:13:55] <wikibugs_>	 10Analytics, 10Analytics-Cluster: rsync-published-datasets cron should not launch multiple rsync processes - https://phabricator.wikimedia.org/T174756#3572776 (10elukey) p:05Triage>03High
[10:21:16] <elukey>	 disabled puppet + regular rsync on thorium for the moment
[10:25:24] <joal>	 k elukey 
[10:25:53] <joal>	 I think ottomata is off today - we'll probably need to wait next monday I assume
[10:30:12] <elukey>	 I checked in /proc/etc.. and it seems that the rsync is currently getting a ton of json stuff
[10:30:24] <elukey>	 so I'd leave it running today and worst case this weekend 
[10:30:42] <elukey>	 but with new rsyncs disabled in the crontab
[10:30:48] <elukey>	 so the host will not die (hopefully)
[10:30:49] <joal>	 when you say a ton of json stuff, can you tell where I can look at that?
[10:31:42] <elukey>	 with lsof I can see
[10:31:42] <joal>	 elukey: --^
[10:31:43] <elukey>	 /srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/enwiki_20170420_reverts.json
[10:32:02] <elukey>	 /srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/.enwiki_20170420_reverts.json
[10:32:14] <elukey>	 ah sorry the second is the temp file
[10:32:35] <elukey>	 file is now 24G
[10:33:09] <elukey>	 ah and now there is another rsync for the same file
[10:33:12] <elukey>	 lemme kill it
[10:34:01] <joal>	 elukey: those files are recent - created yesterday, around the time the rsync issues start
[10:34:19] <elukey>	 bingo then
[10:34:22] <joal>	 elukey: One of those file (you one you pointed) is actually 82G
[10:35:05] <joal>	 Most files are okish (1+G, compressed), but that specific one huge
[10:39:31] <wikibugs_>	 10Analytics, 10Analytics-Cluster: rsync-published-datasets cron should not launch multiple rsync processes - https://phabricator.wikimedia.org/T174756#3572830 (10elukey) There is currently one huge rsync job running, I'll leave it going until it finishes before restarting puppet:  ``` elukey@thorium:~$ sudo ls...
[10:40:44] <elukey>	 updated the task --^
[10:40:44] <joal>	 ping halfak when you come online  :)
[11:07:17] <joal>	 taking an early break today - will be back for meetings
[11:21:03] * elukey lunch + errand
[12:13:20] <elukey>	 so for some reason I can still see multiple rsyncs
[12:20:50] <elukey>	 stopped rsync, started and executed manually the hardsync cron
[12:28:02] <elukey>	 ok so that enwiki_20170420_reverts.json seems to be the issue
[12:51:02] <elukey>	 ahhhhhhh but the issue is  the /usr/local/bin/published-datasets-sync on stat1006
[12:51:05] <elukey>	 sigh
[12:51:30] <elukey>	 ok now I got it
[12:54:13] <wikibugs_>	 10Analytics, 10Analytics-Cluster: rsync-published-datasets cron should not launch multiple rsync processes - https://phabricator.wikimedia.org/T174756#3573011 (10elukey) My understanding was wrong, I thought that thorium was rsyncing from stat, meanwhile is the other way around (stat1006 -> thorium). Maybe dis...
[13:09:50] <elukey>	 next step is try to increase the timeout, that afaict is only 1000s
[13:09:54] <elukey>	 that is ~15 min
[13:10:17] <elukey>	 current theory is that the huge 84G file does not get copied in time
[13:35:37] <joal>	 elukey: I'll meet with halfak in half an hour - Will ask him if that huge file is needed (I suspect the small bz2 ones are the same)
[13:36:00] <elukey>	 it is weird since it almost reach the 100G and then die
[13:36:02] <elukey>	 (the rsync)
[13:36:10] <elukey>	 so I am trying different timeouts
[13:50:05] <halfak>	 o/ joal
[13:51:29] <halfak>	 I see some stuff I'd (stupidly) temporarily put in published-datasets causes a problem. 
[13:51:33] <halfak>	 I've removed it. 
[13:51:44] <joal>	 Thanks halfak :)
[13:51:53] <joal>	 You've been faster than me explaining
[13:52:06] <joal>	 as usual should I say ;)
[13:53:16] <joal>	 elukey: I have bizaree behaviour from pivot - Do you mind restarting it ?
[13:53:59] <elukey>	 halfak: o/ - what did you remove?
[13:54:07] <elukey>	 enwiki_20170420_reverts.json ?
[13:54:17] <halfak>	 yup
[13:54:21] <halfak>	 82 GB is gone. 
[13:54:29] <elukey>	 ah I was doing tests to rsync
[13:54:37] <elukey>	 ok so I guess I can stop :)
[13:54:40] <elukey>	 it was likely hitting the timeout
[13:54:41] <halfak>	 The biggest file in that dir is now 1.7GB 
[13:55:37] <joal>	 Thanks a lot halfak :)
[13:55:50] <elukey>	 thanks :)
[13:56:16] <wikibugs_>	 10Analytics, 10Analytics-Cluster: rsync-published-datasets cron should not launch multiple rsync processes - https://phabricator.wikimedia.org/T174756#3573139 (10elukey) The file has been removed by @Halfak so everything should be back to normal.
[13:57:12] <elukey>	 pivot restarted
[13:58:53] <joal>	 hm elukey - There is a weird thing
[13:59:16] <joal>	 elukey: Druid contains 2 metrics, but pivot shows only one :(
[13:59:42] <joal>	 mwarf
[14:02:06] <elukey>	 joal: --verbose
[14:02:16] <joal>	 elukey: after meeting
[14:02:18] <joal>	 sorry
[14:22:50] <wikibugs_>	 10Analytics-Kanban, 10Research: productionize ClickStream dataset - https://phabricator.wikimedia.org/T158972#3573233 (10JAllemandou)
[14:23:25] <wikibugs_>	 10Analytics-Kanban, 10Research: productionize ClickStream dataset - https://phabricator.wikimedia.org/T158972#3053334 (10JAllemandou) a:03JAllemandou
[14:23:28] <wikibugs_>	 10Analytics: Productionize navigation vectors - https://phabricator.wikimedia.org/T174796#3573235 (10Halfak)
[14:23:58] <wikibugs_>	 10Analytics: Productionize navigation vectors - https://phabricator.wikimedia.org/T174796#3573248 (10Halfak)
[15:43:10] <wikibugs_>	 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, 10ops-eqiad, and 2 others: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3573471 (10elukey) Assuming that the 90% maximum limit of guided_size is not applied in the expert recipe (to be verified), I...
[16:12:13] * elukey afk! byeeeee
[16:12:17] <elukey>	 have a good weekend people
[17:02:18] <AndyRussG>	 Hi all!
[17:03:35] <AndyRussG>	 Funny, stat1005 is incredibly slow for me... shell usage is even tough. Or maybe some network issue? My connectivity to elsewhere on the net seems fine
[17:09:18] <AndyRussG>	 Hmmm also, why is it a lake?
[17:42:44] <elukey>	 AndyRussG: hi! I think that something is going on with the disk usage https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=stat1005&refresh=1m&panelId=19&fullscreen&orgId=1&from=now-24h&to=now
[17:42:57] <AndyRussG>	 elukey: ah hmm K
[17:43:33] <AndyRussG>	 iiiinteresting
[17:43:59] <AndyRussG>	 Infact seems to have improved somewhat since I commented ^
[17:44:15] <AndyRussG>	 Pls lmk if I should do anything differently as a result!! thx :)
[17:49:36] <elukey>	 AndyRussG: there is a python process that is maybe a bit aggressive in reading to the disk
[17:49:47] <elukey>	 but atm shell usage seems fine fro me
[17:49:50] <elukey>	 *for me
[17:49:54] <elukey>	 I'll notify the user
[17:55:26] <elukey>	 reniced the process and emailed the user, but atm it should be ok now!
[17:55:32] <elukey>	 going afk, but will check laterz :)
[18:30:34] <joal>	 Rerun Workflow webrequest-load-wf-misc-2017-9-1-16 after very weird failure
[18:30:38] <joal>	 !log Rerun Workflow webrequest-load-wf-misc-2017-9-1-16 after very weird failure
[18:30:39] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:38:30] <wikibugs_>	 10Analytics, 10Analytics-EventLogging: Schema:Popups suddenly stopped logging events in MariaDB, but they are still being sent according to Grafana - https://phabricator.wikimedia.org/T174815#3573902 (10Tbayer)
[18:39:32] <wikibugs_>	 10Analytics, 10Analytics-EventLogging: Schema:Popups suddenly stopped logging events in MariaDB, but they are still being sent according to Grafana - https://phabricator.wikimedia.org/T174815#3573916 (10Tbayer) (Not tagging this with #readers-web-backlog for now because it doesn't not seem to be an issue with...
[18:44:26] <HaeB>	 a-team: ^ this looks quite concerning, any quick thoughts? should we alert the DBAs too?
[19:11:02] <joal>	 Hi HaeB
[19:11:32] <joal>	 HaeB: it stopped being inserted two days ago ?
[19:26:22] <HaeB>	 joal: yes, that's what seems to have happened
[19:26:42] <joal>	 HaeB: Currently checking logs - everything seems normal, will continue to investigate a bit
[19:34:20] <joal>	 HaeB: Problem is Mysql-Replication (as usual) - Data is inserted corretly on master (I double just checked that)
[19:34:42] <HaeB>	 i see - i also just checked on s1-analytics-slave..
[19:35:00] <HaeB>	 ...it dropped there too, but 23 minutes later (6:59pm)
[19:35:03] <joal>	 HaeB: Since no data is lost and I'm no expert in that dedicated-replication-process, I suggest we wait for ottomata on monday
[19:35:13] <joal>	 If that's ok for you
[19:35:24] <joal>	 I'll document that in the task
[19:35:31] <HaeB>	 yes, that's fine - it's supposed to run for two weeks anyway
[19:35:41] <joal>	 k
[19:37:05] <wikibugs_>	 10Analytics, 10Analytics-EventLogging: Schema:Popups suddenly stopped logging events in MariaDB, but they are still being sent according to Grafana - https://phabricator.wikimedia.org/T174815#3574066 (10JAllemandou) Investigated that tonight:  - No error on grafana nor in the logs  - Data is inserted on master...
[19:37:09] <joal>	 HaeB: --^
[19:37:23] <HaeB>	 cool, thanks for checking !
[19:37:28] <joal>	 No prob :)
[19:38:03] <wikibugs_>	 10Analytics, 10Analytics-EventLogging: Schema:Popups suddenly stopped logging events in MariaDB, but they are still being sent according to Grafana - https://phabricator.wikimedia.org/T174815#3574068 (10Tbayer) Thanks! PS, BTW: The above queries were run on analytics-store. The copy on s1-analytics-slave repli...
[21:43:48] <wikibugs_>	 10Analytics, 10Contributors-Analysis: Bring the Editor Engagement Dashboard back - https://phabricator.wikimedia.org/T166877#3574358 (10Neil_P._Quinn_WMF) @Halibutt, let me see if I understand this request.  You're involved in discussions about the visual editor and the source editor, and which editors should...
[22:06:21] <wikibugs_>	 10Analytics-Kanban, 10Research: productionize ClickStream dataset - https://phabricator.wikimedia.org/T158972#3574439 (10JAllemandou) Update on this to gring @Shilad up to speed:  - Original code from Ellery; Mostly in https://github.com/ewulczyn/wiki-clickstream, some sqooping and hive cleaning in https://git...