[07:21:53] Analytics-Kanban, Operations, ops-eqiad: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2202237 (elukey)
[07:22:18] PROBLEM - Hadoop NodeManager on analytics1046 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args
[07:22:31] PROBLEM - Hadoop NodeManager on analytics1057 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args
[07:22:35] good morning :)
[07:22:55] auto resolved, but this is the third analytics host showing high temp
[07:23:00] updated the phab task
[07:59:09] elukey: Hi !
[08:05:03] joal: helllooo
[08:22:03] PROBLEM - Hadoop NodeManager on analytics1054 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args
[08:24:43] 2016-04-13 08:15:06,466 FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Container Monitor,5,main] threw an Error. Shutting down now...
[08:24:46] java.lang.OutOfMemoryError: GC overhead limit exceeded
[08:24:47] joal --^
[08:26:28] weird elukey :(
[08:27:22] and it wasn't restarted automagically, could be something weird in the upstart config
[08:28:23] mwarf
[08:40:16] elukey: I restarted the failed refine job
[08:41:03] from hue?
[08:41:07] yes
[08:41:14] * elukey finally got one right
[08:41:20] :)
[08:41:48] was it related to the 1054 failure or totally unrelated? (I saw that was coming from 1015)
[08:42:17] elukey: I have not double checked
[08:49:44] elukey: job failed at about the same time as namenode error - could well be related
[08:51:29] super weird, theoretically the yarn daemon should be restarted straight away
[08:52:08] :(
[08:52:41] I mean, "the correct thing would be", meanwhile our config might not be doing it.. I'll open a phab task
[08:52:52] cool elukey
[08:55:01] 10:52 PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:55:06] one minute downtime for aqs
[08:55:12] I mean, aqs1002
[08:55:18] this is the second time that I see the problem
[08:58:23] https://logstash.wikimedia.org/#/dashboard/elasticsearch/restbase --> tons of errors like the ones we were seeing :(
[09:02:19] ah joal now I understand what you were saying
[09:02:20] READ messages were dropped in last 5000 ms: 5 for internal timeout and 0 for cross node timeout
[09:02:26] from https://logstash.wikimedia.org/#/dashboard/elasticsearch/analytics-cassandra
[09:07:30] elukey: batcave?
[09:09:22] elukey: aqs1002 seems back to normal
[09:10:00] (PS1) Addshore: Fix java package names [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/283136
[09:10:27] (CR) Addshore: [C: 2 V: 2] Fix java package names [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/283136 (owner: Addshore)
[09:11:02] joal: not needed don't worry (but thanks!), also aqs1003 experienced problems
[09:11:20] all related to cassandra timeouts as you were saying, but now I got why and how to find it :)
[09:11:40] elukey: huge peak of requests
[09:12:44] yeah.. :(
[09:13:16] (well 25 req/second is not that huge)
[09:13:19] :P
[09:14:54] elukey: 25reqs/s is big when it spans 2TB of dfata on rotating disks;)
[09:15:09] good point :D
[09:16:38] moar SSDs!
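For context on the AQS endpoint timeouts above: each per-article pageview request covers a date range that Cassandra has to assemble from data spread over rotating disks, which is why even 25 req/s hurts. Below is a minimal probe sketch against the public per-article REST endpoint; the article and date range are illustrative, and the internal host/port the Icinga check actually hits may differ.

    # Sketch: time one per-article pageviews request, Icinga-style 10s timeout.
    import time
    import requests  # assumes python-requests is available

    URL = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
           "en.wikipedia/all-access/all-agents/Albert_Einstein/daily/20160401/20160410")

    start = time.time()
    resp = requests.get(URL, timeout=10)
    print(resp.status_code, "in %.2fs" % (time.time() - start))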
[09:19:09] elukey: indeed, waiting for SSDs eagerly
[09:44:55] (PS1) Addshore: Add pre & post processing to Processor interface [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/283143
[09:45:46] (CR) Addshore: [C: 2 V: 2] Add pre & post processing to Processor interface [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/283143 (owner: Addshore)
[10:04:40] (PS1) Addshore: Use double instead of long in MetricProcessor [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/283146
[10:05:55] joal: analytics 1043 and 1049 showed the same Yarn failure :(
[10:06:08] (PS1) Addshore: New build [analytics/wmde/toolkit-analyzer-build] - https://gerrit.wikimedia.org/r/283148
[10:06:21] elukey: maaaaan
[10:06:21] (CR) Addshore: [C: 2 V: 2] Use double instead of long in MetricProcessor [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/283146 (owner: Addshore)
[10:06:24] This is ops day
[10:06:24] (CR) Addshore: [C: 2 V: 2] New build [analytics/wmde/toolkit-analyzer-build] - https://gerrit.wikimedia.org/r/283148 (owner: Addshore)
[10:09:37] (PS1) Addshore: Fix finding processor classes [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/283149
[10:09:39] Analytics-Kanban, Operations: Out of memory errors causing Yarn nodemanager to shutdown on analytics hosts - https://phabricator.wikimedia.org/T132559#2202590 (elukey)
[10:10:11] opened a phab task --^
[10:10:14] (PS1) Addshore: Fix finding processor classes [analytics/wmde/toolkit-analyzer-build] - https://gerrit.wikimedia.org/r/283150
[10:10:27] (CR) Addshore: [C: 2 V: 2] Fix finding processor classes [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/283149 (owner: Addshore)
[10:10:30] (CR) Addshore: [C: 2 V: 2] Fix finding processor classes [analytics/wmde/toolkit-analyzer-build] - https://gerrit.wikimedia.org/r/283150 (owner: Addshore)
[10:13:21] aaaaannnnd we have a correspondent oozie error :)
[10:14:17] Analytics-Kanban, Operations: Out of memory errors causing Yarn nodemanager to shutdown on analytics hosts - https://phabricator.wikimedia.org/T132559#2202603 (elukey)
[10:17:16] Analytics, Operations, Traffic, Patch-For-Review, Varnish: varnishstatsd crashes with ValueError in vsl_callback without being restarted by systemd - https://phabricator.wikimedia.org/T132430#2202604 (ema) One more: Feb 11 09:17:47 cp4010 varnishstatsd[2820]: Traceback (most recent call la...
[10:24:10] Analytics-Kanban: Puppet on stat1003 keeps failing for git errors - https://phabricator.wikimedia.org/T132445#2202621 (elukey) p:Unbreak!>Normal
[10:24:47] Analytics, Operations: kafkatee cronspam from oxygen - https://phabricator.wikimedia.org/T132322#2202623 (elukey) p:Triage>Low
[10:27:34] Analytics-Kanban, Operations, Traffic, Patch-For-Review: varnishkafka logrotate cronspam - https://phabricator.wikimedia.org/T129344#2102820 (elukey) Installed on maps hosts by @ema, we will rollout the new version everywhere along wiht the Varnish 4 upgrade.
[10:42:06] Analytics, Operations, Traffic, Patch-For-Review, Varnish: varnishstatsd crashes with ValueError in vsl_callback without being restarted by systemd - https://phabricator.wikimedia.org/T132430#2202664 (ema) And this one: Mar 18 12:41:35 cp4010 varnishstatsd[10396]: Traceback (most recent ca...
[11:16:58] 13:16 PROBLEM - Hadoop NodeManager on analytics1053 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args
[11:17:04] this starts to be a big problem
[11:17:25] raaaaah
[11:18:55] elukey: any idea from where it could come from ?
[11:20:09] joal: didn't get the time to investigate, I'll do it right after lunch.. I'll probably need to have a chat with you about how to track it down
[11:20:32] elukey: I actually have no idea currently
[11:20:39] Let's talk when you get back
[11:21:29] So the nodemanager is running with -Xmx1000m
[11:22:01] $maybe something more is needed for $reason
[11:22:16] rmh
[11:23:45] * elukey imagines Joseph's face mumbling
[11:26:19] :)
[11:53:07] (PS1) Addshore: Real CLI option parsing [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/283154
[11:55:11] (PS2) Addshore: Real CLI option parsing [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/283154
[12:05:12] (PS1) Addshore: New CLI arg processing [analytics/wmde/toolkit-analyzer-build] - https://gerrit.wikimedia.org/r/283157
[12:19:44] * elukey loves http://animagraffs.com/
[12:34:45] * joal thanks elukey for such a good link !
[12:46:14] Analytics-Tech-community-metrics: Tech community KPIs for the WMF metrics meeting - https://phabricator.wikimedia.org/T107562#2202939 (Aklapper)
[12:46:17] Analytics-Tech-community-metrics, Developer-Relations (Apr-Jun-2016), developer-notice: Check whether it is true that we have lost 40% of (Git) code contributors in the past 12 months - https://phabricator.wikimedia.org/T103292#2202937 (Aklapper) Open>Resolved Please continue discussing bring...
[12:46:39] Analytics-Tech-community-metrics, Developer-Relations (Apr-Jun-2016), developer-notice: Check whether it is true that we have lost 40% of (Git) code contributors in the past 12 months - https://phabricator.wikimedia.org/T103292#2202953 (Aklapper)
[12:47:46] Analytics-Tech-community-metrics, Developer-Relations (Apr-Jun-2016): Identify Wikimedia's most important/used info panels in korma.wmflabs.org - https://phabricator.wikimedia.org/T132421#2202956 (Aklapper) p:Triage>High
[12:50:03] PROBLEM - Hadoop NodeManager on analytics1044 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args
[12:59:28] so it might be related to jobs causing OOM errors, but not sure why the namenode itself shutsdown
[13:08:45] elukey: jobs failing should not impact nodemanager
[13:12:03] Analytics-Kanban, Operations: Out of memory errors causing Yarn nodemanager to shutdown on analytics hosts - https://phabricator.wikimedia.org/T132559#2203044 (elukey) From https://grafana.wikimedia.org/dashboard/db/server-board I don't see major memory problems at host level, a lot of GBs are simply cac...
[13:12:57] joal: yeah.. the weird thing is that from the logs the nodemanager shutsdown right after receiving the OOM, not sure if from a mapper or caused by itself
[13:13:26] elukey: my guess is that it is an internal error to the nodemanager
[13:22:05] elukey: looking at ganglia, it seems the cluster have been overloaded between 11:45 and 12:00
[13:22:09] UTC
[13:25:05] GOOD MORNING
[13:25:30] Hi ottomata !
[13:25:46] joal: https://phabricator.wikimedia.org/T102954 :P
[13:25:49] hello ottomata !!
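A note on the "-Xmx1000m" observation above: one quick way to confirm what heap the NodeManager on a worker is actually running with is to pull the -Xmx flag out of the live java process. A rough sketch, to be run locally on a worker node; the process matching is approximate and not the exact command elukey used.

    # Sketch: report the -Xmx flag of the running YARN NodeManager process.
    import re
    import subprocess

    ps = subprocess.check_output(["ps", "-eo", "args"]).decode("utf-8", "replace")
    for line in ps.splitlines():
        if "org.apache.hadoop.yarn.server.nodemanager.NodeManager" in line:
            m = re.search(r"-Xmx(\S+)", line)
            print("NodeManager heap:", m.group(1) if m else "no -Xmx flag found")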
[13:26:16] Analytics-Kanban, Operations: Out of memory errors causing Yarn nodemanager to shutdown on analytics hosts - https://phabricator.wikimedia.org/T132559#2203092 (elukey) Another ticket was opened for the same thing https://phabricator.wikimedia.org/T102954
[13:26:50] Analytics-Kanban, Operations: Out of memory errors causing Yarn nodemanager to shutdown on analytics hosts - https://phabricator.wikimedia.org/T132559#2203094 (elukey) Open>Resolved p:Triage>Normal
[13:27:23] Analytics-Cluster: Hadoop Yarn Nodemanagers occasionally die with OOM / GC errors - https://phabricator.wikimedia.org/T102954#1378495 (elukey) Opened https://phabricator.wikimedia.org/T132559 before checking phab, today multiple hosts showed this behavior and I had to restart Yarn manually
[13:28:13] Analytics-Tech-community-metrics, Developer-Relations (Apr-Jun-2016), developer-notice: Check whether it is true that we have lost 40% of (Git) code contributors in the past 12 months - https://phabricator.wikimedia.org/T103292#2203101 (Qgil)
[13:32:02] Analytics-Kanban, Operations, ops-eqiad: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2203112 (elukey) ``` elukey@neodymium:~$ sudo -i salt -t 120 analytics10* cmd.run 'grep "Hardware event" /var/log/mcelog | uniq -c' analytics1041.eqiad.wmnet: analytics1...
[13:33:02] Analytics-Cluster, Analytics-Kanban: Hadoop Yarn Nodemanagers occasionally die with OOM / GC errors - https://phabricator.wikimedia.org/T102954#2203113 (elukey) p:Triage>High
[13:34:33] Analytics-Cluster, Analytics-Kanban: Hadoop Yarn Nodemanagers occasionally die with OOM / GC errors - https://phabricator.wikimedia.org/T102954#2203116 (Ottomata) This clearly doesn't happen often. In lieu of not knowing what causes this, it wouldn't hurt to increase NodeManager memory a little, eh?
[13:38:01] hm, elukey just saw another one of those, eh?
[13:40:59] (CR) Addshore: [C: 2 V: 2] Real CLI option parsing [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/283154 (owner: Addshore)
[13:41:02] (CR) Addshore: [C: 2 V: 2] New CLI arg processing [analytics/wmde/toolkit-analyzer-build] - https://gerrit.wikimedia.org/r/283157 (owner: Addshore)
[13:41:46] ottomata: the weird thing is that (as you were noticing in your task long time ago) I can't figure out where the OOM comes from and why the node manager shuts down
[13:42:47] hm logs on 1056 maybe look al ittle different?
[13:42:54] or, maybe i just didn't look at them all on the other nodes well
[13:43:05] The directory item limit of /var/log/hadoop-yarn/apps/hdfs/logs is exceeded: limit=1048576 items=1048576
[13:43:26] that's > 10 mins before the alert
[13:44:30] mmmmmm
[13:45:14] shall we increase the -Xmx option for the yarn namenode to patch the problem?
[13:45:15] elukey: i'm thinking there must be some particular job behavior that causes this
[13:45:25] i'm suspicious of this
[13:45:26] http://localhost:8088/cluster/app/application_1458129362008_86444
[13:45:35] since it is seems to be mentioned near the problem in teh logs...
[13:45:43] probably.. but I wouldn't have expected the nodemanager to go down
[13:45:44] but, that could just be an artifact
[13:45:52] ja me neither, but maybe some weird bug
[13:45:57] ottomata: yurik has a hell lot of jobs lately
[13:46:15] ja i'd like to know what this is
[13:46:24] ottomata: another thing to mention about logs: we log yarn at debug level!!!
[13:46:24] 1009 reduces
[13:46:34] oh?
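The "directory item limit ... exceeded" message above is the HDFS NameNode's per-directory child limit (dfs.namenode.fs-limits.max-directory-items, whose default is the 1048576 shown). A quick way to see how close the aggregated YARN log directory is to that limit is sketched below; the path comes from the error message itself and listing a directory this large is slow, so this is only a one-off check.

    # Sketch: count direct children of the aggregated YARN log dir in HDFS
    # and compare against the NameNode's max-directory-items limit.
    import subprocess

    LOG_DIR = "/var/log/hadoop-yarn/apps/hdfs/logs"  # path from the error message
    LIMIT = 1048576                                  # dfs.namenode.fs-limits.max-directory-items

    out = subprocess.check_output(["hdfs", "dfs", "-ls", LOG_DIR]).decode()
    # 'hdfs dfs -ls' prints a "Found N items" header, then one line per child.
    items = max(len(out.splitlines()) - 1, 0)
    print("%d / %d items (%.1f%% of limit)" % (items, LIMIT, 100.0 * items / LIMIT))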
[13:46:40] I see INFO
[13:46:41] ottomata: 6 concurrent jobs (~2500 maps and 1009 reduces)
[13:46:47] ah
[13:47:16] ottomata: Ah... When doing cassandra testing, couldn't get rid of debug
[13:48:27] joal, elukey: what did we decide regarding an upgrade to Cassandra 2.2.5? or maybe, what's next there?
[13:49:11] urandom: Arrg, forgot to ask the wider team :S
[13:49:18] urandom: I deeply apologize
[13:49:30] joal: no worries :)
[13:49:40] urandom: I'm making a note and you'll have an answer later on today, after our standup
[13:49:45] Analytics-Cluster, Analytics-Kanban: Hadoop Yarn Nodemanagers occasionally die with OOM / GC errors - https://phabricator.wikimedia.org/T102954#2203156 (Ottomata) Seeing some interesting logs on an56 which had a NodeManager just die. Not sure if related: ``` Caused by: org.apache.hadoop.ipc.RemoteExce...
[13:49:54] auh, then my timing is excellent then.
[13:49:59] joal: oh you mean for the yarn job logs itself?
[13:50:03] the ones that are stored in hdfs after job completion?
[13:50:11] ottomata: correct sir
[13:50:11] not the yarn daemon process logs
[13:50:13] aye
[13:50:14] hm ok
[13:51:16] urandom: atm we are seeing tons of errors from cassandra timeouts (causing also problems to aqs restbase) and I'd prefer to get it resolved (waiting for SDDs) before proceeding with an important upgrade. This is of course something said from somebody with little context in AQS, so not sure if it makes sense :)
[13:52:33] elukey: very good point
[13:52:39] elukey: yea let's up the yarn heapsize
[13:53:01] elukey: ok, i understand
[13:53:03] the way it is puppetized, we'll increase the resourcemanager heap size too, but that won't hurt at all
[13:53:15] plenty of mem on 01 and 02 for that
[13:53:20] ->2048?
[13:53:33] elukey: do you have an eta on that? are you replacing the nodes, or replacing the disks in the existing nodes?
[13:54:28] ottomata: ack! 2048 looks good
[13:54:51] joal: do you have more info about the SSDs? I know we discussed it but I don't recall the ETA :(
[13:55:16] elukey, urandom : ottomata is the one knowing most here
[13:56:41] https://phabricator.wikimedia.org/T124947
[13:56:58] https://phabricator.wikimedia.org/T132067 (not sure if you can see that one)
[13:57:38] I think they are ordered...
[13:57:42] ya
[13:57:46] we are replacing the nodes with SSDs
[13:57:52] ETA has been asked 8 days ago ...
[13:58:01] yeah :(
[13:58:06] joal i think that q is stale, see the linked tickets
[13:58:07] can we get an eta on the eta?
[13:58:15] * urandom jokes
[13:58:20] bumping
[13:58:20] :D
[13:58:34] Thanks a lot ottomata (I don't have access to the second ticket)
[13:58:44] Analytics, Operations, hardware-requests, Patch-For-Review: eqiad: (3) AQS replacement nodes - https://phabricator.wikimedia.org/T124947#2203165 (Ottomata) @Robh, over in T132067 it looks like these nodes were ordered, is this correct? If so, any idea on ETA?
[13:58:46] are you planning to first bootstrap the new nodes, and then decommission the existing ones?
[13:59:19] hey joal elukey, since we've missed it for a couple of weeks, want to do analytics ops sync up today?
[13:59:24] maybe 30 mins before standup?
[13:59:26] urandom: TBD with elukey and ottomata, my guess would be to replace old by new one by one
[13:59:33] ottomata: sure !
[13:59:38] ottomata: +1
[13:59:43] joal: it would be better if you could first bootstrap in the new ones
[13:59:59] urandom: bootstrap = copy data?
[14:00:17] bootstrap == join a node to the cluster (and copy over data)
[14:00:28] urandom: I would have left cassandra deal with data replication, but it might not be best
[14:00:40] urandom: Makes sense
[14:00:41] joal: not sure what you mean
[14:01:24] urandom: add the node to the cluster as a new node (4 nodes in cluster, possible new virtual DC to facilitate copy), copy dat using replication, then replace
[14:01:39] and not remove/replace and assume copy would work
[14:01:51] makes sense (to me) urandom
[14:01:59] i would stand up the new nodes, and bootstrap, bootstrap, bootstrap
[14:02:08] urandom: ok
[14:02:09] then decommission, decommission, decommission
[14:02:50] urandom is a sequential island in today's async world
[14:02:50] you'll get one stream for each source/dest pair, and often the throughput of that stream is cpu-bound on a single thread
[14:02:50] :D
[14:03:02] i.e. if you're using compression
[14:03:18] by having more nodes, you'll be adding stream concurrency
[14:03:25] and it'll move faster
[14:03:28] urandom: ok
[14:03:47] until you get to the last decommission of course, which will move at the same speed
[14:03:56] urandom: what about DTCS compression, how does treplication handle the thing ?
[14:04:13] joal: not sure what you mean
[14:04:28] urandom: about orders, does replication keep it ?
[14:04:33] yes
[14:04:40] (i guess so, but prefer to ask)
[14:04:43] yeah ok
[14:04:59] yeah, data that is streamed over gets pushed back through the compaction pipeline
[14:05:45] urandom: is there a way to manage the order of data streamed ?
[14:06:06] joal: nope
[14:06:09] urandom: Since we have backfilled data, there are some misordering in the current DTCS
[14:06:13] urandom: ok :)
[14:06:35] it's entirely possible DTCS isn't working the way you expect anyway
[14:06:40] it's not for us
[14:07:00] https://people.wikimedia.org/~eevans/20140401_to_20140404.svg
[14:07:08] mobrovac: ^^^ dunno if you saw this
[14:08:25] joal: when i get the time, i plan to look into an alternative (TimeWindowCompactionStrategy), that is probably a better fit
[14:08:58] urandom: ok, let me know when you start, I'd love to know more
[14:10:26] that's some strange compaction indeed
[14:10:52] mobrovac: tl;dr there is only one data tier, encompassing everything, evar
[14:11:03] mobrovac: or put another way, we're using STCS
[14:11:15] it seems so, yeah
[14:11:29] wasn't dtcs supposed to bound them?
[14:11:36] yeah
[14:11:44] it's probably read repair
[14:12:25] ok, but it should abide the same rules, right?
[14:12:44] what rules?
[14:13:08] compaction rules
[14:13:25] in my mind, dtcs should say "ok, that's enough, move on"
[14:13:32] a read repair that dredged up old data to repair a node would introduce that old write into a new sstable
[14:13:59] creating an sstable with a min timestamp from far in the past, and max timestamp of basically 'now'
[14:14:17] that would be its window, and would make it a candidate to merge with similar sstables
[14:14:19] ottomata: I was about to ask you a question for the heap size but I saw the code reviews :)
[14:14:26] that's what you can see happening in that graph
[14:14:30] ah sorry
[14:14:30] ja
[14:14:37] ah
[14:14:42] shoulda told you i was on it
[14:14:42] very nice that we have everything in hiera!
[14:14:45] ja!
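To make the read-repair effect urandom describes above concrete: DTCS groups SSTables into windows by their min/max write timestamps, so a single repaired old write landing in a freshly flushed SSTable gives that table a window stretching from the old timestamp to "now", and it then overlaps every other window. A toy illustration with made-up timestamps, not the real compaction code.

    # Toy model: each sstable is (min_write_ts, max_write_ts). Overlapping
    # windows are merge candidates, so one "wide" table drags old and new
    # data back into the same compaction.
    from datetime import datetime, timedelta

    now = datetime(2016, 4, 13)
    day = timedelta(days=1)

    # Healthy DTCS-ish layout: disjoint, time-ordered windows (illustrative values).
    sstables = [(now - 30 * day, now - 20 * day),
                (now - 20 * day, now - 10 * day),
                (now - 10 * day, now)]

    # A read repair writes a year-old cell into a brand-new sstable:
    repaired = (now - 365 * day, now)

    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    print([overlaps(repaired, s) for s in sstables])  # -> [True, True, True]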
[14:15:41] mobrovac: if you look at the green nodes at the very top of the graph, notice all the ones what have a timestamp interval of almost a year
[14:16:14] those are candidates for compaction with brand new tables, *and* very old tables, they effectively flatten the entire structure
[14:16:50] (the date-tiered structure, i mean)
[14:17:05] yeah, they get swallowed basically
[14:18:47] milimetric: here ?
[14:18:57] yep
[14:19:00] what's up
[14:19:14] quick review of pageview project doc
[14:19:17] milimetric: --^
[14:19:28] what about: If you want to filter by project, use the domain of any Wikimedia project, for example 'en.wikipedia.org', 'www.mediawiki.org' or 'commons.wikimedia.org'. If you are interested in all pageviews regardless of project, use all-projects.
[14:19:44] joal: perfect
[14:19:48] cool :)
[14:19:57] Update, then PR !
[14:21:17] ottomata: also, if you have time https://gerrit.wikimedia.org/r/#/c/282936/
[14:21:23] joal, elukey: https://phabricator.wikimedia.org/P2892
[14:21:25] so I'll start the puppet agent :)
[14:21:40] sstables/read is probably what is killing your read latency
[14:21:53] urandom: woa!
[14:21:55] on rotational disks, that is a lot
[14:22:03] yay thank you elukey!
[14:22:06] +2
[14:22:17] urandom: yeah, even without exact numbers, we knww it was that
[14:22:29] elukey: you think we should restart all NMs 1 by 1, or just wait until they need restart to apply that?
[14:22:32] joal: that is compaction doing a poor job
[14:22:48] urandom: there are almost 2TB of data to get accessed randomly ...
[14:23:00] * mobrovac wonders if there exists a cass prod install that's not on SSDs
[14:23:07] urandom: aggreed, it could do a better job
[14:23:08] joal: yeah, compaction is meant to make that much less random
[14:23:10] apart AQS, ofc
[14:23:10] :P
[14:23:29] * joal turns his back to mobrovac
[14:23:29] mobrovac: tons
[14:23:32] :D
[14:23:36] haha
[14:24:20] urandom: backfill has probably broken the DCTS I imagine ...
[14:24:22] ottomata: mmmm I thought that puppet would have done it for us, but probably no.. so I'd like to restart them now since a lot of hosts alarmed today :(
[14:24:48] Analytics-Kanban, Operations, Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2203239 (Milimetric) I'd say we should wait for @ezachte to figure out what we should do with those 2T. Pinging.
[14:25:13] no, it won't do it
[14:25:23] we don't let puppet refresh those
[14:25:32] ok elukey i'll do them one by one
[14:25:44] def works
[14:25:45] -Xmx2048m
[14:25:46] joal: can you paste the output of this somewhere? -- /home/eevans/compaction/print_sstables_info /var/lib/cassandra/data/local_group_default_T_pageviews_per_article_flat/data
[14:25:48] on nodemanager proc
[14:25:49] \o/
[14:25:52] * urandom is without root
[14:25:57] ottomata: if you want we can split
[14:26:30] urandom: I'm not root either
[14:26:32] elukey: naw s'ok, i'll just script it with a 3 minute sleep in between
[14:26:43] elukey: can you answer urandom request?
[14:28:20] actually...
[14:28:24] i can sudo -u cassandra
[14:28:39] elukey: hm, we used to have salt role grains we could easily target, i don't htink they work anymore
[14:28:44] i might make some grains now...:)
[14:28:50] * elukey likes grains
[14:28:56] hmm maybe
[14:29:14] urandom: I can run the command but should it be run on aqs100x?
[14:29:31] Analytics, RESTBase, Services, User-mobrovac: configure RESTBase pageview proxy to Analytics' cluster on wiki-specific domains - https://phabricator.wikimedia.org/T119094#2203247 (Milimetric) I think there are no real blockers to doing this other than the most likely painful bikeshed. So we can...
[14:30:14] elukey: nevermind, i got it: https://phabricator.wikimedia.org/P2893
[14:30:33] elukey, joal: that is really awful btw
[14:30:52] urandom: I could have guessed that :(
[14:31:26] there is a 581G file in there!
[14:31:39] urandom: Yayyyyyy :)
[14:31:45] Hurray for big fillllles
[14:32:12] I am very ignorant but what causes an sstable to grow like that?
[14:33:14] elukey: DTCS is supposed to create windows by (write) time, the idea being, that once the tables in a window reach a certain age, they'd stop being compacted with one another
[14:33:19] (or very rarely)
[14:33:22] elukey: compaction, which is cassandra's process of consolidating and ordering data
[14:33:34] old tables would get merged with old tables, new tables with new tables, etc
[14:34:22] elukey: here, it looks like out of order writes are doing to aqs, what they're doing to restbase, flattening out data tiers so that all data gets merged with all data
[14:34:35] Analytics-Kanban: Fix number formatting in charts - https://phabricator.wikimedia.org/T132579#2203281 (Milimetric)
[14:34:47] though, you can almost discern two tiers here, like the out-of-order writes are quite as routine as they are for us
[14:34:55] aren't, that is
[14:35:32] adding SSDs is a pretty brute-force approach
[14:36:12] urandom: Is there a way to tell cassandra about a timestamp info of of the it using the write one ?
[14:36:39] for cassandra to compact time-related data together whatever time it is written
[14:38:01] joal: not sure what you mean, do you mean can you tell compaction to use an alternative time?
[14:38:18] !log restarting hadoop-yarn-nodemanager on all hadoop worker nodes one by one to apply increase in heap size
[14:38:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[14:38:23] urandom: correct
[14:38:39] joal: nope, it's write time, everything else would be schema-specific
[14:39:01] joal: you can supply write time though
[14:39:02] urandom: ok
[14:39:07] you can specify it when making a write
[14:39:10] urandom: really?
[14:39:17] didn't know that
[14:39:28] yeah, but that's probably not going to help here
[14:39:41] ok
[14:40:01] urandom: the shame is that our data is very evenly partitionned based on time (daily)
[14:40:22] It could be easy to compact, let's say, monthly, and keep it like that
[14:40:25] DTCS makes the assumption that you access your data in order, and that that order matches the order it was written in
[14:40:38] joal: yeah, you need TWCS
[14:40:46] urandom: keep asking trivial questions - the motivation for the DTCS windows should be that new values will be found in more recent sstables or even in the memtable
[14:41:01] then you could game write-time when backfilling to preserve locality
[14:41:22] urandom: yay, that would sense indeed
[14:41:43] elukey: yes, as it merges smaller tables to create larger once, it does so ordered by writetime
[14:41:49] urandom: What about existing data - only way would be to reinsert, right ?
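What "you can supply write time" refers to: CQL lets the client set the cell write timestamp explicitly with USING TIMESTAMP, so a backfill job could stamp rows with the time the data is about rather than the wall-clock time of the insert, which is the "game write-time when backfilling to preserve locality" idea above. A hedged sketch with the DataStax Python driver; the host, keyspace, table and columns are made-up placeholders, not the real AQS schema.

    # Sketch: backfill with an explicit write timestamp (microseconds since epoch),
    # so compaction's notion of "write time" matches the data's logical time.
    import calendar
    from datetime import datetime
    from cassandra.cluster import Cluster  # assumes the DataStax python driver

    session = Cluster(["aqs-host.example"]).connect("example_keyspace")

    logical_day = datetime(2015, 7, 1)
    write_ts_us = calendar.timegm(logical_day.timetuple()) * 1000000

    session.execute(
        "INSERT INTO pageviews_example (article, day, views) "
        "VALUES (%s, %s, %s) USING TIMESTAMP " + str(write_ts_us),
        ("Example_article", logical_day, 1234),
    )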
[14:42:28] elukey: if your data is ordered by time, and if you are writing in time-order, then that will make queries answerable in the fewest possible tables, because the data is contiguous
[14:42:52] elukey: newer results in new tables, older results in older tables
[14:43:06] joal: no, you can recompact
[14:43:30] urandom: but recompact would be based on write time :S
[14:43:30] joal: assuming a move to TWCS, where recompacting would separate into time-based windows
[14:44:05] joal: in DTCS, the damage is done
[14:45:20] joal, elukey: the other thing you could do here is host more Cassandra nodes
[14:45:50] more nodes, with a lower node density would bring down the sstables/read too
[14:46:13] that probably means running multiple instances of Cassandra (which is what we do on the RESTBase cluster)
[14:48:44] ah nice...
[14:49:00] Analytics-Cluster, Operations, Traffic, HTTPS: HTTPS redirects for stats.wikimedia.org - https://phabricator.wikimedia.org/T132465#2203363 (BBlack) I don't see any mixed content in simple checks (and we checked/fixed that in a much earlier ticket: (T93702). Since this site is clearly for human c...
[14:56:28] hey joal, cmjohnson swapped the faulty disk on aqs1001
[14:56:48] we haven't been able to figure out how to make the OS recognize the hot swapped device
[14:56:55] when we've done this before we've rebooted nodes
[14:57:01] so, i think we need to reboot aqs1001
[14:57:12] ottomata: no problem for me
[14:57:17] waaaiiitttt
[14:57:27] will coordinate!
[14:57:31] elukey: unless you know how to make it show up!
[14:57:38] we need /dev/sdh
[14:57:40] back
[14:57:42] ottomata: have you killed/restarted namenodes or something ? there are lot of jobs errors :(
[14:57:48] the last time we stopped cassandra on aqs1001 the pageview API stoppeD!
[14:57:53] ottomata, joal --^
[14:57:54] joal: i am restarting one nodemanager every 3 mins
[14:58:18] ottomata: ok, jobs have failed (see alerts)
[14:58:22] I think we fixed the quorum problem, but we need to test it :D
[14:58:42] so please before rebooting just stop restbase and cassandra with a nodetool drain before
[14:58:47] joal: i see one refine job
[14:58:48] fail?
[14:58:49] ja?
[14:58:49] and see what happens
[14:59:11] haha, elukey i'm not rebooting it yet!
[14:59:14] coordinating with yall
[14:59:34] k ottomata
[14:59:51] ottomata: on hadoop, 3 load and 1 refine failed
[15:00:02] !log restarting failed jobs
[15:00:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[15:00:11] ottomata: I knooowww just wanted to tell you
[15:00:54] hehe :)
[15:02:26] Analytics-Cluster, Analytics-Kanban, Operations, Patch-For-Review: setup stat1004/WMF4721 for hadoop client usage - https://phabricator.wikimedia.org/T131877#2203442 (elukey) stat1004 is ready for a test! @Ottomata would you be the first one to try it? :P
[15:02:30] joal: hmm, ok, sorry about that
[15:02:44] ottomata: ok, restart, done :)
[15:02:49] there are still more restarting, this shouldn't cause failed jobs though...although
[15:02:55] i see some of the restarts say
[15:03:02] nodemanager did not stop gracefully after 5 seconds: killing with kill -9
[15:03:17] ottomata: might be the thing
[15:03:39] elukey: stat1004 works for me!
[15:03:53] ottomata: happened to me too
[15:04:00] elukey: is there a ticket open for these timeouts/high latency?
[15:04:15] urandom: good point, not sure.. joal?
[15:04:56] urandom: I'll try to study a bit cassandra compaction and then I'll re-read what you wrote, it makes sense but there are some question marks
[15:04:59] :D
[15:05:21] urandom: https://phabricator.wikimedia.org/T124314
[15:05:31] But it's not yet in our task list
[15:06:00] elukey: sorry, i'm never sure how much detail to go into in a venue like this; it's a rather broad subject
[15:06:11] elukey: i wonder if the ones that aren't restarted nicely are ones that would soon OOM
[15:06:56] ottomata: nah because it was happening for each restart that I did in the past, I wouldn't worry
[15:07:04] oh really?
[15:07:07] yep
[15:07:07] not all of them do that though
[15:07:37] I also tried with a higher timeout in salt, but those 5 seconds are somewhere in the stop script probably (never checked)
[15:07:51] urandom: nope I have to thank you, leaning tons of things :)
[15:10:06] ah elukey the stats user commits to gerrit via http auth
[15:10:07] not ssh
[15:10:34] hmmm geowiki pull?
[15:10:34] hmm
[15:10:37] need to look into this
[15:10:45] ahh so many things going on at once this morn! :)
[15:11:12] ottomata: yeah I patched it on the fly to make things working again!
[15:11:26] joal: what are typical AQS queries like? what kind of time range do they cover?
[15:11:40] Analytics-Cluster, Operations, Traffic, HTTPS: HTTPS redirects for stats.wikimedia.org - https://phabricator.wikimedia.org/T132465#2203508 (Ottomata) +1 should be fine to do.
[15:12:48] urandom: we have not looked into those stats much, so can't say
[15:16:04] joal: that would influence the sstables/read too, if the data spanned very large periods of time, then it follows it will be spread across more files
[15:16:09] urandom: I can do an analysis onto that
[15:17:10] Analytics, Operations, hardware-requests, Patch-For-Review: eqiad: (3) AQS replacement nodes - https://phabricator.wikimedia.org/T124947#2203525 (Eevans)
[15:18:18] joal: keep me in the loop, super interested :)
[15:20:46] ottomata: have you restarted analytics1055 by any chance?
[15:22:14] Analytics-Kanban: backfill pageviews for iOs app - https://phabricator.wikimedia.org/T132589#2203538 (JAllemandou)
[15:23:27] Analytics-Kanban: Upgrade scripts to facilitate wiki data loading / treatment on hadoop - https://phabricator.wikimedia.org/T132590#2203553 (JAllemandou)
[15:23:48] Analytics-Kanban: Upgrade scripts to facilitate wiki data loading / treatment on hadoop - https://phabricator.wikimedia.org/T132590#2203565 (JAllemandou) a:JAllemandou
[15:27:17] o/ joal
[15:27:22] I just saw https://github.com/wikimedia-research/research-cluster/pull/5
[15:27:24] \o/
[15:27:27] \o halfack :)
[15:27:33] -c
[15:27:34] :D
[15:27:42] halfak: currently applying it to frwiki 20160305
[15:27:47] Great!
[15:28:14] halfak: will let you know if successful tomorrow or the day after (depending of how fast the thing runs)
[15:28:41] mwarf, sorry halfak (i do the c one regularly ...)(
[15:29:10] joal, cool. Sounds great.
[15:35:46] Analytics-Cluster, Operations, Traffic, HTTPS, Patch-For-Review: HTTPS redirects for stats.wikimedia.org - https://phabricator.wikimedia.org/T132465#2203641 (BBlack) Open>Resolved a:BBlack
[15:46:15] yurik: Hi
[15:46:16] Analytics-Cluster, Analytics-Kanban, Operations, ops-eqiad: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2203677 (Ottomata)
[15:47:08] yurik: the jobs you're launched today gave pain to the cluster
[15:47:29] yurik: one by one instead of everything at once would be better
[15:49:56] Analytics, DNS, Operations, Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2203682 (Nuria)
[15:50:34] (CR) Milimetric: Change float to Decimal in dynamic_pivot.py (1 comment) [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/282728 (owner: Mforns)
[15:50:36] Analytics: Move datasets.wikimedia.org to analytics.wikimedia.org/datasets - https://phabricator.wikimedia.org/T132594#2203684 (Nuria)
[15:50:51] Analytics, DNS, Operations, Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2197243 (Nuria)
[15:50:54] Analytics: Move datasets.wikimedia.org to analytics.wikimedia.org/datasets - https://phabricator.wikimedia.org/T132594#2203696 (Nuria)
[15:51:26] Analytics: Move datasets.wikimedia.org to analytics.wikimedia.org/datasets - https://phabricator.wikimedia.org/T132594#2203684 (Nuria) Move datasets.wikimedia.org to analytics.wikimedia.org/datasets
[15:53:22] Analytics: Get piwik stats for dashiki - https://phabricator.wikimedia.org/T126247#2203708 (Nuria) Open>Resolved
[16:00:10] Stuck in traffic again a-team, running few minutes late to standup
[16:02:18] madhuvishy: k, please send e-scrum if you think you cannot make it
[16:03:00] Analytics-Cluster, Analytics-Kanban: Experiment with new Kafka versions and verify that they work with existing clients - https://phabricator.wikimedia.org/T132595#2203733 (Ottomata)
[16:03:20] nuria: I'll make it I think
[16:05:37] Analytics, Analytics-Cluster: Use MySQL as Hue data backend store - https://phabricator.wikimedia.org/T127990#2203754 (Ottomata)
[16:05:39] Analytics-Cluster, Operations, hardware-requests, netops: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2203753 (Ottomata)
[16:06:10] Analytics, Analytics-Cluster: Use MySQL as Hue data backend store - https://phabricator.wikimedia.org/T127990#2060494 (Ottomata) Let's do this before we move Hive/Oozie to analytics1003 in T130840
[16:06:12] Analytics-Kanban: Upgrade scripts to facilitate wiki data loading / treatment on hadoop - https://phabricator.wikimedia.org/T132590#2203760 (JAllemandou) https://github.com/wikimedia-research/research-cluster/pull/5
[16:07:27] Analytics, DNS, Operations, Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2197243 (BBlack) If we're doing this in production, the frontend should probably be through cache_misc. I'm not sure what the backend looks like at all role/software-wise...
[16:10:21] Analytics: Better response times Pageview API - https://phabricator.wikimedia.org/T124314#2203782 (Eevans) I briefly looked at the Cassandra cluster, and the typical column family read latency (this just the time Cassandra records for local storage-system reads) is quite high. {F3869428} {F3869443} At lea...
[16:11:26] Analytics-Kanban: Better response times Pageview API - https://phabricator.wikimedia.org/T124314#2203783 (JAllemandou) a:JAllemandou
[16:35:34] !log rebuilding raid1 array on aqs1001 after hot swapping sdh
[16:35:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[16:52:51] urandom: I'm going to invest more time on compaction, trying to help you
[16:53:26] urandom: will ping you again tomorrow to discuss around the things I'll have understand
[16:54:39] joal: ok, let me know!
[16:55:20] urandom: about upgrade, nothing more than what we discussed: waiting for new nodes on our side, and since no ETA, difficult to really plan
[16:58:19] a-team, logging off for today !
[16:58:25] see you tomorrow :)
[16:58:28] nite!
[16:58:33] bye joal!
[17:21:29] logging off team! byyeeee
[17:42:27] milimetric: 20 daily visits to browser reports WOW!
[17:42:51] Analytics, Hovercards, Reading-Web-Sprint-70-Lady-and-the-Trumps, Reading-Web-Sprint-71-m: Capture hovercards fetches as previews in analytics - https://phabricator.wikimedia.org/T129425#2204138 (bmansurov)
[17:43:01] milimetric: although we have not announced it yet though
[17:43:05] Analytics, Hovercards, Reading-Web-Sprint-70-Lady-and-the-Trumps, Reading-Web-Sprint-71-m: Capture hovercards fetches as previews in analytics - https://phabricator.wikimedia.org/T129425#2105269 (bmansurov) a:bmansurov
[17:45:36] Analytics-Kanban, Commons, Multimedia, Wikidata, and 2 others: Allow tabular datasets on Commons (or some similar central repository) (CSV, TSV, JSON, XML) - https://phabricator.wikimedia.org/T120452#2204148 (Yurik) The above patch allows tabular data storage with string, numeric, and "localized"...
[18:00:43] nuria_: yeah, we better watch out, it's gonna get popular. That's what we get for building useful tools
[18:00:53] have you all seen this: http://classicprogrammerpaintings.tumblr.com/
[18:00:57] it's *amazing*
[18:02:12] “Junior programmer learns `git rebase –interactive`“ - Salvador Dalí, 1936, Oil on Canvas
[18:02:19] jajaja AWESOME
[18:03:01] SUPER AWESOME
[18:07:46] Analytics, Hovercards, Unplanned-Sprint-Work, Reading-Web-Sprint-70-Lady-and-the-Trumps: Capture hovercards fetches as previews in analytics - https://phabricator.wikimedia.org/T129425#2204242 (MBinder_WMF)
[18:11:35] Analytics, Hovercards, Unplanned-Sprint-Work, Reading-Web-Sprint-70-Lady-and-the-Trumps: Capture hovercards fetches as previews in analytics - https://phabricator.wikimedia.org/T129425#2204276 (bmansurov) How should we treat the user's DoNotTrack setting in this context? I suppose the header will...
[18:15:38] yoo elukey still there?
[18:20:51] madhuvishy: looking at beeline now
[18:20:56] ottomata: coool
[18:21:06] i haven't tested the puppet patch itself
[18:21:31] ja, i think cdh/hive.yaml is the wrong place for those settings but ja
[18:21:38] oh
[18:21:46] ya i wasn't sure where to put it
[18:23:47] probably a new file uhhh role/analytics_cluster/hive/client.yaml i thiink
[18:24:06] then you can just look them up in the role by their var names
[18:24:15] probably just hive_server_host is good
[18:24:29] and, you don't need a new role class for the beel,ine script
[18:24:38] just render it from hive::client role
[18:24:51] hm actually
[18:25:05] ahh
[18:25:10] you can make the values of hive_server_host default to metastore_host
[18:25:13] so
[18:25:13] uhh
[18:25:28] hiera('hive_server_host', hiera('cdh::hive::metastore_host') maybe
[18:25:30] i think that woudl work
[18:25:31] actually
[18:25:36] no, you don't have to look it up in hiera
[18:25:58] you can grab it from the inclusion of the hive::client class directly in puppet
[18:25:59] pretty sure
[18:26:03] so in hive::client role
[18:26:04] ottomata: okay
[18:26:14] hiera('hive_server_host', $::cdh::hive::metastore_Host)
[18:26:29] looking at script...
[18:26:33] ottomata: okay
[18:26:38] i'll make the changes
[18:28:55] oh, madhuvishy btw, the templates in the role module live in modules/role/templates/...
[18:29:02] not modules/role/manifests/...
[18:29:06] aah
[18:29:08] that's why
[18:29:09] okay
[18:29:13] will move it
[18:29:17] k :)
[18:29:47] mm, since you are doing python might as well pep8 /lint this too
[18:38:37] ottomata: as in just run a check?
[18:38:54] puppet will do it, but ja you got lines > 80 chars, etc.
[18:39:59] oh
[18:40:19] right, fixing
[18:56:41] ottomata: pushed new patch
[19:05:58] madhuvishy: i might have a sneakier way of parsing these args...
[19:06:14] ottomata: ha ha what
[19:10:51] madhuvishy: https://gist.github.com/ottomata/2d4fdd5edeb838bac9666781e1b569dc
[19:11:02] oh lemme add some comments
[19:11:43] HMM oh wait
[19:11:47] might not work....
[19:12:25] ottomata:
[19:12:26] well
[19:12:43] I thought docopt won't let you accept other args
[19:12:48] that you don't specify
[19:13:21] ja i have been testing this with non opt args
[19:13:26] ones without - or --
[19:13:27] that works fine
[19:13:39] buuut ya i think this doesn't work if you want to give it more opts with --
[19:13:41] without specificying them
[19:13:42] :/
[19:13:45] yup
[19:13:49] that's why
[19:13:56] doh!
[19:13:58] k
[19:14:04] i had to stick to list of strings
[19:21:11] ottomata: you are also missing that long options have a different format than short options
[19:21:24] can't be space separated, have to be option=value
[19:21:52] aye ja
[19:22:02] indeed
[19:29:49] madhuvishy: added some comments on the script on patch set 2
[19:29:57] i may be not understanding the code, so correct me if i'm wrong
[19:37:57] ottomata: okay
[19:44:02] ottomata: o/ just seen the message
[19:44:45] elukey: hey meant to ask before standup started if you had time to do the hue in mysql task
[19:46:08] ottomata: haven't looked at it yet but I can start tomorrow
[19:46:40] ok awesome! i'll assign to you thank Youuu!
[19:47:00] Analytics, Analytics-Cluster: Use MySQL as Hue data backend store - https://phabricator.wikimedia.org/T127990#2204673 (Ottomata) a:elukey
[19:50:41] ottomata: you're right - fixed and updated patch
[19:51:20] ottomata: mmmm but where should I put the mysql db? On the same host?
[19:52:48] elukey: i think you can use it remotely...if that is possible
[19:52:57] the one that is already running hue and oozie is fine
[19:53:01] sorry
[19:53:08] the mysql db that is used for oozie and hive
[19:53:27] ahhhh okok adding a schema in there + tables
[19:53:35] all right I'll double check tomorrow :)
[19:54:08] ja, i'm not sure how to do it, but i think there are good instructions in cloudera docs
[19:54:15] yep yep
[19:54:20] wherever they talk abouyt installing hue
[19:54:35] will probably have to set it up in labs to test :/
[19:55:11] madhuvishy: this looks good!
[19:55:49] ottomata: cool! can we test it somehow before merging?
[19:55:58] i'm guessing no
[19:56:08] hm, could im running puppet compiler on it now
[19:56:12] also ottomata I acked a bit around the puppet compiler, I found a procedure to make the submodules work.. basically git submodule update --init as first step, cd modules/submodule_name dir, git fetch and checkout (that created a DETACHED HEAD commit), then git checkout -B test, then cd ../../ and finally git add modules/submodule_name
[19:56:14] i think that will be good enough for this
[19:56:16] ya okay
[19:56:35] a bit of a hack but works doing it manually
[19:56:44] elukey: with a new submodeul?
[19:56:54] nono existing one
[19:57:01] to test a new change
[19:57:05] hm, why is --init needed?
[19:57:30] I think that each time the puppet dirs are re-recreated in temp
[19:57:33] ohhh
[19:57:35] ja
[19:57:37] makes sense
[19:57:41] so I assumed to be a fresh start each time
[19:57:44] never hurts :)
[19:57:51] why -B test?
[19:58:06] and why git add?
[19:58:16] it creates a branch with the new change on top, that will be recognized as submodule change
[19:58:28] rather than relying on a DETACHED HEAD
[19:58:42] (basically only a commit referring to the last SHA as parent)
[19:59:39] madhuvishy: https://puppet-compiler.wmflabs.org/2428/stat1002.eqiad.wmnet/change.stat1002.eqiad.wmnet.err
[19:59:56] is that needed to test puppet?
[19:59:56] ottomata: ah
[19:59:59] elukey: ?
[20:00:28] yeah I know it sounds weird, but it is the only clean way I found to trick git :P
[20:00:43] madhuvishy: role/analytics_cluster/...
[20:00:52] oh
[20:00:53] right
[20:01:01] hm, elukey won't it just let you checkout the sha directly?
[20:01:09] and then run puppet?
[20:01:12] puppet doesn't care about git
[20:01:14] no?
[20:02:07] ottomata: not sure, could also be an option, I wanted to be super sure that the change was recognized in a clean state.. but I'll do some testing tomorrow!
[20:02:27] ok cool!
[20:02:40] talk with you tomorrow!!
[20:02:42] byyeeee!
[20:02:55] madhuvishy: keep fighting puppet :)
[20:03:19] ottomata: running https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/2429/
[20:03:28] elukey: I will! good night :)
[20:04:02] elukey: byee
[20:05:48] ottomata: looks good i think? https://puppet-compiler.wmflabs.org/2429/stat1002.eqiad.wmnet/change.stat1002.eqiad.wmnet.err
[20:06:25] ja looks good!
[20:06:28] let's try it!
[20:06:42] ottomata: coool
[20:07:04] uhh
[20:07:05] ottomata:
[20:07:08] ja?
[20:07:17] oh
[20:07:18] nothing
[20:07:26] minor mental scare - but it's correct
[20:08:37] hah ok
[20:14:21] Analytics-Kanban, Operations, Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2204736 (ezachte) I can't login right now to check. The vast majority of that 2TB will be backups, which I thin out every half year or so. All html files in htdocs should be co...
[20:14:37] madhuvishy: running puppet
[20:14:48] excited
[20:15:02] madhuvishy: maybe try it from stat1004! :)
[20:15:07] ooh
[20:15:22] i haven't logged in to that instance yet
[20:18:57] Hmmm seems to hang?
[20:19:32] ottomata: oh i can't get into both stat instances now - something's up with my ssh - fixing
[20:19:38] bast1001 changed
[20:19:39] is all
[20:19:43] it was reinstalled
[20:19:51] yeah
[20:19:53] fixed now
[20:21:08] ottomata: hmmm
[20:21:13] why is it hanging :/
[20:21:14] hangs for you too?
[20:21:15] yeah dunno
[20:21:39] madhuvishy: you had it working on stat1002 manually before?
[20:21:46] ya
[20:21:48] hm
[20:21:53] i didn't test the very last changes though
[20:22:06] ja ut the args look good i thikn
[20:22:23] OH
[20:22:28] madhuvishy: unqualified poath i think
[20:22:36] its prob infinite looping
[20:22:41] execing /usr/local/bin/beeline
[20:22:45] since that is now in the path
[20:22:50] you need to make it exec /usr/bin/beeline
[20:22:53] instead of just 'beeline'
[20:23:33] just tested, ja that works
[20:23:37] you patch I merge?
[20:23:41] ohh
[20:23:53] of course
[20:24:02] so no need to execvp
[20:24:54] patching
[20:25:55] oh eh?
[20:25:56] execvp needed
[20:26:02] you just need to execvp /usr/bin/beeline
[20:26:04] not just 'beeline'
[20:26:08] ottomata: p is for path
[20:26:12] oh
[20:26:13] ok
[20:26:14] i put that so it'd figure it out
[20:26:18] but obviously
[20:26:23] not if same name
[20:26:23] oh ok, still execing htough ok
[20:26:28] yes yes
[20:26:32] execv
[20:28:21] ottomata: patched
[20:29:19] madhuvishy: that works? not /usr/bin/beeline for first arg to execv too?
[20:29:43] gah
[20:29:47] of course
[20:29:57] no my bad
[20:30:38] ottomata: sorry fixed. thank god for CRs
[20:33:58] :) running puppet
[20:37:28] coooll madhuvishy works!
[20:37:30] ottomata: yay it works
[20:37:31] :D
[20:44:24] HaeB: you can now do just 'beeline' and the connection string and user are set automatically
[20:44:38] it also defaults outputformat to tsv2
[20:45:09] i don't like the table thing beeline does by default otherwise
[20:46:19] Analytics-Kanban, Operations, Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2204894 (Dzahn) If that is mostly HTML and text let me try to compress that, we should achieve a high compression ratio and maybe it's not so bad then.
[21:06:58] Analytics, Analytics-Cluster: Debianize Kafka 0.9 - https://phabricator.wikimedia.org/T132631#2204960 (Ottomata)
[21:07:05] Analytics, Analytics-Cluster: Debianize Kafka 0.9 - https://phabricator.wikimedia.org/T132631#2204975 (Ottomata)
[21:17:54] milimetric: There is a repo request for https://phabricator.wikimedia.org/T120497
[21:18:07] The developer requested 'wikistats' as name for the repo.
[21:18:17] oops, I missed that
[21:18:21] ha, yeah, that won't work :)
[21:18:30] I told him that this name already has a meaning for WMF.
[21:18:55] So he suggested:
[21:19:13] wikipagestats
[21:19:23] Which is different, but still close.
[21:19:40] Since the task reads like the A-team is somehow in the mix here ...
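Putting the pieces of the beeline-wrapper discussion together: docopt cannot pass through unknown --options, long options have to be forwarded as --option=value rather than space-separated, and the wrapper must exec the real binary by absolute path or it re-execs itself, since /usr/local/bin/beeline shadows /usr/bin/beeline on the PATH. A minimal sketch of such a wrapper; the JDBC host below is a placeholder and the real script in operations-puppet may differ in its defaults.

    #!/usr/bin/env python
    # Sketch of a /usr/local/bin/beeline wrapper: prepend default connection
    # options, forward user args untouched as an opaque list of strings, and
    # exec the real beeline by absolute path (a bare 'beeline' would resolve
    # back to this wrapper and loop forever).
    import getpass
    import os
    import sys

    HIVE_SERVER = "hive-server.example.eqiad.wmnet"  # placeholder hostname

    defaults = [
        "-u", "jdbc:hive2://{0}:10000".format(HIVE_SERVER),
        "-n", getpass.getuser(),
        "--outputformat=tsv2",   # long options must be --option=value
    ]

    os.execv("/usr/bin/beeline", ["/usr/bin/beeline"] + defaults + sys.argv[1:])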
[21:19:53] https://www.mediawiki.org/wiki/Git/New_repositories/Requests
[21:19:58] ^ there's the repo request
[21:20:29] https://www.mediawiki.org/wiki/User_talk:QChrisNonWMF#Ambigous_name_wikistats_for_Gerrit_Repo
[21:20:41] ^ is the response to my pushing back on the name "wikistats"
[21:20:59] I figured the A-team might have an opinion about where this new repo should live
[21:21:11] (e.g.: underneath 'analytics/*' ?)
[21:21:17] and maybe also what it should be called?
[21:21:27] hm :/ thanks for catching this, I would've rather Eggeda talked about it on the phab ticket so we could all see
[21:21:41] I've been following that ticket, and promised to help, but hadn't heard anything and assumed the project was dead
[21:21:52] Repo requests are still on wiki. It's still a split world.
[21:22:09] ok, no prob, I didn't know I'll watch that page now
[21:22:37] so... yeah, I'm not sure if it should live under analytics/
[21:22:40] No need to watch that page. I am doing the Repo requests. If something analytics related comes up, I'll just ping you.
[21:23:06] (It's mostly extensions etc)
[21:23:06] I don't necessarily want to sign up to maintain this new tool, and there's already another one built that is getting very popular and has pledged support from MusikAnimal
[21:23:28] Hi! anyone have an idea about how to take a quick, representative sample of event logs from hive over the course of 4-5 months?
[21:23:32] but then again, anywhere else would seem random
[21:23:54] If it ends up as labs tool, then we can just put it there.
[21:23:59] AndyRussG: tablesample? Depending on what fields you need, go against pageview_hourly instead of webrequest
[21:24:16] qchris: ok, so analytics/wikipagestats seems ok with me
[21:24:26] Ok. Cool. Then I'll create that.
[21:24:28] Thanks!
[21:24:32] milimetric: I dunno tablesample... Not web logs, EventLog event results... :)
[21:24:33] thank you for the ping
[21:24:41] oh EventLog...
[21:24:43] uh...
[21:25:32] AndyRussG: why Hive?
[21:25:51] madhuvishy: ah well no, doesn't have to be Hive. That's just where I've gotten these event results from before
[21:26:37] oh interesting - you used hive to query EL tables? that's a first :)
[21:26:45] not a first! people do it!
[21:26:47] ummm
[21:26:49] maybe not..
[21:26:49] haha
[21:26:49] ellery uses spark!
[21:26:55] madhuvishy: yes, it's sort of a first, Andy is querying the only blacklisted schema
[21:26:57] ottomata: :P
[21:27:02] right
[21:27:03] ahhh :)
[21:27:07] (ellery's getting the same data)
[21:27:17] ok, so where is that in Hive?
[21:27:19] that reason i get
[21:27:22] I don't see any tables in wmf or wmf_raw
[21:27:38] milimetric: there are no tables
[21:27:45] madhuvishy: maybe I'm remembering wrong
[21:27:48] you have to make them
[21:27:50] ah, spark is the only way then, I see
[21:27:55] no you can use hive
[21:27:57] if you make tables
[21:27:59] milimetric: no - you can use a UDF
[21:28:05] but they will be external tables on json
[21:28:06] let me find wiki
[21:28:08] spark is easier
[21:28:19] does spark have something like tablesample?
[21:28:20] https://wikitech.wikimedia.org/wiki/Analytics/EventLogging#Hive
[21:28:29] it has sample
[21:28:34] http://spark.apache.org/docs/1.5.0/programming-guide.html
[21:28:38] sample(withReplacement, fraction, seed)
[21:28:41] but
[21:28:42] I don't actually ahve to sample, I mean, I could just run a loooong query
[21:28:42] in either case
[21:28:50] its not going to sample smartly
[21:28:57] because the data is just in json
[21:29:03] not bucketed in anyway
[21:29:08] like the refined webrequest data
[21:29:19] AndyRussG: 4-5 months worth data? hmmm doesn't seem like it'll go well for the cluster
[21:29:20] so, any sampling you are doing will still have to read all the data (pretty sure)
[21:29:26] right, but if you make a hive table partitioned by time you can at least sample over time
[21:29:27] well, eventlogging data mioght not be that bigt
[21:29:40] yes, it's not huge, but it's not tiny
[21:29:41] true
[21:29:47] which schema AndyRussG?
[21:30:12] CentralBannerHistory
[21:30:12] OO, AndyRussG you can be our first guinea pig! can you log into stat1004.eqiad.wmnet?
[21:30:18] new server, dedicated for working with hadoop stuff
[21:30:22] CentralNoticeBannerHistory
[21:30:28] so you don't have to compete with resources when folks arerunning heavy local stuff on stat1002
[21:31:08] for reference, the only blacklisted one: https://github.com/wikimedia/operations-puppet/blob/f4420ead5ea65e5ccf92e27ab5fe726cb38c0660/hieradata/common.yaml#L282
[21:31:40] 8.1 G in 2016
[21:31:46] not too bad
[21:31:58] oh true, why'd we blacklist that? :)
[21:32:11] oh well, right
[21:32:12] we prune
[21:32:14] that's just last 3 months i think
[21:32:22] true
[21:32:26] right, that's still a lot smaller than Edit
[21:32:31] so they don't really have 4 months data
[21:32:35] * milimetric brb, seems like yall got this
[21:32:35] aye, dunno, maybe it used to be bigger?
[21:32:38] anywhere
[21:32:49] aye
[21:32:53] ottomata: ya may be during fundraising times
[21:33:09] AndyRussG: so my advice is to use Hive and TABLESAMPLE, here's how: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling
[21:33:15] we don't have december data now
[21:33:21] which was probably huge
[21:33:30] but maybe we kept that for this schema?
[21:33:37] milimetric: no
[21:33:39] unless ellery was done with it
[21:33:59] oh? i don't think we have special rules for not dropping EL data in Hive
[21:34:01] don't remember any special casing
[21:34:02] yeah
[21:34:08] k, so then we don't have 4-5 months
[21:35:29] oh hmm I thought it was kept a while!
[21:36:31] madhuvishy: milimetric: ottomata: didn't realize this was such a fun and extended question! I'm actually starting a meeting, but I'll see any backscroll, and hopefully we can talk more soooon! Thanks so much !!! :)
[21:41:02] AndyRussG and all: i would be interested to learn how it goes (perhaps let Analytics-l know?); would be eager to use Hive for EL, too. Back in January, mforns successfully walked me through the steps at https://wikitech.wikimedia.org/wiki/Analytics/EventLogging#Hive , but then we got stuck at aggregating the resulting partitions into one queryable table
[21:44:36] Analytics-Kanban, Operations, Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2205087 (Ottomata) > 2.0T wikistats Ja, this is why we don't backup! Too big! stat1001 is in the analytics cluster...so technically we could use HDFS as a holding pen. Migh...
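A rough sketch of the Hive + TABLESAMPLE route milimetric suggests above: create an external table over the raw EventLogging JSON, add time partitions, and sample on rand(). Everything below is illustrative only; the SerDe, HDFS location and columns are placeholders rather than the exact recipe on the wikitech page, and partitions still need to be registered (e.g. ALTER TABLE ... ADD PARTITION) before querying.

    # Sketch: external Hive table over raw EventLogging JSON plus a ~1% sample,
    # run through beeline (which now carries the connection defaults).
    import subprocess

    DDL = """
    CREATE EXTERNAL TABLE IF NOT EXISTS andyrussg.CentralNoticeBannerHistory (
      `event` string, `dt` string, `webhost` string
    )
    PARTITIONED BY (year int, month int, day int, hour int)
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    LOCATION '/path/to/raw/eventlogging/CentralNoticeBannerHistory'
    """

    SAMPLE = """
    SELECT *
    FROM andyrussg.CentralNoticeBannerHistory
      TABLESAMPLE (BUCKET 1 OUT OF 100 ON rand())
    WHERE year = 2016
    """

    for hql in (DDL, SAMPLE):
        subprocess.check_call(["beeline", "-e", hql])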
[21:45:56] AndyRussG: I think spark will be easier than hive
[21:46:02] less dealing with partitions
[21:46:34] ottomata: ok I've never used spark yet, but that sounds fun to work on!
[21:48:48] madhuvishy: (beeline) great! appreciating not having to look up the parameters
[21:49:38] ... and i appreciate the switch to TSV too... those ASCII art tables look nice, but can't be posted directly into a spreadsheet
[21:49:59] HaeB: :) Yes! annoying to specify everytime too
[21:50:41] AndyRussG: spark is fun! its fun to play with on the REPL
[21:50:58] ping me or madhuvishy if you are trying and want help
[21:51:13] I hope we can make all of this more fun with https://wikitech.wikimedia.org/wiki/PAWS/Internal
[21:51:15] ottomata: fantastic thx, will do fer sure
[21:51:49] madhuvishy: cool!
[21:55:26] (PS1) Milimetric: Allow optional exclusion of columns from parsing [analytics/dashiki] - https://gerrit.wikimedia.org/r/283336 (https://phabricator.wikimedia.org/T132579)
[21:55:56] (CR) Milimetric: [C: 2 V: 2] "self-merging 'cause this is a prod bug. Feel free to review after the fact." [analytics/dashiki] - https://gerrit.wikimedia.org/r/283336 (https://phabricator.wikimedia.org/T132579) (owner: Milimetric)
[21:56:29] (CR) Milimetric: "This goes with this diff: https://meta.wikimedia.org/w/index.php?title=Config%3ASimpleRequestBreakdowns&type=revision&diff=15524226&oldid=" [analytics/dashiki] - https://gerrit.wikimedia.org/r/283336 (https://phabricator.wikimedia.org/T132579) (owner: Milimetric)
[22:13:28] madhuvishy: oh, had not seen that you had written this up into a proposal. that would be so awesome!
[22:14:00] HaeB: I'm gonna be experimentally working on it this quarter - will see how it goes.
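And the Spark route ottomata recommends, roughly: read the raw EventLogging JSON for the schema directly, let Spark infer the structure, and take a small random sample with sample(withReplacement, fraction, seed) as in the 1.5 programming guide linked earlier. A sketch for the pyspark shell (where sqlContext is predefined); the HDFS paths are placeholders.

    # Sketch for the pyspark REPL (Spark 1.5-era API): sample ~1% of the
    # CentralNoticeBannerHistory JSON events.
    df = sqlContext.read.json(
        "hdfs:///path/to/raw/eventlogging/CentralNoticeBannerHistory/*/*")

    sampled = df.sample(withReplacement=False, fraction=0.01, seed=42)
    print(sampled.count())
    sampled.write.json("hdfs:///user/andyrussg/cnbh_sample")  # or .toPandas() if small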