[05:20:03] btw I created https://phabricator.wikimedia.org/T143254 for Erik yesterday
[07:31:05] Analytics-Kanban: Compile a request data set for caching research and tuning - https://phabricator.wikimedia.org/T128132#2563453 (Danielsberger) The 1 day data set looks great and the trends (e.g., for one-hit-wonders) follows the expectations from previous comments. - "backwards consistent": I've compared...
[12:09:53] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:09:56] PROBLEM - YARN NodeManager Node-State on analytics1036 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:11:42] PROBLEM - Hadoop HDFS Zookeeper failover controller on analytics1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.tools.DFSZKFailoverController
[12:11:45] PROBLEM - YARN NodeManager Node-State on analytics1056 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:11:48] PROBLEM - YARN NodeManager Node-State on analytics1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:11:52] PROBLEM - YARN NodeManager Node-State on analytics1048 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:11:56] PROBLEM - YARN NodeManager Node-State on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:11:59] PROBLEM - YARN NodeManager Node-State on analytics1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:12:01] PROBLEM - YARN NodeManager Node-State on analytics1029 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:12:04] PROBLEM - YARN NodeManager Node-State on analytics1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:12:07] PROBLEM - YARN NodeManager Node-State on analytics1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:12:11] PROBLEM - YARN NodeManager Node-State on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:12:35] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[12:14:06] RECOVERY - YARN NodeManager Node-State on analytics1049 is OK: OK: YARN NodeManager analytics1049.eqiad.wmnet:8041 Node-State: RUNNING
[12:14:09] RECOVERY - YARN NodeManager Node-State on analytics1048 is OK: OK: YARN NodeManager analytics1048.eqiad.wmnet:8041 Node-State: RUNNING
[12:14:14] RECOVERY - YARN NodeManager Node-State on analytics1056 is OK: OK: YARN NodeManager analytics1056.eqiad.wmnet:8041 Node-State: RUNNING
[12:14:24] RECOVERY - YARN NodeManager Node-State on analytics1053 is OK: OK: YARN NodeManager analytics1053.eqiad.wmnet:8041 Node-State: RUNNING
[12:14:27] RECOVERY - YARN NodeManager Node-State on analytics1052 is OK: OK: YARN NodeManager analytics1052.eqiad.wmnet:8041 Node-State: RUNNING
[12:14:30] RECOVERY - YARN NodeManager Node-State on analytics1046 is OK: OK: YARN NodeManager analytics1046.eqiad.wmnet:8041 Node-State: RUNNING
[12:14:33] RECOVERY - YARN NodeManager Node-State on analytics1029 is OK: OK: YARN NodeManager analytics1029.eqiad.wmnet:8041 Node-State: RUNNING
[12:14:36] RECOVERY - YARN NodeManager Node-State on analytics1050 is OK: OK: YARN NodeManager analytics1050.eqiad.wmnet:8041 Node-State: RUNNING
[12:14:40] RECOVERY - YARN NodeManager Node-State on analytics1038 is OK: OK: YARN NodeManager analytics1038.eqiad.wmnet:8041 Node-State: RUNNING
[12:14:42] HMMM
[12:14:43] uhhh
[12:15:05] RECOVERY - YARN NodeManager Node-State on analytics1036 is OK: OK: YARN NodeManager analytics1036.eqiad.wmnet:8041 Node-State: RUNNING
[12:15:25] PROBLEM - Kafka Broker Server on kafka1020 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties
[12:17:46] RECOVERY - Kafka Broker Server on kafka1020 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties
[12:19:27] RECOVERY - Hadoop HDFS Zookeeper failover controller on analytics1002 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.tools.DFSZKFailoverController
[15:00:45] ottomata: standduppp?
[15:30:45] ottomata: taskingggg
[16:48:46] Analytics-Cluster, Analytics-Kanban, Operations, netops: Open hole in analytics vlan firewall to allow MirrorMaker to talk to main Kafka clusters - https://phabricator.wikimedia.org/T143335#2564917 (Ottomata)
[17:23:41] hii milimetric
[17:23:57] qq on marko's comment here
[17:23:57] https://gerrit.wikimedia.org/r/#/c/301284/15/jsonschema/mediawiki/page/delete/1.yaml
[17:24:03] do you think i should call the is_bot field
[17:24:05] user_is_bot
[17:24:05] ?
[17:49:11] ottomata: yeah, probably better to be consistent, either all user_ or none, and I agree with you that user_ is redundant but consistent with the rest of the schemas.
[17:52:49] ok cool
[17:52:50] done then
[17:52:54] milimetric: i can chat now if you want
[17:52:59] about oozie + sqoop
[17:53:07] sure
[17:53:10] to the cave!
[18:18:36] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Puppetize and deploy MirrorMaker using confluent packages - https://phabricator.wikimedia.org/T134184#2565310 (faidon)
[18:18:40] Analytics-Cluster, Analytics-Kanban, Operations, netops: Open hole in analytics vlan firewall to allow MirrorMaker to talk to main Kafka clusters - https://phabricator.wikimedia.org/T143335#2565307 (faidon) Open→Resolved a:faidon Should be done!
[18:36:31] urandom: hola! another question about compaction if you may
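A minimal sketch of the consistency point from the is_bot / user_is_bot exchange above: if a schema already prefixes user-related fields with user_, an unprefixed is_bot is the odd one out. Only is_bot / user_is_bot comes from the chat; the other field names and the "lint" itself are hypothetical and purely illustrative.

```python
# Hypothetical consistency check, not part of the EventBus schemas or tooling.
# Field names other than is_bot are assumptions made for illustration.
user_fields = ["user_id", "user_text", "user_groups", "is_bot"]

renames = {f: f"user_{f}" for f in user_fields if not f.startswith("user_")}
print(renames)  # {'is_bot': 'user_is_bot'}
```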
[19:00:40] milimetric, will join call shortly
[19:01:38] (np, I'm here)
[19:03:32] nuria_: ask away!
[19:03:54] urandom: this time i have read a bunch of docs..
[19:04:12] nuria_: and you want the last 6 hours of your life back?
[19:04:25] :)
[19:04:34] urandom: haha, it was more like 2 hours but yeah, i kind of get it better now
[19:05:05] urandom: what i do not understand: if we write every piece of data once (i.e. no updates)
[19:05:19] every piece of data (row) should be only in 1 sstable, correct?
[19:06:11] nuria_: umm, no, probably not... let me have a look at your data model real quick
[19:06:34] urandom: would love to know the command to do that
[19:10:25] nuria_: i just looked at the schema (using describe table "local_group_default_T_pageviews_per_article_flat".data; from cqlsh)
[19:11:04] nuria_: so your partition key is ("_domain", project, article, granularity)
[19:12:47] nuria_: so every time you make a write with the same ("_domain", project, article, granularity) you used in a previous write, that'll almost certainly land in a different sstable
[19:13:50] data written weeks ago might have been merged up into say level 3, and when you write a new value today, that'll be flushed into level 0
[19:14:07] which might one day get merged into the same table in level 3
[19:14:47] or might not, if the aforementioned write has already been merged into level 4, say
[19:16:17] urandom: ah i see, the partition determines the sstables, ok
[19:17:23] urandom: I ran some load tests today and ya, compaction doesn't seem to affect throughput and average response times
[19:18:02] nuria_: one alternative to leveled compaction, the default in fact, is size-tiered
[19:18:14] which requires a lot less to keep up
[19:18:29] and leveled does size-tiered within level 0
[19:18:57] so if you really overrun what leveled can do to keep up, you're basically left with size-tiered
[19:19:20] if you overrun that chronically, then things will definitely fall apart :)
[19:20:37] so TL;DR provided level 0's size-tiered compactor can keep up, and provided you have the capacity to catch up on the leveled compaction *eventually*, then leveled might be the way to go
[19:20:41] Analytics, Pageviews-API, Reading-analysis: Suddenly outrageous higher pageviews for main pages - https://phabricator.wikimedia.org/T141506#2565508 (Tbayer) >>! In T141506#2523548, @BBlack wrote: > All the anomalous stuff I'm looking at definitely points at Chrome/41.0.2272.76 on Windows (10, 8, 7),...
[19:21:29] nuria_: i guess what we really need to know is whether leveled can get caught up, and stay that way, once the back fill is done and you're down to your normal daily import
[19:21:50] urandom: given that with size-tiered compaction the number of sstables that data might be spread across hovers at about 10, but we see about 15 in our queries, it seems that changing to leveled compaction was the thing to do, correct?
[19:22:31] leveled will always result in the most optimal reads, if you have the throughput for it
[19:23:06] because yes, reads will always come from the fewest possible sstables
[19:24:04] nuria_: time-windowed is the most optimal for your use case though, algorithm-wise
[19:24:22] nuria_: because your data set is totally ordered
[19:24:51] urandom: ya, but to use that we need a cassandra version that we do not need to spend significant time packaging and it seems that version doesn't exist yet, correct?
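A toy sketch of the sstable point urandom makes above: each memtable flush produces a new SSTable, so repeated daily writes to the same ("_domain", project, article, granularity) partition scatter across many SSTables until compaction merges them. All names, paths, and sizes below are made up for illustration; this is not a model of real Cassandra internals or of LCS levels.

```python
# Toy model: why writes to the *same* partition key land in different SSTables.
FLUSH_EVERY = 3  # pretend the memtable is flushed after this many writes

def flush_into_sstables(writes):
    """Group a stream of (partition_key, value) writes into per-flush 'SSTables'."""
    sstables, memtable = [], []
    for write in writes:
        memtable.append(write)
        if len(memtable) == FLUSH_EVERY:
            sstables.append(memtable)  # flush: the memtable becomes a new SSTable
            memtable = []
    if memtable:
        sstables.append(memtable)
    return sstables

# Daily loads for one article, interleaved with writes for another article.
key = ("analytics.wikimedia.org", "en.wikipedia", "Cassandra", "daily")
other = ("analytics.wikimedia.org", "en.wikipedia", "Hadoop", "daily")
writes = []
for day in range(1, 8):
    writes.append((key, f"2016-08-{day:02d}"))
    writes.append((other, f"2016-08-{day:02d}"))

sstables = flush_into_sstables(writes)
hits = [i for i, table in enumerate(sstables) if any(k == key for k, _ in table)]
print(f"{len(sstables)} sstables; partition {key[2]!r} appears in {len(hits)}: {hits}")
# Until compaction merges these, a read of this partition touches every one of them.
```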
[19:25:10] so you'd get the least number of sstables per read, with the most minimal amount of compaction activity
[19:25:34] nuria_: no, we'd just need to build a jar with the compaction strategy, and add it to the classpath
[19:25:53] nuria_: and btw, you can locally override the compaction strategy to test
[19:26:02] ephemerally
[19:26:34] so you could add it to the classpath, poke something with jmx, and watch what happens
[19:26:48] if the result is not good, just poke it back
[19:27:43] i'd have to look into what this means for leveled, because i think metadata about levels is stored in the tables
[19:27:53] i mean, i know it wouldn't corrupt anything
[19:28:13] i imagine it would force all tables lacking the metadata, those created during the test, back through level 0
[19:28:58] anyway, something to think about
[19:30:20] nuria_: i wonder if we could test this in labs, or restbase staging
[19:30:40] nuria_: do you have a labs env?
[19:36:30] Analytics, Pageviews-API, Reading-analysis: Suddenly outrageous higher pageviews for main pages - https://phabricator.wikimedia.org/T141506#2565537 (Nuria) @Tbayer: you can read plenty details on TLS issues here: https://phabricator.wikimedia.org/T141786 and the several tests the traffic team did on...
[19:36:43] urandom: we do have one for aqs
[19:36:50] urandom: it doesn't have data though
[19:37:51] urandom: and i am not sure of the size of the boxes; without big enough boxes the test might not be so realistic
[19:38:18] urandom: but overriding compaction will need to be done for the whole dataset correct?
[19:38:38] urandom: as you cannot have two different compactions side-by-side
[19:39:49] nuria_: it would be for a table, and on one machine
[19:47:26] Analytics, Pageviews-API, Reading-analysis: Suddenly outrageous higher pageviews for main pages - https://phabricator.wikimedia.org/T141506#2565564 (BBlack) I don't have a lot of firm information really. A lot of what we're going on here is guesses as to the exact mechanism and nature of the broken...
[19:54:48] milimetric: you got a sec for a kasocki brainbounce?
[19:56:34] yes, but in a little bit 'cause i'm talking to Aaron now
[19:56:43] k
[19:59:50] ok ottomata let's do it!
[20:00:00] k
[20:18:54] ottomata, milimetric, mforns : shared blogpost to announce browser dashboards. Might be the most concise blogpost of all time. Feedback welcome.
[20:19:03] :)
[20:19:11] k, will look later today
[20:19:13] nuria_, ok
[21:34:17] (PS23) Milimetric: [WIP] Oozify sqoop import of mediawiki tables [analytics/refinery] - https://gerrit.wikimedia.org/r/303339 (https://phabricator.wikimedia.org/T141476)
[21:34:42] (PS24) Milimetric: Oozify sqoop import of mediawiki tables [analytics/refinery] - https://gerrit.wikimedia.org/r/303339 (https://phabricator.wikimedia.org/T141476)
[21:36:26] ok, ottomata, if you want to take a look at my changes they're between 22 and 23 except for the folder move (so you can see the changes): https://gerrit.wikimedia.org/r/#/c/303339/22..23//COMMIT_MSG
[21:39:33] ottomata: so will the _SUCCESS flag work ok if we're doing that weird swapping thing in the end? Where I save to -temporary and then move to the output? Because it'll just look like the old dataset was deleted and then re-appeared with a _SUCCESS flag already
[21:39:46] maybe writing _SUCCESS should be done in oozie with a shell action?
[21:40:45] anyway, we can chat tomorrow, night!
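A sketch of the "write _SUCCESS from a shell action" idea milimetric raises above, under assumed, hypothetical HDFS paths: move the -temporary directory into place first and only create the _SUCCESS flag once the swap has finished, so downstream Oozie datasets never trigger on a partially swapped directory. This is an illustration of the ordering, not refinery code.

```python
# Publish pattern: swap the temporary import into place, then write _SUCCESS last.
# Paths and the helper are hypothetical; the hdfs dfs flags (-rm -r -f, -mv,
# -touchz) are standard HDFS shell commands.
import subprocess

def hdfs(*args: str) -> None:
    """Run `hdfs dfs <args>` and raise if it fails."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

def publish(temp_dir: str, final_dir: str) -> None:
    hdfs("-rm", "-r", "-f", final_dir)          # drop the previous copy, if any
    hdfs("-mv", temp_dir, final_dir)            # swap the fresh import into place
    hdfs("-touchz", f"{final_dir}/_SUCCESS")    # signal readiness only at the very end

if __name__ == "__main__":
    publish(
        "/wmf/data/raw/mediawiki/example_table-temporary",  # hypothetical path
        "/wmf/data/raw/mediawiki/example_table",             # hypothetical path
    )
```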