[03:25:14] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Update UA parser - https://phabricator.wikimedia.org/T189230#4173485 (10Tbayer) Thanks for working on this! As @Nuria points out in the task description, it looks important for data quality to keep this updated. Back in 2015 (T106134), @dr0ptp4kt suggested... [04:16:45] (03CR) 10Zhuyifei1999: [C: 031] "@Framawiki: Feel free to deploy it." [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/427020 (https://phabricator.wikimedia.org/T117644) (owner: 10Framawiki) [06:51:30] morning! [06:51:52] so good news is that Druid 0.10 seems to be behaving fine on 100[1-3] nodes [06:52:29] coordinator and overlord UI look good [06:53:02] so we could think about moving 100[4-6] to 0.10 [07:26:14] joal: o/ [07:26:35] I tried to remove Java 7 from the druid nodes but of course zookeeper depends on it [07:27:15] so I'd upgrade zookeeper on druid100[123] to the version that we have on conf* hosts [07:27:19] 3.4.9 [07:27:19] with https://gerrit.wikimedia.org/r/#/c/430298/1/hieradata/role/common/druid/analytics/worker.yaml [07:27:27] so we'll be finally ready to [07:27:33] 1) use only java 8 on those nodes [07:27:40] 2) upgrade to Stretch anytime [07:27:52] it should take 10/15 minutes in total [07:31:26] * elukey doing it [07:37:15] * joal watches while elukey does it :) [07:41:55] druid1001-2 done! [07:41:58] (followers) [07:42:05] waiting a bit and then doing 1003 [07:42:08] k [07:42:41] also elukey, there are conf changes we should apply on druid for pivot to work, and to test a new feature (sql querying) [07:42:59] joal: sure!
[07:43:26] elukey: this is why pivot fails (I have no clue why on banners only though ) https://github.com/druid-io/druid/pull/3818 [07:44:14] now the doc: http://druid.io/docs/0.10.0-rc2/development/javascript.html [07:44:48] and for sql: http://druid.io/docs/0.10.0-rc2/querying/sql.html [07:46:00] basically two parameters to set to true in runtime properties file: druid.javascript.enabled = true and druid.sql.enable = true [07:47:38] the sql one looks awesome, the js one a bit less [07:51:36] agreed elukey - When we are sure people are not using pivot anymore (or we turn it off), we'll be able to get rid of it [07:52:08] elukey: have you drained overlord for RT job? [07:52:54] joal: what do you mean? [07:53:20] elukey: I thought the zk change needed druid restart? [07:53:37] nono it doesn't [07:53:44] Ah ok ok [07:53:47] easier then :) [07:53:50] it is like a regular rolling restart of zk [07:53:55] awesome [07:54:21] that is super good so we can see if the new overlord are more resilient to zookeeper changes [07:54:25] :) [07:54:38] very interesting indeed !! [07:56:19] now we'd be fully ready to reimage druid100[1-3] to stretch [07:56:33] we are ready sorry [07:56:41] elukey: Do you want to do it now\./ [07:56:42] ? [07:57:15] nono I think after 0.11 [07:57:27] ok elukey - you're the one to decide :) [07:58:34] joal: do we want to upgrade druid public ? [07:58:49] or do you prefer to test the sql/js druid things on analytics? [07:59:40] elukey: sql + js is only for analytics (I don't think we're gonna make them available for public, or at least not now) [08:00:11] oh yes yes [08:00:14] elukey: Since it's relatively early, I'd go for puvlic upgrade, like that if something gets problematic, we have time to check [08:00:22] ack!
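The two runtime-property flags named above would look like this in a Druid runtime.properties file. Property names are taken verbatim from the chat and the linked 0.10 docs; which daemon's properties file they belong in depends on the deployment, so treat this as a sketch:

```properties
# Druid 0.10 runtime.properties - features discussed above
druid.javascript.enabled = true   # JS aggregators/filters; no sandboxing, hence the caveat
druid.sql.enable = true           # experimental SQL endpoint on the broker
```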
[08:00:26] s/puvlic/public [08:00:46] so I have the prep work ready https://gerrit.wikimedia.org/r/#/c/430296/1/hieradata/role/common/druid/public/worker.yaml [08:01:23] the druid.processing.buffer.sizeBytes reduction is the same that andrew applied to his patch when he attempted the upgrade to 0.10 the first time (Already running in analytics/private) [08:01:39] there was a link about a motivation for this, but it seems to be working fine [08:04:02] works for me elukey - If we see a perf hit, we'll discuss :) [08:04:20] ack [08:04:35] all right merging, running puppet and then upgrading [08:04:42] I'll use the same procedure as yesterday [08:06:42] very interesting thing - from 0.10 onwards there is an experimental feature to merge coordinator and overlord daemons [08:06:51] ack elukey - We should add a post-restart, just to make sure the segment issue doesn't pop up [08:09:13] I am wondering if it was due to middlemanagers not being upgraded when historicals were already on the new version [08:09:34] anyhow, I'll watch closely historical logs, now we know where to look :) [08:10:07] same here elukey - tailing -f historical :) [08:15:48] ok starting with 1004 [08:18:05] ah I might have found why we had an issue yesterday [08:18:44] elukey: ?? [08:19:26] elukey --verbose [08:20:12] (I am upgrading will add more in a sec :P) [08:25:37] elukey: errors on d1004 :( [08:26:23] ack, checking [08:26:35] elukey: same as yesterday [08:26:45] in the historical logs? [08:26:50] yessir [08:27:18] so all the historicals restarted [08:27:40] the thing that I was saying before is that I noticed a weird thing [08:27:50] namely when I install the new package, the systemd unit reloads [08:28:13] but in the logs I don't see every time the full historical bootstrap, namely "loading segments etc.."
[08:28:18] meanwhile after a restart I do [08:30:09] so joal I don't see those weird errors repeating now, and the coordinator says that all segments are loaded [08:30:12] mmmm [08:30:26] maybe it depends on queries? [08:33:38] elukey: I don't think it does :( [08:34:27] anyhow, let's keep going with the upgrade and check logs further on ok? [08:36:52] proceeding with overlords [08:38:04] ack elukey ! [08:40:11] now middlemanagers [08:45:41] doing brokers now, depooling/repooling each time [08:46:57] done! [08:47:00] last ones, coordinators [08:49:20] aaand done [08:50:05] great elukey :) [08:50:13] segments seem fully loaded [08:50:19] I have seen errors on d100[4|5], but everything looks good [08:50:57] indeed, stuff looks good :) [08:51:02] \o/ [08:51:09] We should monitor that no more errors occur though [08:52:04] yep yep [08:52:27] now that I think about it, it would be great to have an alarm on coordinator's segments loaded percentage [08:56:27] true elukey [08:57:26] so basically alarming on [08:57:30] 1) general availability of brokers [08:57:50] 2) percentage of segments loaded dropping below 100% [08:57:56] (for x amount of time) [09:08:17] joal: https://grafana.wikimedia.org/dashboard/db/prometheus-druid?refresh=1m&orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=druid_analytics&var-druid_datasource=All&from=now-2d&to=now&panelId=46&fullscreen [09:08:26] there is always a metric!
:P [09:08:58] * joal bows to the master of metrics [09:11:20] (aggregated by datasource now, a bit clearer) [09:11:56] I think it would be more useful to have the entire dashboard like the analytics hadoop one, all rows collapsed [09:11:59] BUT [09:12:17] myself of the past thought that it was a good idea to split rows belonging to the same daemon in multiple rows [09:12:21] * elukey blames elukey of the past [09:21:30] joal: restyling today :) [09:21:31] https://grafana.wikimedia.org/dashboard/db/druid [09:21:39] renamed the dashboard and collapsed the rows [09:31:02] Many thanks for that elukey :) [09:31:50] I am adding alerts for unavailable segments [09:31:56] warning 5 segments, critical 10 [09:32:08] for both clusters [09:32:42] https://gerrit.wikimedia.org/r/430312 [09:34:32] we have them now :) [09:35:28] very good that we are experiencing these failures before going all in with wikistats [09:36:50] also re-enabled notifications for banner impression [09:41:11] many thanks elukey :) [09:42:28] elukey: do we test js + sql? [09:50:39] joal: sure, what do you think about after lunch? If you have 10 mins now I'd like to have a chat with you on the cave [09:51:07] (in is better) [09:56:03] elukey: OMW ! [10:40:18] * elukey off for a couple of hours! [10:46:15] 10Analytics, 10EventBus, 10JobRunner-Service, 10MediaWiki-Database, and 5 others: Wikimedia\Rdbms\LoadBalancer::{closure}: found writes pending - https://phabricator.wikimedia.org/T191282#4174183 (10jcrespo) Most of the warnings (I suppose the logging was changed to understand which component relates to) s... [11:01:20] 10Analytics, 10EventBus, 10MediaWiki-JobQueue, 10wikitech.wikimedia.org, 10Services (done): Transfer wikitech jobs to Kafka queue - https://phabricator.wikimedia.org/T192361#4174232 (10mobrovac) [11:20:15] (03CR) 10Joal: [V: 032 C: 032] "Merging!"
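The alert thresholds above (warning at 5 unavailable segments, critical at 10) can be pictured as a toy check in Python. This is illustrative only, assuming inclusive thresholds; the real check is whatever Gerrit change 430312 configures in Icinga:

```python
# Toy stand-in for the unavailable-segments alert discussed above.
# Thresholds (warn=5, crit=10) come from the chat; inclusiveness is assumed.
def segment_alert_state(unavailable: int, warn: int = 5, crit: int = 10) -> str:
    """Map a count of unavailable Druid segments to an alert state."""
    if unavailable >= crit:
        return "CRITICAL"
    if unavailable >= warn:
        return "WARNING"
    return "OK"
```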
[analytics/refinery] - 10https://gerrit.wikimedia.org/r/429410 (https://phabricator.wikimedia.org/T188556) (owner: 10Fdans) [11:21:20] thank youuuu joal ! [11:21:28] :) [11:21:32] Hi fdans :) [11:21:48] I'm gonna try to deploy refinery-source again - I hope it'll work [11:22:09] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Add a --dry-run option to the sqoop script - https://phabricator.wikimedia.org/T188556#4012066 (10fdans) [12:11:45] (03PS1) 10Joal: Reverting maven commits after failed deploy (3rd time) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/430355 [12:12:41] (03CR) 10Joal: [V: 032 C: 032] "Merging to attempt deploy again" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/430355 (owner: 10Joal) [12:15:38] man are user agent strings weird these days: [12:15:48] https://usercontent.irccloud-cdn.com/file/2esHQ6qq/Screen%20Shot%202018-05-02%20at%202.13.55%20PM.png [12:16:21] nice one :) [12:19:25] 10Analytics, 10Analytics-Wikistats: Missing stats for Atikamekw Wikipedia on stats.wikimedia.org - https://phabricator.wikimedia.org/T193625#4174447 (10Reedy) [13:20:36] !log restart druid broker on druid100[1-3] to enable druid.sql.enable: true [13:20:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:20:55] joal: --^ [13:25:38] (03CR) 10Milimetric: "pending the switching discussion, just one syntax suggestion" (031 comment) [analytics/aqs] - 10https://gerrit.wikimedia.org/r/429765 (https://phabricator.wikimedia.org/T193387) (owner: 10Joal) [13:26:33] milimetric: o/ [13:26:39] heyo elukey [13:27:07] so druid 0.10 is running on both clusters [13:27:12] everything looks o [13:27:13] ok [13:27:22] I am going to upgrade also zookeeper on druid100[456] [13:27:30] so we'll run only java 8 in there [13:27:52] on druid100[123] http://druid.io/docs/0.10.0/querying/sql.html is enabled too [13:33:23] (03CR) 10Milimetric: "Sorry I missed this patch! Thanks for taking it on." 
(033 comments) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/425925 (https://phabricator.wikimedia.org/T185533) (owner: 10Sturmkrahe) [13:34:17] woah elukey [13:34:19] sql!!! [13:34:26] you're the best secret santa ever [13:35:28] heya team :] [13:37:50] o/ [13:39:10] all right zookeeper upgrade to 3.4.9 everywhere [13:39:26] so in theory we are ready to reimage the druid nodes to stretch anytime [13:41:50] yeehaw!!!! [13:41:51] :) [13:42:26] :) [13:44:08] mforns: I'd love thoughts on https://gerrit.wikimedia.org/r/#/c/430113/ so I can merge/deploy before I'm off Friday [13:44:19] and hi good morning, how are you :) [13:44:21] milimetric, looking :] [13:46:04] Hi milimetric - Just noticed: the datasource for geowiki on druid is using dashes - we should update that for underscores :) [13:46:31] elukey: I confirm SQL on druid works :) Except for datasources with a dash in them :) [13:46:37] Just tried a dummy query [13:48:09] oh balls, sorry I missed that [13:48:14] it's so confusing!!! [13:48:25] milimetric: it completely is ... [13:48:54] milimetric: I hope at some point we'll disable pivot altogether, rename datasources and update superset dashboards ! [13:48:55] 'cause the old datasources have dashes [13:49:39] !log beginning upgrade of kafka-jumbo brokers from 1.0.0 -> 1.1.0 : T193495 [13:49:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:49:44] joal: I mean, we can just rename them, I never thought it was such a big deal to break people's bookmarks. 
We can even customize the pivot routing to replace - with _ [13:49:44] T193495: Upgrade Kafka on jumbo cluster to 1.1.0 (latest) - https://phabricator.wikimedia.org/T193495 [13:50:13] milimetric: let's make sure ZeBoss agrees :) [13:50:53] of course, yeah [13:51:38] milimetric: curl -v -X POST -H'Content-Type: application/json' http://druid1001.eqiad.wmnet:8082/druid/v2/sql/ -d '{"query":"SELECT http_status, COUNT(1) as c FROM webrequest group by http_status order by c desc limit 10"}' [13:51:42] :D [13:51:44] Mwhahahaha ! [13:51:50] This is super awesome [13:53:34] yeah, joal, hm.......... now we can put up an instance in labs and point quarry to it?! [13:54:49] also, wow, we have a LOT of 404s [13:55:00] milimetric: We could - I'd still favor presto for this usage though [13:55:24] yeah, but it might be harder to stand up on labs? [13:55:28] that's up to ops [14:00:58] 10Analytics, 10Analytics-Kanban: Upgrade Kafka on jumbo cluster to 1.1.0 (latest) - https://phabricator.wikimedia.org/T193495#4174671 (10Ottomata) [14:01:21] ottomata: o/ [14:01:25] already started the migration? [14:01:50] there is an openjdk-8 upgrade pending, but I think that we don't have the new deb yet deployed [14:02:00] if we could pack the upgrade with the new jvm it would be awesome [14:02:06] otherwise I'll do it later on [14:02:24] hmmm, i just started with one broker [14:02:31] when can we get the .deb? [14:03:47] ottomata: nevermind, will do it later during the next days [14:03:59] not really extremely urgent :)
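The curl query above can equally be issued from Python with only the standard library. Endpoint and SQL are copied from the chat; the actual network call is left commented out, since the broker is only reachable from inside the cluster:

```python
import json
from urllib import request

# Druid SQL endpoint and query, verbatim from the chat above.
DRUID_SQL = "http://druid1001.eqiad.wmnet:8082/druid/v2/sql/"
payload = json.dumps({
    "query": ("SELECT http_status, COUNT(1) as c FROM webrequest "
              "group by http_status order by c desc limit 10")
}).encode("utf-8")

def build_request() -> request.Request:
    """Build the POST request the curl one-liner sends."""
    return request.Request(DRUID_SQL, data=payload,
                           headers={"Content-Type": "application/json"})

# rows = json.load(request.urlopen(build_request()))  # only works on-cluster
```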
Now with Druid 0.10 javascript is disabled by default, since there is no proper sandboxing.. The current issue is that real time data for banner impression is not working, but I'd rather ask them to migrate to SuperSet rather than enabling JS [14:13:48] (re-enabling JS) [14:14:07] elukey: on that matter - I spent a few minutes with Joseph Seddon, he seems happy with superset :) [14:14:15] * elukey dances [14:14:30] one of Joseph's workers already fixed the issue [14:14:45] (French Joseph!) [14:15:13] then nevermind, I am super happy :) [14:16:55] !log Refinery-source version 0.0.63 finally released to Archiva! [14:16:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:27:04] just added https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid#Handling_alarms_for_unavailable_segments [14:27:26] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Update druid to 0.10 - https://phabricator.wikimedia.org/T164008#4174843 (10elukey) Added https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid#Handling_alarms_for_unavailable_segments [14:28:25] 10Analytics, 10Analytics-Kanban: Upgrade Kafka on jumbo cluster to 1.1.0 (latest) - https://phabricator.wikimedia.org/T193495#4174859 (10Ottomata) [14:29:31] 10Analytics, 10User-Elukey: Upgrade Druid nodes (1001->1006) to Debian Stretch - https://phabricator.wikimedia.org/T192636#4174880 (10elukey) All druid nodes are running Druid 0.10 and zookeeper 3.4.9, we can do the work anytime. 
[14:34:16] stepping away from keyboard for ~30 mins (errand for the new home), ping me if needed [14:35:05] (03PS1) 10Joal: Bump jar versions for v0.0.63 features [analytics/refinery] - 10https://gerrit.wikimedia.org/r/430387 [14:38:30] (03CR) 10Joal: [V: 032 C: 032] "Self merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/430387 (owner: 10Joal) [14:38:38] git up [14:38:40] oops [14:45:50] !log Deploying refinery using Scap [14:45:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:20:34] !log Deploying refinery to hadoop [15:20:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:21:53] 10Analytics-Kanban, 10Analytics-Wikistats: Reindex mediawiki_history_reduced with lookups - https://phabricator.wikimedia.org/T193650#4175147 (10Milimetric) [15:22:05] 10Analytics-Kanban, 10Analytics-Wikistats: Reindex mediawiki_history_reduced with lookups - https://phabricator.wikimedia.org/T193650#4175159 (10Milimetric) p:05Triage>03Normal [15:30:32] ottomata: since you are upgrading, do you want to skip ops sync? [15:30:55] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Upgrade Kafka on jumbo cluster to 1.1.0 (latest) - https://phabricator.wikimedia.org/T193495#4175190 (10Ottomata) [15:31:02] hm, elukey i don't have too much to talk about, could do [15:31:03] joal? [15:31:08] works for me [15:31:10] k! [15:31:13] b [15:33:36] ottomata: there is only one task that we should discuss but not really extremely urgent [15:34:01] ok [15:34:05] anyhow, let's skip it :) [15:34:06] which one? irc discuss? 
[15:34:08] :) [15:38:55] mforns + joal: I was so confused because this data is so broken, the dimension is actually page not revision, so it would look like this: [15:38:59] https://www.irccloud.com/pastebin/m22hU0sm/ [15:40:04] so I was definitely saying some things that were wrong, my apologies [15:40:05] makes sense milimetric [15:40:22] but in my brain somewhere there was something going, no, this can't be, facts don't change [15:40:23] :) [15:40:45] no, no, I think at some point we all were flying in the mayonnaise [15:41:02] ahahaha, I should've finished that photoshop picture of that, such an awesome saying [15:41:32] you started a photoshop pic of that? lol [15:42:04] yeah, I found a bowl of mayo and took your staff pic, but it never looked like you were flying [15:42:11] * joal now dreams of mayonnaise flight [15:42:20] we clearly need to properly stage this next time we see each other [15:42:30] I'll get a giant bowl and start collecting eggs and oil [15:42:57] xDDDDD [15:43:53] it's literally translated from a brasilian saying [15:44:03] *brazilian [15:45:03] elukey: have a minute for batcave? [15:45:10] joal: sure! [15:50:13] mforns,nuria_: the differences in ua parsing between the previous version and the current are super interesting, they extend much much further than just Windows versioning :D [15:50:30] fdans, cooool :] [15:50:33] fdans: what are other examples? [15:50:54] although... this means more work :'( [15:51:00] oh yeah [15:51:08] nuria_: let's talk in standup about it :) [15:52:03] fdans: sure, you can get 1 day of navigation timing data and apply new and old library to see what differences result, that way the perf team does not need to do that same work [15:53:10] nuria_: I've been getting samples of UAs from webrequest and running the old version of ua_parser and the new one on them [15:56:52] fdans: the python client right? [15:56:58] yes [15:57:08] fdans: for how much traffic, an hour? [15:57:18] a day! [15:57:33] fdans: sounds good.
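The comparison fdans describes amounts to running both parser versions over sampled user agents and keeping the disagreements. A minimal sketch, with `parse_old`/`parse_new` as hypothetical stand-ins for the two ua-parser versions (with the real library each would be something like `user_agent_parser.Parse` from the respective version):

```python
# Generic old-vs-new parser diff: keep only UAs the two versions disagree on.
# parse_old / parse_new are hypothetical callables, not actual ua-parser APIs.
def diff_parsers(user_agents, parse_old, parse_new):
    """Return (ua, old_result, new_result) for every disagreement."""
    return [(ua, parse_old(ua), parse_new(ua))
            for ua in user_agents
            if parse_old(ua) != parse_new(ua)]
```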
[16:01:59] (03CR) 10Mforns: [V: 032 C: 032] "LGTM!" (031 comment) [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/430113 (https://phabricator.wikimedia.org/T126279) (owner: 10Milimetric) [16:02:12] ping milimetric standup [16:05:27] !log Restart oozie webrequest bundle after deploy [16:05:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:11:00] PROBLEM - EventLogging overall insertion rate from MySQL consumer on graphite1001 is CRITICAL: CRITICAL: 40.00% of data under the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=12&fullscreen&orgId=1 [16:11:13] h! [16:11:14] hm! [16:11:22] probably due to cluster restarts? checking... [16:12:06] https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=12&fullscreen&orgId=1 [16:12:09] doesn't look good [16:12:42] yeah strange. [16:12:48] it looks stuck but not sure why, the other consumers are fine. [16:12:49] bouncing it [16:13:00] May 2 15:43:53 eventlog1002 eventlogging-consumer@mysql-m4-master-00[29850]: 2018-05-02 15:43:53,098 [29850] (MainThread) Error sending OffsetCommitRequest_v2 to node 1003 [ConnectionError: socket disconnected] [16:13:29] now it seems to work [16:13:31] weeeird [16:14:05] !log bounced eventlogging-consumer@mysql-m4-master-00 after kafka jumbo 1.1.0 upgrade [16:14:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:14:37] ( elukey: i have to admit, i was reluctant to go away from upstart for EL, but this is def better :) _ [16:14:39] ) [16:15:33] \o/ [16:17:07] !log Restart oozie mediawiki-history-denormalize job after deploy [16:17:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:24:08] 10Analytics, 10Analytics-Wikistats: Reindex mediawiki_history_reduced with lookups - https://phabricator.wikimedia.org/T193650#4175295 (10Milimetric) [16:25:25] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Services (watching): Upgrade Kafka on
main cluster with security features - https://phabricator.wikimedia.org/T167039#4175301 (10Ottomata) [16:27:46] !log 2018-05-02T14 webrequest dataloss warnings have been checked and are false positives [16:27:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:35:12] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Services (watching): Upgrade Kafka on main cluster with security features - https://phabricator.wikimedia.org/T167039#4175324 (10Ottomata) [16:38:10] (03PS1) 10Joal: Add id.wikimedia to the pageview whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/430407 [16:38:12] RECOVERY - EventLogging overall insertion rate from MySQL consumer on graphite1001 is OK: OK: Less than 20.00% under the threshold [50.0] https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=12&fullscreen&orgId=1 [16:41:50] !log Manually silence pageview-whitelist alarm overwriting /wmf/refinery/current/static_data/pageview/whitelist/whitelist.tsv [16:41:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:50:21] (fyi i am reimaging an41 and an42) [16:52:54] (03CR) 10Nuria: [V: 032 C: 032] Add id.wikimedia to the pageview whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/430407 (owner: 10Joal) [16:59:30] ottomata: ack! [16:59:40] I wanted to do two but will skip for today :) [17:00:27] :) [17:01:36] 10Analytics, 10Analytics-Dashiki, 10Analytics-Kanban, 10Patch-For-Review: Add pivot parameter to tabular layout graphs - https://phabricator.wikimedia.org/T126279#2009951 (10Milimetric) @CCicalese_WMF : the pivoting thing is done, let me know when you want to update your reports/graphs and for reference he... [17:02:21] elukey, milimetric , mforns , joal, ottomata fdans super short staff? 
[17:02:32] we're still in da cave nuria [17:03:14] ook [17:03:49] ping mforns [17:14:04] joal,ottomata - as FYI I am working with Services and Filippo to https://gerrit.wikimedia.org/r/#/c/430399/, that is renaming/labeling cassandra 2.x metrics to 3.x (basically s/columnfamily/table) on the prometheus side to share dashboards [17:14:41] so metrics reported by the prometheus jmx agent will not be the final ones on prometheus [17:16:09] k cool [17:19:31] logging off!! byeee [17:33:53] !log Rerun webrequest-load-wf-text-2018-5-2-15 [17:33:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:48:28] PROBLEM - Hadoop NodeManager on analytics1041 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [17:48:44] ottomata: is that you --^? [17:48:48] oh yes! [17:48:52] but it should be downtimed! [17:48:52] ok :) [17:48:58] PROBLEM - Hadoop NodeManager on analytics1042 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [17:48:59] hmm maybe i took too long... [17:49:09] oh yes [17:49:11] def i did [17:49:16] sorry, re downtimed [17:49:37] np - Thanks for reimaging those beasts [17:50:33] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4175642 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['analytics1042.eqiad.wmnet'] ``` T... [17:50:35] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4175643 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['analytics1041.eqiad.wmnet'] ``` T...
[17:50:56] I like today's XKCD :) https://xkcd.com/1988/ [17:53:04] haha [17:56:24] (03PS1) 10Zhuyifei1999: Set CELERYD_PREFETCH_MULTIPLIER to 1 [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/430422 [18:00:19] (03CR) 10Zhuyifei1999: [C: 032] "I'm self-reviewing this because things got really bad (oldest queued query was around 40mins) just now..." [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/430422 (owner: 10Zhuyifei1999) [18:00:46] (03Merged) 10jenkins-bot: Set CELERYD_PREFETCH_MULTIPLIER to 1 [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/430422 (owner: 10Zhuyifei1999) [18:10:48] woah nuria_ turnilo! [18:12:40] milimetric: did you see nuria's email? [18:16:27] not yet - lookin [18:16:51] (03PS4) 10Nuria: [WIP] UA parser specification changes for OS version [analytics/ua-parser/uap-java] (wmf) - 10https://gerrit.wikimedia.org/r/429527 (https://phabricator.wikimedia.org/T189230) [18:17:53] (03CR) 10Framawiki: [C: 032] Add export to HTML [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/427020 (https://phabricator.wikimedia.org/T117644) (owner: 10Framawiki) [18:17:57] (03Merged) 10jenkins-bot: Add export to HTML [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/427020 (https://phabricator.wikimedia.org/T117644) (owner: 10Framawiki) [18:19:32] (03PS5) 10Nuria: [WIP] UA parser specification changes for OS version [analytics/ua-parser/uap-java] (wmf) - 10https://gerrit.wikimedia.org/r/429527 (https://phabricator.wikimedia.org/T189230) [18:20:48] ottomata: via faidon [18:20:54] ottomata: it is new as of jan [18:20:57] cc milimetric [18:21:07] the other one was dead [18:21:08] dead [18:21:37] nuria_: yeah, it's exciting but I'm cautious because we got burned before [18:22:11] I mean I guess we have nothing to lose if we install it and it goes kaput we can keep superset running [18:22:29] and pass on the woes of OSS to our customers :) [18:24:07] 10Quarry, 10Patch-For-Review: Add 'download in HTML format' option (Quarry) 
- https://phabricator.wikimedia.org/T117644#4175766 (10Framawiki) 05Open>03Resolved [18:47:29] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4176048 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1041.eqiad.wmnet'] ``` and were **ALL** successful. [18:47:46] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4176049 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1042.eqiad.wmnet'] ``` and were **ALL** successful. [19:05:51] Hi ottomata ! Do you have a minute for a question? [19:06:33] chelsyx: heeeyaa [19:06:36] for sure! [19:06:37] what's up? [19:07:20] ottomata: should we use `enum` whenever possible? what's the benefit? [19:08:03] Also want to confirm with you that adding/deleting items in `enum` won't break anything [19:08:51] chelsyx: the only benefit i can see to enum is you get validation of the values in the field. [19:08:56] on the hive side of things, it will be a string [19:09:15] since right now (in hive refine logic) the schema is inferred from the data, not the eventlogging jsonschema [19:12:06] ottomata: Got you. Can I use `enum` for an optional field?
[19:15:47] yes that's fine [19:15:58] chelsyx: the way the stuff on the hive side works (for now): [19:16:16] the code runs, and loads a Spark DataFrame using the .json dataframe reader [19:16:24] this passes over an hour's worth of data [19:16:32] and merges the fields of every record to build a dataframe [19:16:47] so if one record is missing a field that another has, that field will be added to the dataframe, and the records with missing data will have that set to null [19:17:09] if there is something that was encountered in a past hour [19:17:17] the hive table schema is 'merged' with the incoming dataframe schema [19:17:34] adding any missing fields from the hive table schema to the incoming one, and setting them to null [19:17:56] ottomata: Thanks! [19:18:33] ottomata: another question: I want to change all integer to double to allow more flexibility, is that a bad idea? [19:19:02] in general yes, very bad [19:19:13] the refine code might be lenient enough now though... [19:19:19] we did a lot of work to make it not die [19:19:24] but it might do weird things? [19:19:32] i think it runs a spark sql CAST...joal, right? [19:20:56] ottomata, chelsyx: I think if you try to change an INT to DOUBLE, the system will keep an int, and try to cast your double into an INT [19:21:05] oh riiight [19:21:07] that's right [19:21:15] the earlier field type on the table is kept [19:21:23] correct ottomata [19:21:28] joal: we could manually alter the hive table field [19:21:37] ottomata: very doable [19:21:40] then future data would cast any seen ints to doubles [19:21:46] ottomata: I wonder how hive / parquet would react [19:21:53] i think i've done it before...buuut yeah [19:21:57] how would selects on old data fare? [19:21:59] probably not well.
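The field-merging behaviour ottomata describes above (union of fields across records, missing ones set to null) can be mimicked in a few lines of plain Python. This is a sketch of the behaviour only, not the actual Spark refine code:

```python
# Mimic of the refine step's schema inference: the union of all field
# names becomes the "schema", and records missing a field get None (null).
def merge_records(records):
    """Fill every record out to the union of all observed fields."""
    fields = sorted({k for r in records for k in r})
    return [{f: r.get(f) for f in fields} for r in records]
```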
[19:22:05] i think, i've only tested that new data does the right thing [19:22:06] I've not played enough with those changes to actually know [19:22:21] chelsyx: i would guess that we could make it work, but we'd have to manually alter the hive table [19:22:36] and, it is likely that old data would no longer be accessible through hive [19:22:41] but we aren't sure about that [19:23:12] ottomata joal: I don't have any data in the table now https://meta.wikimedia.org/wiki/Schema:MobileWikiAppiOSReadingLists. [19:23:26] for a generic field like `measure` [19:23:42] chelsyx: oh if you don't have any data yet [19:23:48] like, the event. hive table doesn't exist [19:23:49] I'm thinking about setting the type to double, although I expect most of the values to be integers [19:23:52] then you can change away :p [19:24:22] ottomata: yeah, I'm asking is there any downside for storing integer as double [19:24:27] chelsyx: if you ever think you will have doubles, you should go with doubles [19:24:33] chelsyx: if no data exist, we should double check the table has not been created before you change, but yeah, no data == no table for us :) [19:24:54] chelsyx: only downside is...smaller max integer? :p [19:25:10] maybe slightly less efficient for certain arithmetic? [19:25:13] working with INTs in a double type is actually dou-able [19:25:15] we won't notice though [19:25:21] joal is punning! [19:25:31] this doesn't happen that much :-P [19:25:35] haha [19:25:36] hahaha [19:25:44] ottomata joal: Got it. Thanks! [19:31:18] Gone for tonight team - See you tomorrow! [19:32:47] laters!
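The hazard joal points out above, where the table keeps INT and incoming doubles get cast down, is easy to illustrate. `cast_like_hive_int` is a hypothetical plain-Python stand-in for Spark SQL's `CAST(x AS INT)`, assuming truncation toward zero, which is the Hive/Spark behaviour:

```python
# Stand-in for CAST(double AS INT): the fractional part is silently dropped.
def cast_like_hive_int(value: float) -> int:
    """Truncate toward zero, like the Spark SQL cast the refine code runs."""
    return int(value)
```

So if the Hive column were left as INT, an event logging `measure = 2.9` would land in the table as 2, which is exactly why widening the type before any data exists is the safe move.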
[20:09:15] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade Kafka on main cluster with security features - https://phabricator.wikimedia.org/T167039#4176262 (10Ottomata) [20:09:50] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade Kafka on main cluster with security features - https://phabricator.wikimedia.org/T167039#3315367 (10Ottomata) [21:13:56] (03PS1) 10Framawiki: Remove call to nonexistent list.js [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/430495 [21:53:56] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade Kafka on main cluster with security features - https://phabricator.wikimedia.org/T167039#4176640 (10Ottomata) [21:55:05] (03CR) 10Zhuyifei1999: [C: 031] Remove call to nonexistent list.js [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/430495 (owner: 10Framawiki) [21:59:38] 10Analytics, 10Research-Archive: geowiki data for Global Innovation Index - 2017 - https://phabricator.wikimedia.org/T178183#4176726 (10DarTar) 05Open>03Resolved