[07:35:30] o/ [07:35:47] oozie emails for consistency checks dropped considerably \o/ [07:36:21] it was during the weekend though, let's keep it monitored during these days, but it looks promising [07:39:31] * elukey commutes to the office [08:33:48] elukey: Yay ! [08:33:54] elukey: congrats :)n [08:40:18] \o/ [08:40:32] next step is the timeout increase, and we should be done [08:41:08] elukey: Awesome :) [08:41:21] elukey: next steps will be on oozie side (probably for me) [08:41:56] I can do it if you want as low priority task after the timeout [08:43:21] * elukey brb! coffee because my brain does not work this morning [08:43:34] elukey: ir will depend on when we need it finished (based on upload migration to v4) [08:43:43] yep yep [08:43:54] elukey: take your coffee time :) [09:07:24] Analytics, Revision-Slider, TCB-Team, WMDE-Analytics-Engineering, and 2 others: Data need: Explore range of article revision comparisons - https://phabricator.wikimedia.org/T134861#2446685 (Addshore) [09:08:01] (PS8) Joal: Add Druid loading oozie jobs [analytics/refinery] - https://gerrit.wikimedia.org/r/298131 (https://phabricator.wikimedia.org/T138264) [09:55:04] Analytics-Kanban, Datasets-Webstatscollector, RESTBase-Cassandra, Patch-For-Review: Better response times on AQS (Pageview API mostly) {melc} - https://phabricator.wikimedia.org/T124314#2446753 (elukey) Summary of the last months of work: * The new AQS cluster has been built and it is up and run... [10:05:17] Morning all! [10:05:45] Heya addshore [10:05:50] o/ [10:23:43] joal: what time does otto notramlly appear? :) [10:23:45] *normally [10:23:58] addshore: 9/10 AM NYC time usually [10:24:04] addshore: usually around 1pm our time :) [10:24:13] cool! [10:24:20] joal: time to chat about vk? [10:24:34] elukey: give me 5 minutes [10:24:59] joal: sure sure.. I meant "chat" as reading my ramblings on irc :) [10:25:05] huhu [10:25:19] elukey: let's batcave in a minute, might be easier? [10:25:37] nah don't worry, writing will help my thinking process [10:25:38] :P [10:25:49] elukey: ok, you can write then :) [10:26:41] Soooo I checked again how VSL manages memory, to establish the effect of the -T timeout (default 120 sec) and -L limit (default 1000 transactions) new parameters for VK [10:27:22] the shmlog is of course not touched by any VSL setting because each daemon (varnishlog, varnishkafka, etc..) can potentially use a different setting [10:28:09] VSL instead gives the possibility to whoever uses it to query the shmlog for a data not read [10:28:31] that will be stored in the process [10:28:34] memory [10:29:06] two options are meant to tune a bit this feature [10:29:56] 1) -T timeout (default 120 sec) - the maximum time between a Start and End timestamp before VSL forces it to "complete". VSL tag carrying the "timeout" error value [10:31:05] 2) -L transaction_limit (default 1000) - limits the total amount of incomplete transactions that a process using VSL is willing to keep in memory before forcing them to completion. VSL tag carrying the "store overflow" error value. [10:31:56] when the -L limit is reached, the oldest transaction not yet completed is forced to do so [10:32:12] in order to avoid breaching the limit [10:32:27] so we have now only hit 1) but not 2) [10:32:51] increasing the timeout of course can remove occurrences of 1) creating new beasts of type 2) [10:33:06] but probably on very very busy servers [10:33:14] (upload here we come!) 
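For reference, the two knobs described above are the standard Varnish 4 shared-log (VSL) options; a minimal sketch with assumed values (not the production settings), shown here via varnishlog since varnishkafka exposes the same VSL settings:

```
# -T: force transactions open longer than N seconds to complete (default 120)
# -L: cap the number of incomplete transactions kept in memory (default 1000)
varnishlog -g request -T 240 -L 2000
```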
[10:35:17] Analytics-Kanban, EventBus, Patch-For-Review: Propose evolution of Mediawiki EventBus schemas to match needed data for Analytics need - https://phabricator.wikimedia.org/T134502#2446812 (JAllemandou) Patch updated, see my comments in gerrit. [10:35:46] does it make any sense? [10:37:01] maybe we'll need to tune -T and -L for upload and text [10:41:51] in the meantime, I am going to update the phab task [10:47:57] elukey: It does make sense indeed [10:48:11] elukey: Thanks for the cassandra ticket update :) [10:48:44] joal: really sorry that I forgot to update it :( [10:48:55] elukey: no bother, done now ;) [10:49:10] elukey: Do we go and deploy aqs for milimetric's patch? [10:51:41] Analytics-Kanban, Operations, Traffic, Patch-For-Review: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2446844 (elukey) I checked again how VSL manages memory to establish the effect of the -T timeout (default 120 sec) and -L limit (... [10:52:59] joal: sure [10:53:11] ok elukey, merging the patch [10:53:24] (CR) Joal: [C: 2 V: 2] Upgrade Node version to 4.4.6 [analytics/aqs] - https://gerrit.wikimedia.org/r/297807 (https://phabricator.wikimedia.org/T139493) (owner: Milimetric) [10:53:34] gogogo [10:53:39] :D [10:57:00] (PS1) Joal: Update aqs to 131e9f6 [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/298267 [10:58:00] elukey: can you merge that one? [10:58:35] Oh by the way elukey, have we updated node on beta? [11:00:22] I don't think so [11:00:25] good point [11:04:28] joal: is https://gerrit.wikimedia.org/r/#/c/298267/1/node_modules/hyperswitch/node_modules/preq/index.js shipped with the new dependencies or is it our change? [11:06:00] elukey: See https://gerrit.wikimedia.org/r/297807 sha and /src submodule sha in https://gerrit.wikimedia.org/r/#/c/298267 [11:06:59] elukey: So normally our patch is there [11:07:36] elukey: And from what I have seen at deploy repo update, node 4.4.6 has been downloaded [11:08:08] yeah but the patch is not mentioned in the commit [11:08:17] should it be? This was my question [11:08:24] Which pathc [11:08:25] ? [11:09:47] maybe we are not understanding each other [11:10:01] Seems you're right :) [11:10:01] I saw https://gerrit.wikimedia.org/r/#/c/298267/1/node_modules/hyperswitch/node_modules/preq/index.js and it seemed a patch that we did and that will be deployed [11:10:09] not part of the 4.4.6 upgrade [11:10:23] so I was asking if we need to mention it in the commit log [11:10:29] or if it is part of the upgrade [11:10:43] hm, I actually don't know elukey [11:11:00] elukey: I think it's part of the module dependencies upgrade [11:12:10] elukey: --> In node_modules folder, so not our code [11:12:59] elukey: node_modules are updated by the deploy script when we release new code [11:12:59] ahhh good [11:13:04] lgtm then [11:13:19] elukey: Yay ! [11:13:54] (CR) Elukey: [C: 2 V: 2] "Reviewed with Joseph on IRC, all the changes are related to node upgrade. lgtm even if I don't have tons of knowledge in AQS sw internals." [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/298267 (owner: Joal) [11:14:26] now I need to upgrade AQS Beta [11:14:43] elukey: Would be good, like that I can test-deploy over there :) [11:15:33] do you have the beta host name? [11:15:44] I can't find it and I don't have it on my history :( [11:16:20] ah in the scap repo [11:16:39] elukey: good idea ! [11:16:50] deployment-aqs01.deployment-prep.eqiad.wmflabs [11:18:18] done! [11:18:33] Thanks elukey ! 
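The check joal points elukey at above boils down to comparing shas; something like this (an assumed workflow, run in a checkout of analytics/aqs/deploy):

```
git log -1 --oneline         # the bump commit, e.g. "Update aqs to 131e9f6"
git submodule status src     # prints the sha the src submodule points at,
                             # which should match the merged analytics/aqs change
```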
[11:18:49] !log deploying aqs on deployment-prep [11:22:05] !log Successful deployment in beta - Deploying aqs on aqs1001 as canary [11:24:18] elukey: aqs1001 deployed, everything fine so far [11:24:31] elukey: do you want me to wait some time before deploying on 2 and 3? [11:24:41] elukey: and actually 4-5-6 [11:32:22] joal: you can proceed with 456 imho but maybe it would be wise to wait a bit for 23 [11:32:59] k elukey [11:33:33] !log Deploying aqs on aqs100[456] (new cluster, no traffic) [11:39:21] Arf elukey, I think I messed up :( [11:39:30] elukey: issue deploying on aqs1004 [11:42:46] joal: sure, what happened? [11:43:11] elukey: Stage 'promote' failed on group 'default' [11:43:13] :( [11:43:27] elukey: from what I see scap didn't manage to restart services [11:44:18] ah yes let me silence the alarms first [11:44:22] then I'll check [11:45:18] elukey: Could it have been because I used aqs1004 and not aqs1004-a? [11:45:24] so 1001 went fine and 1004 didn't [11:45:25] weird [11:46:01] well that one should be Cassandra no? [11:46:10] in this case aqs seems problematic [11:46:16] ok silenced [11:52:51] joal: have you tried to roll back by any chance? [11:52:56] just to see if it fixes the problem [11:53:16] I believe that we'd need to remove logstash and restart the process to see what is happening [11:53:23] elukey: scap told me it rolled back, but I didn't do anything more [11:53:26] I can't find any good logs and even logstash is weird [11:53:31] should I try a scap rollback? [11:53:56] no no checking [11:54:23] elukey: sure, sorry :( [11:56:56] nah this is good so we find good ops practices, it is only a testing env luckily :) [11:57:10] the nodejs processes are respawning like crazy [11:57:17] but I don't know where the logs go [11:58:31] elukey: probably because of the logging param in config file, no log is written locally [11:58:58] elukey: if you want logs, I think you need to stop, modify manually the config file (to add logs), then restart [11:59:21] elukey: I think that's what we did the last time we had issues (IIRC) [11:59:25] that is completely insane [11:59:55] I mean, insane from a production ops point of view [12:00:31] I believe that in this case the processes are not even able to start, respawning like crazy, and logs are not sent to logstash [12:00:55] elukey: completely possible [12:02:04] {"name":"aqs","hostname":"aqs1004","pid":78,"level":60,"status":400,"err":{"message":"400: bad_request","name":"HTTPError","stack":"Error: Schema change, but no version increment.\n [12:02:08] joal --^ [12:02:46] elukey: wow [12:02:53] at confChanged (/srv/deployment/analytics/aqs/a81c8323a612001d/node_modules/bluebird/js/release/promise.js:504:31)\n at Promise._settlePromise ...etc.. [12:03:39] elukey: Arrrrrf [12:03:53] elukey: I remember now !!! [12:04:01] elukey: We had this issue a while ago already [12:04:28] elukey: aqs100[456] has a new schema, therefore different code [12:04:35] elukey: pffff [12:04:50] elukey: I'm very sorry not to have remembered that :( [12:05:35] elukey: How should we proceed? [12:06:18] elukey: We need a manual code change in aqs to restore state in 1004, but it's now very different from 1005-6 [12:08:23] elukey: Do we roll back 1004 manually?
[12:11:00] elukey: I think the only thing needed is to change links in deploy folders [12:11:20] elukey: I see in ops chan you're busy repairing MW, let me know when you're back on aqs [12:14:51] yep sorry I am back, mw1261 was the host with my custom apache2 pkg and I was scared :) [12:15:08] np elukey, live traffic first ! [12:16:40] elukey: I let you read my thoughts above [12:17:31] elukey: 2 ways to go: either roll back 1004, or roll forward 5 and 6 and patch them after [12:17:46] nah I'd say to roll back 1004 [12:17:53] elukey: good for me [12:18:24] elukey: I think easiest is to stop aqs, then change links in /srv/deployment/analytics/aqs/deploy-cache [12:18:43] and /srv/deployment/analytics/aqs [12:18:54] ahhh you need that rollback [12:18:57] now I remember [12:19:18] elukey: You recall the manual change we did - That's what we still need [12:19:43] elukey: the folder we need: a38e4d78718b072a70514477c3b268baaf8e1d29 [12:19:52] yeah now is current -> revs/81d44a2aab7f009d362e5874a81c8323a612001d [12:20:00] yup [12:33:18] joal: 1004 up and running [12:33:38] restoring logstash [12:34:11] elukey: Thanks mate ! [12:34:14] but we haven't received [12:34:15] 14:33 RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy [12:34:18] 14:33 RECOVERY - AQS root url on aqs1004 is OK: HTTP OK: HTTP/1.1 200 - 727 bytes in 0.007 second response time [12:34:28] so I am sad since alarming is not working as it should [12:34:29] :/ [12:34:39] elukey: I hear you [12:35:10] elukey: I promise I'll try to remember not to deploy aqs100[456] next time :( [12:39:51] nah no problem, we'd need to remember it before going to prod :) [12:40:07] elukey: sure we do ! [12:42:56] a-team - I'm AFK for a while [13:20:35] team I updated https://gerrit.wikimedia.org/r/#/c/295652/8 (varnishkafka's code review) [13:20:44] plus the task [13:37:55] good morning ottomata :) [13:38:36] good mornin! [13:42:06] o/ [13:42:42] ottomata: whenever you have time I have some ramblings about vk to talk about [13:43:16] k gimme a few, am almost through emails... [13:43:27] nono even in a couple of hours [13:43:35] nothing urgent [13:45:50] Analytics: Add monthly request stats per article title to pageview api - https://phabricator.wikimedia.org/T139934#2447192 (ezachte) [13:50:21] Would it be possible for you to try and look at https://gerrit.wikimedia.org/r/#/c/269467/ again today ottomata ? :) It would be great to try and push it forward a bit! [13:52:50] ok addshore :) [13:53:01] awesome! :) [13:54:00] One thought I had was rather than calling it wmde_analytics was to call it something generic such as metric_crons, as essentially all it contains is pulling git repos that have scripts, running the scripts on a cron, and may not necessarily be specific to wmde! [13:56:14] hm, ja this is very similar to a lot of the stuff in the statistics module [13:56:47] I mean, I guess I could bundle it into the statistics module too? [13:57:03] ja maybe... [14:00:30] Analytics, MediaWiki-extensions-CentralNotice, Operations, Traffic: Generate a list of junk CN cookies being sent by clients - https://phabricator.wikimedia.org/T132374#2447244 (BBlack) Yes, we can help wipe these out at the Varnish layer, by unsetting blacklisted cookies we see. We've done that... [14:01:37] addshore: i think that might be a good idea...this will likely only ever run on stat type boxes in prod, ja? [14:01:40] stat1002/stat1003? [14:01:45] yeh! [14:01:53] i guess since it needs hive, you intend to run it on stat1002?
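A sketch of the manual rollback elukey and joal walk through between 12:18 and 12:33 above; the service name and the exact scap3 directory layout are assumptions, and only the paths and rev shas mentioned in the conversation are taken as given:

```
# on aqs1004, roughly what "change links in the deploy folders" amounts to:
sudo service aqs stop
cd /srv/deployment/analytics/aqs/deploy-cache
# point 'current' back at the previously working rev instead of the newly promoted one
sudo ln -sfn revs/a38e4d78718b072a70514477c3b268baaf8e1d29 current
sudo service aqs start
```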
[14:02:03] Yeh, well, some things need hive, some don't [14:02:10] aye [14:02:16] but you are running them all together [14:02:18] in a single cron script [14:02:20] yeah? [14:02:25] well, 2 [14:02:28] daily, minutely [14:02:29] That was going to be the plan :) [14:03:01] and yeh, everything they need is currently on stat1002, although there is no reason I oculdn't split it up over the boxes [14:03:06] aye [14:03:16] ha, ok, so my biggest bikeshed was going to be the module name...but i think if we move this to statistics module, we can get around that [14:03:29] okay, I'll do that then! :) [14:03:32] ok [14:03:34] and [14:03:37] since it will already be in statistics module [14:03:37] got any ideas for the name within the stats module? ;) [14:03:41] ja [14:03:45] let's just call it [14:03:46] statistics::wmde [14:03:48] i think that's ok [14:03:58] Yup, cool! [14:04:08] the analytics-wmde user name is fine and can stay i think [14:04:52] Analytics: Add monthly request stats per article title to pageview api - https://phabricator.wikimedia.org/T139934#2447255 (Magnus) An alternative might be a sqlite3 database, per project. Since those are "serverless", all that would be required is code/CPU for generating the files, and disk space. Vanilla s... [14:06:19] addshore: and or your secrets, i can put them in ops private repo [14:06:20] but [14:06:26] we've been using hiera more for this recently [14:06:32] so [14:06:44] you should make variables in your puppet class that look up the secrets from hiera [14:06:52] and then render those variables in your config.erb template [14:06:55] like [14:07:01] hmmm [14:07:07] actually you could make it a single var [14:07:08] maybe [14:07:35] since your config file is just key/value pairs of secrets [14:08:07] $wmde_secrets = hiera('wmde_secrets') [14:08:13] where wmde_secrets in hiera is something like [14:08:42] wmde_secrets: [14:08:42] facebook: XXXXX [14:08:42] google: 12345 [14:08:42] mm-wikidata-pass: YYYY [14:08:42] ... [14:08:49] and then in the config.erb template [14:10:32] <% @wmde_secrets.sort.each do |k| -%> [14:10:32] <%= k %> <%= @wmde_secrets[k] %> [14:10:32] <% end -%> [14:10:38] something like that [14:11:42] okay! [14:12:00] (PS2) Addshore: Prepare for puppet stuff [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/287622 [14:12:24] (PS3) Addshore: Prepare for puppet stuff [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/287622 (https://phabricator.wikimedia.org/T125989) [14:12:33] (CR) Addshore: [C: 2 V: 2] Prepare for puppet stuff [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/287622 (https://phabricator.wikimedia.org/T125989) (owner: Addshore) [14:15:26] ottomata: there is another version up for you to take a look at! :) [14:21:08] addshore: cool, 2 small inline comments, i think we can try this... [14:21:45] actually, addshore, will you still be online in 3ish hours? I have a bunch meetings coming up and ihave to prep for one a bit [14:21:56] yup! [14:22:48] k ping me around just in case i forget ( iwill try not too!) [14:22:56] I'll try not too too! [14:32:08] ottomata: added your suggestion thanks! also added some comments in https://phabricator.wikimedia.org/T136314#2446844 [14:32:28] I remember that we had the conversation about -T and -L during a previous standup [14:32:42] I was able to force them on my Vagrant VM [14:34:18] aye [14:34:28] elukey: aye hey! what was your vk q? 
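Putting the pieces from ottomata's 14:06-14:10 explanation together, a minimal working sketch of the hiera-plus-template approach (key names are invented for illustration; note the loop goes over .keys.sort so the block variable is the key itself):

```
# hieradata, in the private repo (hypothetical keys):
wmde_secrets:
  facebook: XXXXX
  google: '12345'
  mm-wikidata-pass: YYYY
```

```
# in the puppet class:
$wmde_secrets = hiera('wmde_secrets')
```

```
# config.erb -- renders one "key value" line per secret:
<% @wmde_secrets.keys.sort.each do |k| -%>
<%= k %> <%= @wmde_secrets[k] %>
<% end -%>
```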
[14:34:38] elukey: ja i saw those [14:34:52] code review + task update, just wanted to know if it made sense [14:35:14] oh ja makes sense, i don't fully understand the VSL_API thing [14:35:24] is it because that tag has a bunch of key/vals in the one value? [14:36:04] and, maybe i missed it or forgot, but what are we going to use that for? [14:36:05] nono I use the %{}X formatter and it leverages the fmtvar machinery, that needs something like var: value [14:36:26] meanwhile VSL: bla has only error messages, without key value [14:36:34] this is why I had to be "creative" [14:37:01] VSL: timeout 12345 [14:37:02] like that? [14:37:16] the VSL_API formatter should be used, if we want, to signal that a record is half empty due to VSL errors [14:37:22] not strictly needed [14:37:25] aye [14:37:27] nope something like [14:37:27] sounds cool [14:37:32] VSL_API: timeout [14:37:48] VSL_API: memory overflow [14:37:59] so you can specify formatters like [14:38:08] %{VSL_API:timeout} [14:38:15] (missed the end X) [14:38:30] %{VSL_API:memory}X [14:38:37] %{VSL_API:*}X [14:38:49] the first two match a prefix [14:38:58] the last one puts whatever error message [14:39:13] hm, aye, k, i see, because in this case the 'key' is the full 'VSL_API: key' [14:39:21] and the value is whatever comes after [14:39:25] whereas usually you get [14:39:27] 'key: val' [14:39:27] ? [14:39:31] yes [14:39:34] aye got it [14:39:34] cool [14:39:45] I had to invent something to trick vk :) [14:39:48] nice :) [14:39:53] looks good [14:42:49] also I am almost sure that increasing the VSL timeout will not harm anything [14:43:03] the default -L value (maximum incomplete transactions backlog) is 1000 [14:43:06] that is huge [14:43:21] maybe we'll get to tune it for text/upload [14:43:33] but the only side effect will be vk consuming a bit more memory [14:51:12] Analytics-Kanban, Datasets-Webstatscollector, RESTBase-Cassandra, Patch-For-Review: Better response times on AQS (Pageview API mostly) {melc} - https://phabricator.wikimedia.org/T124314#2447441 (Eevans) >>! In T124314#2446753, @elukey wrote: > * One major problem that we have been working on (Jos... [14:55:55] joal: --^ I think that the answer is yes but not sure about it [14:56:00] will leave it for you :) [14:56:26] * joal reads [14:58:59] Analytics-Kanban, Datasets-Webstatscollector, RESTBase-Cassandra, Patch-For-Review: Better response times on AQS (Pageview API mostly) {melc} - https://phabricator.wikimedia.org/T124314#2447502 (JAllemandou) @Eevans : It is indeed using the same compaction strategy (Leveled). I assume it is becau... [14:59:31] milimetric: Heya [14:59:55] hey joal [15:00:39] milimetric: if you want to review: https://gerrit.wikimedia.org/r/#/c/288210/ [15:02:11] ottomata: going to skip the EB meeting but let me know if you need me [15:04:26] took a quick look, joal, looks like all the stuff we talked about [15:04:39] I can look more later [15:04:59] milimetric: just one thing left behind: field for automatic deletion (I couldn't find a name) [15:07:29] joal you mean the field that says whether a page_delete is happening because of a move_redir? [15:17:53] Analytics: Add monthly request stats per article title to pageview api - https://phabricator.wikimedia.org/T139934#2447610 (Milimetric) While the solutions proposed here are simple enough, the API's scaling problems were due mostly to it running on HDDs instead of SSDs. The transition to SSDs is almost fini... 
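As a concrete illustration of the formatter elukey describes above (taken from the patch under review rather than any released varnishkafka, and assuming the new X formatter supports the usual @name field renaming like the existing ones):

```
# varnishkafka.conf fragment -- illustrative, not the production webrequest format:
format.type = json
format = %{VSL_API:timeout@vsl_timeout_err}X %{VSL_API:*@vsl_api_err}X %{User-agent@user_agent}i
```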
[15:20:43] Analytics, MediaWiki-API, User-bd808: Run ETL for wmf_raw.ActionApi into wmf.action_* aggregate tables - https://phabricator.wikimedia.org/T137321#2447634 (Milimetric) oh, sweet, then it should be pretty straightforward. Grab me in IRC or a hangout whenever we're both free. As far as examples, all... [15:28:14] mforns: hi [15:33:34] joal: i recognize that chart (money) in your background - love it, so interesting [15:33:44] urandom: :) [15:34:20] urandom: I like the reminder of orders of magnitude ;) [15:34:25] yes [15:34:39] urandom: any thoughts on the bulk loading thing? [15:34:45] urandom: if it's the moment of course [15:35:32] joal: i guess i'm unclear about the parameter that tunes the size [15:35:56] isn't it your code that determines when to close a writer? doesn't that correlate with one file? [15:36:35] joal: you're using CQLSSTableWriter? [15:36:44] urandom: hmmm, that doesn't match the idea I have of it [15:37:05] urandom: I'm using CQLBulkOutputFormat [15:37:42] oh i see [15:38:06] urandom: I have copied and corrected some bits of it for the version we use, but it globally works [15:39:05] urandom: In hadoop, when using files, you get one file per reducer (meaning 1 instance of the outputformat is used per reducer) [15:39:45] I see [15:39:57] urandom: In my job, I use 12 reducers, and my understanding is that the outputformat (to be precise, the record writer instantiated by the output format) is taking care of writing the SSTables in a 'good' way [15:40:40] urandom: But, on each reducer's folder, there are like hundreds of SSTables at the end of the job, each of them fairly small [15:42:06] joal: kk [15:42:21] joal: so, the bulk load finishes 5 times faster? [15:42:38] but it takes longer for the compaction to catch up? [15:42:42] urandom: the load time per se, even closer to 10 times faster [15:42:59] urandom: https://grafana.wikimedia.org/dashboard/db/aqs-cassandra-compaction [15:43:13] urandom: I let you look at the last 7 days [15:43:39] doesn't seem to be making much progress [15:43:44] urandom: indeed [15:43:51] urandom: https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&c=Analytics+Query+Service+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [15:43:56] urandom: But it is working ! [15:44:36] so is that all from one load op? [15:44:48] urandom: yessir [15:44:54] urandom: 1 month at a time [15:45:37] so it's been 6 days working to compact the import of a month of data? [15:45:44] urandom: correct [15:46:18] urandom: Last time we imported a month using classical CQL, it took ~20h to import, and about 2 days to finish compaction [15:46:51] on leveled compaction? [15:47:01] urandom: yes [15:47:04] wow [15:47:14] makes a big difference [15:47:32] urandom: But, with more data being added, compaction time gets longer [15:47:47] yeah, there is that [15:48:00] urandom: But it would be the same with bulk I assume [15:48:08] yeah [15:49:13] urandom: I wonder what to do, wait for compaction, or just wipe and restart using CQL [15:50:38] joal: local_group_default_T_pageviews_per_article_flat: [20285/4, 13/10, 27, 0, 0, 0, 0, 0, 0] [15:51:43] this is ringing a bell [15:51:49] urandom: what I read from that, is that bulk loader doesn't make us take advantage of leveled compaction [15:52:18] urandom: waiting for you to take the call :) [15:54:54] joal: sorry, one moment, looking at some stuff [15:57:15] np urandom, I'm heading to standup [16:00:43] a-team standup!
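For anyone reading along, the bracketed figures urandom pastes above are the per-level SSTable counts nodetool prints for a leveled-compaction table; the command and the reading below are my interpretation (keyspace naming assumed), not something stated in the channel:

```
nodetool cfstats local_group_default_T_pageviews_per_article_flat
#   SSTables in each level: [20285/4, 13/10, 27, 0, 0, 0, 0, 0, 0]
# L0 holds 20285 SSTables against a target of ~4 (hence the "/4"): the bulk load
# dropped a large backlog of small SSTables straight into L0, and leveled
# compaction has to merge them up level by level -- which is why the import was
# ~10x faster but compaction then lags behind for days.
```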
[16:01:01] ottomata: been trying to join for a while [16:01:12] ottomata: let me try incognito mode [16:01:24] ottomata, elukey: could someone with root `apt-get install jvm-tools` on the aqs nodes? [16:01:39] joining [16:01:39] halfak: hello, could you please have a quick look on https://gerrit.wikimedia.org/r/#/c/295494/ ? We're going to include rev_by_bot flag to the event schemas to filter out bot edits in chage-prop precache updates for ORES. Wanted to ask if you're ok with the naming? [16:02:16] nuria_: want us to call you? [16:02:59] milimetric: just restarted [16:03:36] urandom: super urgent or can it wait ~1hr? [16:03:47] elukey: it can wait [16:03:55] super (ops meeting :) [16:04:00] will get back in abit [16:05:55] hey ori. if you have 10 min between now and 10am, let's jump in a Hangout, please. :) [16:13:42] leila: ori sleeping now i think [16:13:50] thanks, nuria_. ;) [16:33:31] joal: quick question - are the berlin picture inside the tar.gz ok? If so I'll send them to the team :) [16:33:53] ottomata: let me know when you are out of ops meeting and we can restart name node [16:35:33] elukey: to me the pictures areb great :) [16:36:01] ah nuria_ ja ok! will do. [16:36:19] Analytics, Zero: Update to new page view definition, including breakout of App page views - https://phabricator.wikimedia.org/T99967#1302907 (Nuria) Marking this as resolved as app pageviews are reported separetely both internally and externally. [16:36:24] Analytics, Zero: Update to new page view definition, including breakout of App page views - https://phabricator.wikimedia.org/T99967#2447964 (Nuria) Open>Resolved [16:40:02] leila: nuria_: hey [16:40:06] Analytics: Productionize Druid Pageview Pipeline - https://phabricator.wikimedia.org/T138261#2447990 (Milimetric) [16:40:08] Analytics-Kanban: Productionize Druid loader - https://phabricator.wikimedia.org/T138264#2447989 (Milimetric) [16:40:17] hi ori. 
:) [16:40:24] Analytics-Kanban: Productionize Druid loader - https://phabricator.wikimedia.org/T138264#2395115 (Milimetric) [16:40:26] Analytics: Productionize Druid Pageview Pipeline - https://phabricator.wikimedia.org/T138261#2395071 (Milimetric) [16:40:28] Analytics: Set up dedicated Druid Zookeeper - https://phabricator.wikimedia.org/T138263#2447998 (Milimetric) [16:40:33] i can't jump on a hangout before 10 [16:40:43] Analytics: Puppetize pivot UI - https://phabricator.wikimedia.org/T138262#2448016 (Milimetric) [16:40:45] Analytics: Productionize Druid Pageview Pipeline - https://phabricator.wikimedia.org/T138261#2395071 (Milimetric) [16:40:46] i have to take noam to summer camp [16:40:47] Analytics-Kanban: Productionize Druid loader - https://phabricator.wikimedia.org/T138264#2395115 (Milimetric) [16:40:48] bbiab [16:43:09] Analytics, Services: Get http level ops stats for AQS from varnish - https://phabricator.wikimedia.org/T133171#2448044 (Nuria) [16:43:11] Analytics, RESTBase: REST API entry point web request statistics at the Varnish level - https://phabricator.wikimedia.org/T122245#2448046 (Nuria) [16:44:26] Analytics, Analytics-Dashiki: Clean up property passing in dashiki - https://phabricator.wikimedia.org/T132691#2206929 (Nuria) [16:45:37] Analytics: Clean up datasets.wikimedia.org - https://phabricator.wikimedia.org/T125854#2448089 (Nuria) [16:45:39] Analytics: Move datasets.wikimedia.org to analytics.wikimedia.org/datasets - https://phabricator.wikimedia.org/T132594#2448088 (Nuria) [16:47:07] joal: all right, just wanted to know if the tar.gz was ok, will send to everybody :) [16:47:45] Analytics, Analytics-Cluster: Deploy hive-site.xml to HDFS separately from refinery - https://phabricator.wikimedia.org/T133208#2225074 (Nuria) Check whether this is actually been done and update oozie to use the new file path [16:57:02] Analytics: Landing page for unqiue devices datasets on dumps (like other datasets we have) - https://phabricator.wikimedia.org/T130542#2448182 (Nuria) Open>Resolved [17:08:00] ottomata1: I'm still here! :) It's 3 hours in about 15 mins, so I'll give you another poke then) ! : [17:08:53] bye team! going afk! [17:15:11] hmm having some internet problems! [17:15:13] nuria_: am ready [17:15:15] wanna? [17:15:23] ottomata: on meeting with erik z [17:15:33] oh ok [17:16:51] you tell me when you ready nuria_ :) [17:17:00] ottomata1: k [17:17:51] addshore: shall we do your thing then? [17:17:58] oooh, yeh! :D [17:19:14] ottomata1: I guess I also need to add include statistics::wmde to stat1002? [17:19:42] addshore: hm, ja i'll take care of that, will amend or make another patch [17:19:46] first lemme get your sectrets in [17:19:47] okay! [17:19:48] on stat1002 [17:19:51] ahh yes! [17:19:53] can you make a file in your homedir [17:19:54] that has them? [17:20:01] make it only readable by you or something [17:20:03] yeh, they should already be there, let me just find the path [17:20:13] i'll read them out of there and then put them in ops private [17:20:13] k [17:22:51] ottomata1: pmed you the path! [17:27:07] urandom: sorry I forgot the packages! [17:27:47] hmm, addshore k cool, looking over patch one more time [17:28:04] can you change a couple of user params? [17:28:13] yup! [17:28:32] managehome => false [17:28:32] home => $dir [17:28:32] ? 
[17:29:18] urandom: I installed the pkg on aqs100[456] [17:29:29] elukey: cool [17:29:33] ottomata1: done [17:29:35] oh hm, also, addshore i think the git::clone origins are wrong [17:29:42] oooh *looks* [17:29:44] they look like commands, not origins [17:29:47] git clone ... [17:30:02] oh dam, yes! [17:30:11] addshore: maybe change the name of the logrotate [17:30:16] is there a reason its called 'research-wmde'? [17:30:29] maybe we can call it statistics-wmde or something? since this is in statistics::wmde class? [17:30:47] yup, changed to statsistics-wmde [17:31:13] k [17:31:19] urandom: do you need them also on 100[123] ? [17:32:22] the package seems really safe to install but the pessimist ops in me needs some assurance :) [17:32:31] msg jdlrobson trying to join [17:32:46] / [17:32:54] oh addshore i told you slightly wrong in the config.erb template [17:32:55] need [17:33:03] @wmde_secrets.keys.sort.each.do ... [17:33:15] with .keys. in there [17:33:21] since we want to iterate on the sorted list of keys [17:36:45] updated ottomata1! [17:38:21] (urandom going afk but if you need me I'll be online later on) [17:46:59] ok addshore merging...will run puppet, fingers crossed :) [17:47:10] :D *crosses fingers* [17:48:05] Analytics: Landing page for unique devices datasets on dumps (like other datasets we have) - https://phabricator.wikimedia.org/T130542#2448624 (Aklapper) [17:48:58] haha [17:48:59] addshore: [17:49:03] require => User["${dir}/src"], [17:49:16] also,i just noticed you got double slashes beceause $dir defaults to /srv/.../ [17:49:17] so [17:49:22] remove the trailing slash on the param [17:49:38] and fix the require => User[$dir..] [17:49:44] doing! [17:49:45] Could not find dependency User[/srv/analytics-wmde//src] [17:50:02] i thik you meant to say File [17:52:43] oh my internet has taken this time to slow down so so so much, still trying to pull the latest patch! [17:55:56] ottomata1: the fix patch is up! https://gerrit.wikimedia.org/r/298311 [17:56:49] k [18:02:48] addshore: it ran! [18:02:50] let's see [18:02:57] okay! [18:03:10] HMM< addshore oh. [18:03:11] hm [18:03:13] hmmmm [18:03:23] addshore: i'm going to make another patch, will explain in a sec... [18:03:32] okay! [18:06:46] adhttps://gerrit.wikimedia.org/r/#/c/298314/ [18:06:48] addshore: https://gerrit.wikimedia.org/r/#/c/298314/ [18:07:42] oooh, okay! [18:12:02] ok addshore done. [18:12:08] check /a/analytics-wmde on stat1002 [18:12:12] see if it looks good to you [18:12:44] hehe [18:12:45] /bin/sh: 1: /a/analytics-wmde/minutely.sh: Permission denied [18:12:47] in minutely.log [18:12:57] haha! [18:13:00] ja [18:13:00] you want [18:13:07] mode => '0754', i think [18:13:09] on those files [18:13:51] okay! [18:14:29] I'll put a patch up for that once my internet catches up with reality again... [18:14:56] Thanks so much ottomata1 ! Also, regarding https://gerrit.wikimedia.org/r/#/c/296407/ when and how can we start that running? [18:16:38] oh, um, i guess anytime, i'd work with joal on that though, since you two have all the context for it [18:17:16] awesome! :) I wasn't sure who to poke next for that one! [18:21:11] addshore: i dont' mind, shall I just patch the mode real quick? [18:21:22] that would be amazing! [18:21:40] I'm going to go and run a new netowrk cable through the middle of my house for the rest of the day, something is not right with the wifi... 
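A sketch of what the manifest fixes above amount to: 'origin' takes a repository URL rather than a "git clone ..." command, the require points at a File resource, and dropping the trailing slash on $dir avoids the double slash. The resource title, URL, and ownership details are assumptions for illustration:

```
git::clone { 'analytics/wmde/scripts':
    ensure    => 'latest',
    directory => "${dir}/src",
    origin    => 'https://gerrit.wikimedia.org/r/p/analytics/wmde/scripts',
    owner     => 'analytics-wmde',
    require   => File["${dir}/src"],  # a File resource, not User[...]
}
```

The "Permission denied" on minutely.sh from a bit earlier is the other half of it: the script file resources need an executable mode (e.g. mode => '0754') so cron can actually run them.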
[18:24:56] addshore: hmmmmmm [18:24:57] also [18:25:03] shoulda thought of this before [18:25:03] but [18:25:09] since we moved this into the statistics module [18:25:15] we could just reuse the statistics user [18:25:21] instead of having a special analytics-wmde user [18:25:22] hm [18:30:27] joal: you still around? [18:33:27] milimetric: ping? [18:51:52] Analytics, Analytics-EventLogging, Performance-Team: Make webperf eventlogging consumers use eventlogging on Kafka - https://phabricator.wikimedia.org/T110903#2449204 (ori) a:Krinkle>ori [19:01:14] ottomata: back if you have time [19:02:45] j let's do it! [19:03:19] ok nuria_ step one, merge the change and run puppet [19:03:23] that won't restart anything, so is safe to do now [19:03:29] doing that [19:03:47] ottomata: where are we running puppet? [19:03:51] in the meantime, read this if you haven't already [19:03:52] https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Administration#High_Availability [19:03:58] analytics1001 and analytics1002 [19:04:18] only 6 tasks to https://phabricator.wikimedia.org/T140000!!! we should totally try and get it for something cool :) [19:04:32] ottomata: rename event bus? ^ :) [19:04:48] haha [19:04:50] eventbus is here to stay [19:04:54] :) k [19:05:17] but still, https://phabricator.wikimedia.org/T139995 is not created yet, so 6 and counting [19:05:51] ottomata: k [19:06:22] ottomata: that'll teach you to be stubborn! [19:06:33] haha [19:06:42] ottomata: i bet you wish it were called Surge, now, eh? [19:06:50] how much do you want to bet? [19:06:52] heheh [19:07:43] i'm honestly surprised it didn't end up being called Eventoid [19:08:14] hm nuria_ something weird happened with your comit [19:08:23] the sha you updated the submodule too didn't exist in gerrit [19:08:27] ottomata: when running puppet? [19:08:29] https://gerrit.wikimedia.org/r/#/c/298330/ [19:08:30] no [19:08:33] when attempting to merge [19:08:50] ottomata: what? ok taht si a 1st one [19:08:54] *that [19:08:58] fatal: reference is not a tree: 265e301ac3f91162e5266effeaa4b5fb64cecc9c [19:08:58] Unable to checkout '265e301ac3f91162e5266effeaa4b5fb64cecc9c' in submodule path 'modules/cdh' [19:09:21] nuria_: probably you commit it to hte non rebased sha or something, who knows [19:09:36] you almost always want to do [19:09:39] cd modules/cdh [19:09:41] git checkout mster [19:09:42] master* [19:09:44] git pull [19:09:45] cd ../.. [19:09:51] git add modules/cdh && git commit [19:09:58] so you can be sure you are at the head of master [19:10:20] ottomata: right, that is what *i though* i did but obviously ahem .. i did not [19:10:25] ja dunno [19:10:26] s'ok though [19:11:01] ok running puppet on an01 and an2 [19:11:01] 02 [19:11:25] nuria_: log into one of those and you can run the failover commands when it is time [19:11:35] k on 1002 [19:11:49] uh oh... [19:11:55] ottomata: yesss [19:12:05] bubu? [19:12:53] hah nuria_ [19:12:58] you didn't actually make the parameter :p [19:13:03] you just added documentation for it [19:13:06] what? 
[19:13:09] in hadoop.pp [19:13:11] need [19:13:20] $yarn_log_aggregation_retain_seconds = $::cdh::...defaults::yarn_log_aggregation_retain_seconds, [19:13:24] in the parameters [19:13:30] sorry,i shoulda noticed that [19:13:40] Analytics: Design new UI for Wikistats 2.0 - https://phabricator.wikimedia.org/T140000#2449417 (Milimetric) [19:13:49] ottomata: nah, very sorry on my end [19:13:51] well, nobody took it so I did :) [19:15:56] ottomata: i can fix it now unless you just did it [19:17:12] nuria_: haven't please go ahead [19:22:55] ottomata: ah, i see what is wrong with my module now. fixing. [19:28:09] ottomata: done, i think [19:30:33] nuria_: merged, want to try the submodule commit again? [19:30:48] ottomata: yes [19:32:19] Analytics-Kanban, Patch-For-Review: Event Logging doesn't handle kafka nodes restart cleanly - https://phabricator.wikimedia.org/T133779#2449497 (Ottomata) Kafka restart tests in beta look good. librdkafka logging gets a little verbose when a broker is down. It prints a connection refused message about... [19:38:18] ottomata: let me know if it looks correct now [19:38:50] that looks good nuria_, merging [19:41:34] urandom: hey, what's up? [19:42:36] milimetric: you are late enough that i decided to ask for forgiveness, rather than permission :) [19:42:44] milimetric: so... forigive me! [19:42:52] done, forgiven [19:43:02] wow, that was easy! [19:43:21] :) now, what am I forgiving for [19:43:36] that was not a part of the deal, i'm afraid [19:43:41] :) [19:43:49] heh [19:44:04] milimetric: you are forgiving me for a unthrottling compaction on aqs1004, and double compactor concurrency [19:44:12] s/double/doubling/ [19:44:28] oh ok, cool, yall think that'll help? [19:44:36] because it was crawling, and i wanted to see what would happen [19:44:44] answer: it crawls slightly faster [19:45:01] hm, not awesome, but useful to know I guess [19:45:25] jo will be interested tomorrow, I'll relay if you two don't chat first [19:45:33] negative results are results too [19:45:38] kk [19:46:06] i'm going to undo this before the day is up, but it had almost no effect, and the settings are ephemeraal [19:46:15] with a non-Finnish spelling [19:46:21] thanks for the experiment [19:46:42] heh [19:46:48] thanks for forgiving me! [19:47:30] :) experiments on non prod instances always welcome [19:47:44] it's a pretty tame experiment :) [19:48:13] if it had gone totally sideways, it would have driven utilization, and i would have simply reset it [19:48:26] driven it up, that is [19:50:09] ottomata: am i running puppet ? [19:50:16] haha [19:50:22] nuria_: i am but it isn't setting the value properly [19:50:26] dont' see what's wrong immediately, am looking [19:50:30] ottomata: k [19:50:31] yarn.log-aggregation.retain-seconds [19:50:31] [19:50:47] ottomata: k, there must be a typo somewhere [19:50:52] ottomata: looking too [19:51:14] milimetric: https://ganglia.wikimedia.org/latest/graph_all_periods.php?h=aqs1004.eqiad.wmnet&m=cpu_report&r=4hr&s=by%20name&hc=4&mc=2&st=1468262986&g=cpu_report&z=large&c=Analytics%20Query%20Service%20eqiad [19:51:38] AH [19:51:39] nuria_: [19:51:40] need [19:51:44] <%= in the template [19:51:46] not <% [19:51:54] also, put a trailling space after the var please [19:52:04] <%= @yarn_log_aggregation_retain_seconds %> [19:54:04] ottomata: argh SO SORRY! [20:01:53] nuria_: merged [20:02:03] proceed with submodule update [20:08:17] nuria_: ? 
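For the record, the two fixes worked through above (19:13 and 19:51) amount to something like the following in the cdh module; the defaults class path is an assumption mirroring the chat, and the key points are that the parameter actually has to be declared and that the template tag is '<%=', not '<%':

```
# manifests/hadoop.pp (fragment):
class cdh::hadoop (
    $yarn_log_aggregation_retain_seconds = $::cdh::hadoop::defaults::yarn_log_aggregation_retain_seconds,
    # ... existing parameters ...
) {
    # ... unchanged ...
}
```

and in the yarn-site.xml.erb template, the emit tag plus the trailing space before %>:

```
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value><%= @yarn_log_aggregation_retain_seconds %></value>
</property>
```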
[20:08:31] ottomata: just did [20:08:38] ah great [20:08:48] ottomata: https://gerrit.wikimedia.org/r/#/c/298341/ [20:11:07] nuria_: oook finally, its there now [20:11:08] ok [20:11:12] ottomata: pufffff [20:11:20] do we really have to restart namenode? [20:11:22] or just resourcemanager? [20:11:25] this is a yarn setting, no? [20:11:51] ottomata: from docs it said deletionservice on namenode but let me dig that out again [20:11:57] ok [20:12:31] DeletionService: A service that runs inside the NodeManager and deletes local paths as and when instructed to do so. [20:12:34] hm [20:13:18] ottomata: basically a daemon on a cron right? [20:13:34] uhhh [20:13:36] i dunno what it is [20:13:43] certinaly not a cron [20:14:04] ottomata: well kind-of-a-cron [20:14:20] tasktimer i bet [20:15:30] nuria_: according to some docs i see [20:15:32] i think this is a nodemanager [20:15:33] setting [20:15:35] not even resourcemanager [20:15:55] hm either that or jobhistory server [20:17:10] nuria_: if you find docs lemme know [20:17:57] ottomata: well, only sourceforge thus far [20:18:07] ottomata: i read the deleteion service start here: [20:19:59] *deletion service [20:21:42] ottomata: Ok, the info about "deletion service restart" is for this other log setting: yarn.nodemanager.delete.debug-delay-sec [20:21:54] ottomata: which is the one I tough we would use originally [20:22:16] Number of seconds after an application finishes before the nodemanager's DeletionService will delete the application's localized file directory and log directory. To diagnose Yarn application problems, set this property's value large enough (for example, to 600 = 10 minutes) to permit examination of these directories. After changing the property's value, you [20:22:16] must restart the nodemanager in order for it to have an effect. The roots of Yarn applications' work directories is configurable with the yarn.nodemanager.local-dirs property (see below), and the roots of the Yarn applications' log directories is configurable with the yarn.nodemanager.log-dirs property (see also below). [20:22:23] nuria_: according to this [20:22:24] https://amalgjose.com/2015/08/01/enabling-log-aggregation-in-yarn/ [20:22:26] just nodemanagers [20:22:40] right, that makes sense [20:23:13] nuria_: note also that that says 'nodemanager' not 'namenode' [20:23:53] restarting nodemanagers is easy...just restart them [20:24:00] but there's one on each worker node [20:24:16] nuria_: going to restart nodemanager on one worker node and watch logs, i don't expect to see anything [20:24:31] ottomata: k [20:28:03] ok nuria_ looks pretty normal, i'm going to restart each nodemanager one by one [20:28:05] slowly [20:28:09] then we can wait and see if it works [20:30:43] ottomata: where did joal look log filesizes? [20:30:48] ottomata: on hue? [20:31:02] nuria_: naw [20:31:03] ottomata: must be no? to report overall in cluster [20:31:14] sudo -u hdfs hdfs dfs -du -s -h /var/log/hadoop-yarn [20:31:16] probably would do it [20:31:40] !log rolling restart of hadoop-yarn-nodemanager to apply log aggregation retention seconds [20:37:38] PROBLEM - Hadoop NodeManager on analytics1032 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [20:38:56] ottomata: that alarm is from the restart, correct? 
[20:39:13] yea [20:39:18] dunno why that one took so long to restart [20:39:48] RECOVERY - Hadoop NodeManager on analytics1032 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [21:16:03] bye a-team, see you tomorrow! [21:16:24] bye!