[08:14:38] !log restarting kafka on kafka{1012,1014,1022,1020,2001,2002} for Java upgrades. EL will be restarted as well (sigh)
[08:14:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[08:17:32] kafka1012 + EL restart done
[08:21:04] Thanks elukey :)
[08:22:55] joal: o/
[08:26:45] (Additional properties are not allowed (u'varnish4hits', u'varnish4' were unexpected))
[08:26:48] mmmmm
[08:32:27] ahaahhaha just noticed
[08:32:28] https://phabricator.wikimedia.org/T114443#2244925 (Ottomata) YEEHAW
[08:44:00] Yay !
[08:45:26] kafka1014 + EL restarted
[08:48:49] (CR) Joal: [C: 2 V: 2] "Merging for deploy." [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/285535 (https://phabricator.wikimedia.org/T132267) (owner: Alex Monk)
[08:52:21] (PS1) Joal: Update aqs to b05ebbe [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/285908
[08:53:14] (CR) Joal: [C: 2 V: 2] "Merging for deploy." [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/285908 (owner: Joal)
[08:58:52] kafka1022 + EL restarted
[08:59:04] 1020 is the last one
[09:00:22] elukey: quick help for me?
[09:03:26] !log Deploying aqs on aqs1001
[09:03:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[09:03:47] joal: sure!
[09:04:02] what can I do?
[09:04:19] I tried to deploy aqs on beta, but it failed with that error: Host key verification failed.
[09:04:27] elukey: Does it ring a bell to you ?
[09:05:33] might be related to the bastion change?
[09:05:43] where did you run the commnad?
[09:05:45] *command
[09:06:07] from deployment-tin
[09:06:35] Actually, looks like aqs1001 is down :(
[09:07:18] WHHHHAAAAAT?
[09:08:17] Apr 28 09:03:02 aqs1001 firejail[26880]: Error: Cannot find module '/srv/deployment/analytics/aqs/deploy-cache/revs/49b3ad6cfc319a7fb29b33b2a4e2ad61bb5888eb/src/server.js'
[09:08:38] joal --^
[09:09:07] elukey: deployed failed, offering to rollback, I said yes
[09:09:20] Then tried again, succeeded, but seems not to have worked :(
[09:09:36] all right let me depool the host
[09:09:57] elukey: I think it's just about to restart restbase
[09:11:35] doesn't seem to work..
[09:11:45] elukey: hm
[09:11:50] well it is depooled now so we can work without pressure :)
[09:12:27] elukey: any idea where to find restbase logs on aqs?
[09:12:37] logstash only :(
[09:12:41] cassandra logs are in /var/log
[09:12:45] elukey: right
[09:15:07] elukey: seems only to be cassandra logs in logstash
[09:16:02] https://logstash.wikimedia.org/#/dashboard/elasticsearch/restbase and then aqs1001
[09:16:13] buuuut I found also journald logs
[09:16:13] Thx !
[09:16:52] joal: https://dpaste.de/n4qk
[09:17:15] not sure if it can help
[09:18:00] elukey: related to deploy having failed previously I guess
[09:18:25] elukey: the repo /srv/deployment/analytics/aqs/deploy seems up to date
[09:18:31] joal: I missed a point maybe: you did the rollback
[09:18:36] or no?
[09:18:39] elukey: I did
[09:18:46] elukey: Then deployed again
[09:18:59] ahhh okok makes sense, but then you rolledback?
[09:19:00] elukey: was there something I should have done in the middle?
[09:19:15] elukey: scap offered to di it because of deployment failure
[09:19:34] so atm the code should be pre-deployment theoretically
[09:19:40] and it should work
[09:19:46] right?
[09:19:55] nope, as I just wrote, retried to deploy, successfully
[09:20:04] So code is deployed
[09:20:28] and folder is up to date, but because of rollback, restbase have been stopped I think
[09:20:44] ahhhhhhhh okok
[09:21:05] elukey: I hope that restarting restbase would work
[09:21:19] but I don't know how it's done
[09:21:51] on aqs1002, I see some firejail thing I don't know about
[09:23:44] so on aqs1001 I tried to restart aqs but same error
[09:23:50] aqs1002 looks fine
[09:23:58] I deployed on aqs1001 only
[09:24:22] elukey: that's weird, the folder exists
[09:24:24] well the new code has something wrong then :)
[09:24:44] Where do see logs ?
[09:25:33] I posted a paste earier on, let me grab it
[09:25:34] elukey: actually src folder in cache folder is empty !
[09:25:38] https://dpaste.de/n4qk
[09:27:14] elukey: I need some root here :)
[09:28:07] joal: bat-cave?
[09:28:14] elukey: OMW
[09:54:15] (CR) Mforns: "LGTM! There is still a with the misplaced slash, I will change that and merge. I changed the config to '" [analytics/dashiki] - https://gerrit.wikimedia.org/r/285255 (https://phabricator.wikimedia.org/T133736) (owner: Nuria)
[09:56:25] (PS7) Mforns: Add out of service banner to dashiki [analytics/dashiki] - https://gerrit.wikimedia.org/r/285255 (https://phabricator.wikimedia.org/T133736) (owner: Nuria)
[09:57:58] (CR) Mforns: [C: 2 V: 2] "LGTM!" [analytics/dashiki] - https://gerrit.wikimedia.org/r/285255 (https://phabricator.wikimedia.org/T133736) (owner: Nuria)
[10:14:42] Analytics, DBA: Set up bucketization of editCount fields {tick} - https://phabricator.wikimedia.org/T108856#2247651 (mforns) @Nuria @jcrespo We should not support editCount bucketting for new schemas. This task is exclusive for existing schemas at the time of the audit, as a backwards compatibility feat...
[10:44:20] !log deployed aqs on all three nodes (Thanks elukey !!!!)
[10:44:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[10:44:27] \o/
[10:48:21] Analytics: Make a script to automatise the 4 commands to run for aqs deployment - https://phabricator.wikimedia.org/T133863#2247732 (JAllemandou)
[10:51:52] kafka1020 and EL restarted, completed kafka restarts
[10:59:29] elukey: sorry
[11:00:42] elukey: went away for longer than planned
[11:01:41] joal: no worries!
[11:01:49] :)
[11:02:00] so we were saying - where you did you get the beta error ?
[11:03:15] Logged in deployment-tin.deployment-prep.eqiad.wmflabs
[11:03:45] then followed the same routine as with prod (except I didn't mees up the submodule)
[11:04:30] in the deploy-tin machine, git folders are up to date
[11:04:47] But can't deploy
[11:05:02] deployment-aqs01.deployment-prep.eqiad.wmflabs returned [255]: Host key verification failed
[11:06:06] I can access to it, really weird.. mmmm
[11:07:06] can you paste the command used in here so I can try?
[11:07:12] I have a suspicion
[11:07:22] elukey: on deploy-tin: dpeloy
[11:07:26] deploy
[11:07:30] sorry
[11:08:56] joal: so soemthing like deploy --limit deployment-aqs01.deployment-prep.eqiad.wmflabs ?
[11:09:41] elukey: nope
[11:14:23] elukey: folders are not setup correctly in deploy-aqs01
[11:15:33] buuuuu
[11:15:44] elukey: no src folder
[11:21:15] joal: lunch and then debug?
[11:24:35] * elukey lunch!
[13:00:59] joal back :)
[13:08:23] mobrovac: do you have 5 minutes to review https://gerrit.wikimedia.org/r/#/c/285393/7 by any chance?
[13:08:29] just to be sure :)
[13:08:54] i'll take a look in 5 mins or so elukey
[13:09:20] \o/
[13:09:26] just found a little typo
[13:32:33] mobrovac: thankssss I missed the eqiad bit, really sorry
[13:32:51] ehe
[13:32:55] np elukey
[13:32:58] glad i could help
[13:33:49] all right fixing the typo then merging, next step is to configure the kafka codfw nodes
[15:06:20] mforns: I am going to deploy new dashiki code
[15:06:39] madhuvishy: hola! were you able to get the info from amanda yesterday?
[15:15:13] nuria_, ok, do you need copilot?
[15:15:50] mforns: no, I will deploy to test and after to the main instance, of all of them the only one that gets significant visits (ahem... 20+)
[15:15:55] is teh browser reports
[15:15:57] *the
[15:16:00] hehe
[15:23:29] elukey: sorry, went AFK for a while
[15:23:32] back now elukey
[15:23:49] o/ currently doing something dangerous, brb :
[15:23:50] :P
[15:23:57] np elukey
[15:24:05] elukey: If I can help, please ping
[15:42:53] elukey, mforns deployed new dashiki code to all instances
[15:42:59] cool
[15:43:40] nuria_, I found a bug in the breakdowns... sorry for not spotting it in the review
[15:44:07] if you add another metric besides the default metric, the breakdown format breaks
[15:44:17] only for the non-default metric
[15:45:54] niceeeeee
[15:51:19] Analytics-Kanban, Patch-For-Review: Out of service banner in dashiki - https://phabricator.wikimedia.org/T133736#2248901 (Nuria)
[15:54:49] (PS1) Mforns: Add unique devices api [analytics/dashiki] - https://gerrit.wikimedia.org/r/285977 (https://phabricator.wikimedia.org/T122533)
[15:55:25] Analytics-Kanban, Patch-For-Review: Visualize unique devices data in dashiki {bear} - https://phabricator.wikimedia.org/T122533#2248911 (mforns)
[15:55:26] nuria_: https://gerrit.wikimedia.org/r/#/c/285976/1 - code review to return the custom 503 from varnish on monday :)
[15:56:16] Analytics, Analytics-Dashiki, Easy, Patch-For-Review: Dashiki breakdown layout problems. UI - https://phabricator.wikimedia.org/T133312#2248914 (Nuria) Uploading screenshot of what else needs fixing
[15:56:41] Analytics, Analytics-Dashiki, Easy, Patch-For-Review: Dashiki breakdown layout problems. UI - https://phabricator.wikimedia.org/T133312#2248917 (Nuria) {F3942687}
[16:02:34] joal: standuuuuppp
[16:03:51] a-team: FYI, mforns and nuria_ are inclass this morning
[16:04:01] yes
[16:05:13] nuria_: Yeah I spoke to amanda, there was no problem with wikimetrics - just some misconfiguration during launching reports causing discrepancies
[16:05:30] madhuvishy: is that solved then?
[16:06:06] yeah
[16:07:52] Analytics-Kanban: Make the AQS unique-devices endpoint return 'devices' as a numeric value, not a string - https://phabricator.wikimedia.org/T133527#2248933 (JAllemandou)
[16:08:45] Analytics-Kanban, Patch-For-Review: Out of service banner in dashiki - https://phabricator.wikimedia.org/T133736#2248936 (madhuvishy) p:Triage>Normal a:Nuria
[16:12:04] https://wikitech.wikimedia.org/wiki/Category:Data_stream
[16:19:47] https://www.citusdata.com/docs/citus/5.0/tutorials/tut-real-time.html#tut-real-time
[16:24:57] Analytics-Kanban, DC-Ops, EventBus, MediaWiki-Cache, and 5 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#2248960 (elukey)
[16:25:27] Analytics-Kanban, Operations, ops-codfw, Patch-For-Review: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2248961 (elukey)
[16:25:47] Analytics-Kanban, Operations, ops-codfw, Patch-For-Review: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2184249 (elukey)
[16:25:53] Analytics-Kanban, DC-Ops, EventBus, MediaWiki-Cache, and 5 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#1881702 (elukey)
[16:45:38] Analytics-Kanban: Make the AQS unique-devices endpoint return 'devices' as a numeric value, not a string - https://phabricator.wikimedia.org/T133527#2248999 (Nuria) Open>Resolved
[16:45:50] elukey: aqs-deploy deployment debug?
[16:45:53] ot later?
[16:45:56] Analytics-Kanban, Patch-For-Review: Allow filtering of data breakdowns in pageview metric - https://phabricator.wikimedia.org/T131547#2249000 (Nuria) Open>Resolved
[16:46:07] joal: sure!
[16:46:17] Analytics-Kanban, RESTBase-Cassandra: Better response times on AQS (Pageview API mostly) {melc} - https://phabricator.wikimedia.org/T124314#2249002 (Nuria)
[16:46:19] Analytics-Kanban: analyse AQS queries over the previous month or weeks to have a better understanding of how compaction should behave - https://phabricator.wikimedia.org/T133016#2249001 (Nuria) Open>Resolved
[16:46:29] Analytics-Kanban, Datasets-General-or-Unknown: Improve loading Analytics Query Service with data {slug} [5 pts] - https://phabricator.wikimedia.org/T115351#2249004 (Nuria)
[16:46:33] joal:
[16:46:36] filesystem?
[16:46:37] joal: mmm thinking about it again, do you mind if we do it tomorrow?
[16:46:42] for cassandra partition?
[16:46:50] ottomata1: XFS! :P
[16:46:52] elukey: no problemo :)
[16:46:56] ext4 is now
[16:47:00] hehe
[16:47:05] ext4 is on aqs* now
[16:47:13] yeah I think it is fine
[16:47:28] ottomata1: Don't know - I assume if we go for ext4 we'd like to remove journal stuff, but not sure
[16:47:32] I'll ask urandom
[16:48:01] joal: mmm I would keep journaling, even with raid
[16:48:15] anyhoowww
[16:48:18] logging off people
[16:48:22] byyeeeee o/
[16:48:27] bye elukey :)
[16:48:41] byeee
[16:49:26] ottomata1: I assume we'll do as it currently is for aqs
[16:53:06] ok
[16:59:50] Analytics-Kanban, Operations, ops-eqiad: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2249043 (Ottomata) Ok! We discussed partitioning today. We'd like the following: - / a small (30G?) RAID 1 partition on the first 2 drives. - 2 RAID 10 (probably ext4, asking to be su...
[17:02:00] joal: did you guys do tasking?
[17:02:22] nuria_: we did a quick pointing of kanban, and discussed what is coming next
[17:02:28] joal: k
[17:02:42] nuria_: but no proper follow-the-rule tasking
[17:03:06] joal: will schedule 1 hour extra tasking
[17:03:25] nuria_: ok :)
[17:06:55] (PS1) Joal: [WIP] Include webrequest refine oozie job into load one [analytics/refinery] - https://gerrit.wikimedia.org/r/285998 (https://phabricator.wikimedia.org/T130731)
[17:10:37] (CR) Ottomata: "Hm! Ok Joal I I didn't notice that. Sounds like a good convention!" [analytics/refinery] - https://gerrit.wikimedia.org/r/285400 (https://phabricator.wikimedia.org/T130732) (owner: Joal)
[17:10:56] (CR) Ottomata: [C: 1] "Maybe we could document the convention in a README somewhere?" [analytics/refinery] - https://gerrit.wikimedia.org/r/285400 (https://phabricator.wikimedia.org/T130732) (owner: Joal)
[17:16:33] Analytics, Hovercards, Reading-Web-Backlog, Reading-Web-Sprint-71-Matisse-Monet-Kandinsky-and-the-Departing-Painters: Verify X-Analytics: preview=1 in stable - https://phabricator.wikimedia.org/T133067#2249095 (dr0ptp4kt) Pending on https://gerrit.wikimedia.org/r/#/c/285051/ and friends.
[17:28:06] (PS4) Joal: Normalize oozie job names (bundles, coords, wfs) [analytics/refinery] - https://gerrit.wikimedia.org/r/285400 (https://phabricator.wikimedia.org/T130732)
[17:28:46] (CR) Joal: "Good catch ottomata, updated README.md" [analytics/refinery] - https://gerrit.wikimedia.org/r/285400 (https://phabricator.wikimedia.org/T130732) (owner: Joal)
[17:47:42] Analytics-Kanban, Operations, ops-eqiad: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2249164 (Ottomata) If it is easier to put the `/` partition RAID1 across the first 4 drives, that is fine too.
[17:49:16] (CR) Ottomata: [C: 1] Normalize oozie job names (bundles, coords, wfs) [analytics/refinery] - https://gerrit.wikimedia.org/r/285400 (https://phabricator.wikimedia.org/T130732) (owner: Joal)
[17:52:47] ottomata: I get it why you are in SF now, it's for the "bring your kids to work" day !
[17:53:06] * joal should have gone too
[17:53:40] haha
[17:53:42] yeah!
[17:56:38] Analytics-Kanban, Operations, ops-eqiad: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2249206 (RobH) So I've chatted with @ottomatta about this in IRC. Setting up this suggestion: |/|sda1, sddb1|radi1 |/var/lib/cassandra/a|sda2.sdb2, sdc1, sdd1|raid10 |/var/lib/cassandr...
[18:42:35] yay mforns :D
[18:42:36] * lzia claps for Marcel. :D
[18:42:41] :D
[18:45:14] omg, download as csv? :D
[18:52:38] indeed! I am glad credit is shared!
[19:00:13] a-team: retro?
[19:00:25] nuria_, yep
[19:00:28] nuria_: nope - lunch
[19:00:30] nuria_: will skip, late here
[19:00:45] mforns for the win !
[19:00:51] a-team: will reschedule for tomorrow
[19:00:52] :]
[19:01:54] a-team:rescheduled retro for tomorrow, right after standup
[19:02:15] ok
[19:02:53] elukey was missing invitation to retro, have corrected that
[19:04:10] hey guys! Is Aaron on your team? what might his IRC handle be?
[19:04:26] madhuvishy: I added chambers for retro tomorrow
[19:04:39] cc ggellerman
[19:04:47] I cannot remove the other room
[19:04:57] MusikAnimal: it's halfak
[19:05:08] o/
[19:05:33] MusikAnimal, I'm generally around -analytics, but I live in -research. :)
[19:05:57] oh ok :)
[19:09:50] (PS2) Joal: Include webrequest refine oozie job into load one [analytics/refinery] - https://gerrit.wikimedia.org/r/285998 (https://phabricator.wikimedia.org/T130731)
[19:10:48] A-team, logging off for tonight !
[19:10:55] See you tomorrow :)
[19:11:09] Analytics-Kanban, Patch-For-Review: Make webrequest load and refine jobs a single bundle - https://phabricator.wikimedia.org/T130731#2249494 (JAllemandou)
[19:20:05] joal, bye!
[19:49:14] (CR) Ottomata: [C: 1] "One nit!" (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/285998 (https://phabricator.wikimedia.org/T130731) (owner: Joal)
[20:09:21] Analytics-Kanban, Operations, ops-eqiad: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2249676 (RobH) Ok, old comment was wrong, had bad disk info. New suggestion: |mount|disks|raid level|size |/|sda1,sdb1, sdc1, sdd1 |raid10|50GB |/var/lib/cassandra/a|sda2.sdb2, sdc2, sd...
[20:10:44] Analytics-Kanban, Operations, ops-eqiad: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2249677 (Ottomata) +1, makes sense. Thank you!
[20:18:00] Analytics-Kanban, Operations, ops-eqiad: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2249692 (Cmjohnson)
[20:20:15] Analytics-Kanban, Operations, ops-eqiad: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2249697 (Cmjohnson) Racked one each in A4, C5, D4
[20:36:46] (CR) Nuria: Add unique devices api (2 comments) [analytics/dashiki] - https://gerrit.wikimedia.org/r/285977 (https://phabricator.wikimedia.org/T122533) (owner: Mforns)
[20:39:20] (CR) Mforns: Add unique devices api (2 comments) [analytics/dashiki] - https://gerrit.wikimedia.org/r/285977 (https://phabricator.wikimedia.org/T122533) (owner: Mforns)
[20:39:35] nuria_, thanks, will look at a way to duplicate less code
[20:39:55] mforns: i think we are going to have to move code to sitematrix
[20:41:00] sitematrix?
[20:41:09] mforns: let me think about it and may be i can offer a better suggestion, i can work on top of your changeset
[20:41:47] nuria_, ok by me! I was starting to work on the problem with the breakdown
[20:41:51] mforns: when it comes to visualization we need to think what are we going to do for projects for which there is no data
[20:41:59] aha
[20:42:27] cause for many small projects (unlike the pageviewapi) data for uniques is not computed
[20:43:28] mforns: will think about this tomorrow
[20:45:00] nuria_, I know, ok. let's look at it tomorrow
[20:45:07] mforns: k
[20:45:10] cya
[23:14:22] (PS1) Alex Monk: [WIP] Revert "Revert "Database selection"" [analytics/quarry/web] - https://gerrit.wikimedia.org/r/286094 (https://phabricator.wikimedia.org/T76466)
[23:25:03] Analytics-EventLogging: Raw User-Agents get stored - https://phabricator.wikimedia.org/T64978#2250261 (Krinkle)
[23:25:42] (CR) Alex Monk: [C: -1] "need to deal with this later" [analytics/quarry/web] - https://gerrit.wikimedia.org/r/286094 (https://phabricator.wikimedia.org/T76466) (owner: Alex Monk)