[00:56:11] 06Analytics-Kanban, 06Services (blocked), 15User-mobrovac: Upgrade AQS to node 6 - https://phabricator.wikimedia.org/T155642#2971163 (10mobrovac) \o/ Thank you guys! Is the task complete then? [01:09:00] ottomata: thanks from me too for the explanation regarding the "file not found" error, I added something about it here https://wikitech.wikimedia.org/w/index.php?title=Analytics/Data/Webrequest&diff=1395762&oldid=1093578 [05:38:13] 06Analytics-Kanban, 06Services (blocked), 15User-mobrovac: Upgrade AQS to node 6 - https://phabricator.wikimedia.org/T155642#2971512 (10Nuria) 05Open>03Resolved [05:38:21] 10Analytics, 10ChangeProp, 10Citoid, 10ContentTranslation, and 12 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2971513 (10Nuria) [08:05:58] 10Analytics, 15User-Elukey: Add webrequest_stats to Druid in order to explore it with Pivot - https://phabricator.wikimedia.org/T150844#2971623 (10elukey) 05Open>03declined Not really needed [08:39:06] 10Analytics, 10Analytics-General-or-Unknown: analytics.wikimedia.org loads resources from third parties - https://phabricator.wikimedia.org/T156347#2971655 (10Nemo_bis) [09:41:22] 10Analytics, 10ChangeProp, 10EventBus, 06Revision-Scoring-As-A-Service, and 2 others: Rewrite ORES precaching change propagation configuration as a code module - https://phabricator.wikimedia.org/T148714#2971851 (10Liuxinyu970226) [09:44:39] joal: o/ if you are ok, I'd go for Eric's suggestions about rack awareness - https://gerrit.wikimedia.org/r/#/c/334035/ [09:44:48] the CR is to bootstrap aqs1007-a [09:44:51] elukey: o/, reading [09:44:59] New IPs are already in place [09:46:55] elukey: just read - some questions :) [09:47:25] elukey: In commit message you say: "Boostrapping the instance could be done in a separate" [09:47:28] commit but I would like to avoid the risk of having [09:47:30] the default Cassandra instance joining the cluster [09:47:43] I don't understand :( [09:49:57] it is the thing that we were discussing yesterday about puppet cassandra classes.. [09:50:06] hm [09:50:32] currently if puppet does not find and instance configured in hiera, it thinks that it needs to create a default cassandra instance [09:50:48] that will be bootstrapped [09:50:56] as soon as the first puppet run will go [09:51:23] (we discussed about killing that bootstrap promtly to avoid issues etc..) [09:51:48] but since we need to bootstrap one instance anyway, I thought that aqs1007-a was a good candidate [09:51:55] I don't understand which part of the commit could be done in a separate commit - Which portion is the bootstrapping one? [09:52:31] well the bootstrap will be done by cassandra while starting the first time during the puppet run [09:53:01] if an instance is specified in hiera (https://gerrit.wikimedia.org/r/#/c/334035/7/hieradata/hosts/aqs1007.yaml) it will create that one [09:53:09] otherwise it will use a "default" configuration [09:53:44] usually for restbase https://gerrit.wikimedia.org/r/#/c/334035/7/hieradata/hosts/aqs1007.yaml can be done separately [09:54:12] Ok, so the file that could have been commited separately is hieradata/hosts/aqs1007.yaml (because it means actually launching the instance and not just defining the setup) [09:54:13] because even if the default instance starts, it will not find any ssl certificate in the private repo and it will fail very soon [09:54:32] But, we want to commit it at the same time, because of bootstrapping and default conf issue [09:54:35] Right? 
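For context on the hiera discussion above: declaring the instance explicitly in hieradata/hosts/aqs1007.yaml is what stops puppet from provisioning and bootstrapping a "default" Cassandra instance on the first run. A minimal sketch of what that per-host file amounts to; the key names and addresses are illustrative assumptions, not the contents of the actual gerrit change.

```bash
# Hypothetical sketch only -- see the linked gerrit change (334035) for the real file.
# With an instance declared here, the puppet cassandra classes configure "a"
# instead of falling back to a default instance that would join the cluster.
cat <<'EOF' > hieradata/hosts/aqs1007.yaml
cassandra::instances:
  "a":
    listen_address: 10.64.x.x   # placeholder for the new per-instance IP
    rpc_address: 10.64.x.x      # placeholder
EOF
```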
[09:54:41] correct :) [09:54:44] this is the idea [09:55:05] Oooook, it definitely makes sense, but I think the commit message could be a bit more detailed on that aspect ;) [09:55:38] Or maybe it is because I'm puppet noob? [09:56:42] nono I can definitely improve it, the msg assumes too many things :) [09:57:06] I wanted to know if the idea was ok, together with the rack awareness [09:58:05] oh yes, of course: Well as discussed earlier, as long as there is no conflict between virtual-reacks and physical ones, sounds good to me to keep rack-number = rep-factor, therefore using rack-1 for aqs1007 etc [10:02:53] super [10:03:08] amended the commit msg, hopefully it gives a bit more background [10:04:04] elukey: Yay, way better for me :) [10:04:08] Thanks ! [10:05:27] thanks for reviewing! [12:02:04] * elukey lunch [12:14:25] heloooo [12:30:01] mforns: o/ [12:30:11] elukey, joal: FYI I am going to restart one cassandra instance on aqs1004 to pick up openjdk updates [12:30:41] ok elukey [12:30:42] checked nodetool netstats and compactionstats, all quiet [12:31:51] done.. [12:32:01] will leave it running for one hour to see if anything blows up [12:32:31] and then I'll restart all the instances [12:32:44] (At once to spice up a bit these boring day :P) [12:33:41] PROBLEM - cassandra-a CQL 10.64.0.126:9042 on aqs1004 is CRITICAL: connect to address 10.64.0.126 and port 9042: Connection refused [12:34:41] RECOVERY - cassandra-a CQL 10.64.0.126:9042 on aqs1004 is OK: TCP OK - 0.000 second response time on 10.64.0.126 port 9042 [12:43:05] this one was due to cassandra taking a bit too much to start :) [12:57:31] PROBLEM - cassandra-b CQL 10.64.0.127:9042 on aqs1004 is CRITICAL: connect to address 10.64.0.127 and port 9042: Connection refused [12:58:31] RECOVERY - cassandra-b CQL 10.64.0.127:9042 on aqs1004 is OK: TCP OK - 0.000 second response time on 10.64.0.127 port 9042 [13:51:59] (03PS6) 10Mforns: [WIP] Add banner activity jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/331794 (https://phabricator.wikimedia.org/T155141) [13:53:58] proceeding with aqs1005 instances! [13:55:09] (03CR) 10Mforns: "Changes added in the last patch set:" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/331794 (https://phabricator.wikimedia.org/T155141) (owner: 10Mforns) [14:00:31] (03PS3) 10Joal: Update sqoop script with labsdb specificity [analytics/refinery] - 10https://gerrit.wikimedia.org/r/334042 (https://phabricator.wikimedia.org/T155658) [14:03:29] taking a break a-team, see you at standup [14:03:41] ok joal :] [14:54:45] AQS cluster restarted, everything looks good [14:55:02] during the next days we'll have to do the hadoop cluster too sigh [14:56:10] elukey: for security stuff? [14:56:28] if we do, along the way: https://phabricator.wikimedia.org/T147879 [14:56:29] i can help [15:04:14] ottomata: yep, java security updates [15:04:34] I don't expect an update for java 7 before next week, thoigh [15:08:11] * fdans goes to make some super late lunch before the standup [15:16:37] ottomata: ah snap the fstabs! [15:16:51] oh yeahhhh [15:17:00] but maybe we can wait for a kernel upgrade or something similar ? [15:17:14] why oh, because we don't need to reboot for the java one? 
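The rolling restart elukey describes above (one instance at a time, after checking that streams and compactions are quiet) boils down to something like the following; the systemd unit name is an assumption based on the instance naming, not quoted from the log.

```bash
# Pre-checks mentioned in the conversation: nothing streaming or compacting.
nodetool-a netstats
nodetool-a compactionstats
# Restart instance "a" to pick up the openjdk update (unit name is an assumption).
sudo systemctl restart cassandra-a
# Wait for the instance to come back before moving on; the brief icinga CQL
# alert above was just Cassandra taking a while to start listening on 9042.
nodetool-a status
```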
[15:17:14] hm [15:17:34] yes only restart the daemons [15:17:40] yeah, no worried [15:17:41] worries [15:24:05] a more recent kernel is available in the mean time (wmf8 based on 4.4.39) compared to what's running on kafka*, but that doesn't have critical bugfixes requiring a reboot, only non-criticial ones an generally a ton of bugfixes from the 4.4.x stable series [15:35:36] urandom: o/ if you have time https://gerrit.wikimedia.org/r/#/c/334035/ [15:35:43] aqs1007-a ready to go [15:52:07] milimetric: yt? [15:54:21] 06Analytics-Kanban, 07Easy, 13Patch-For-Review: Add monthly request stats per article title to pageview api - https://phabricator.wikimedia.org/T139934#2972903 (10Nuria) 05Open>03Resolved [15:54:35] 06Analytics-Kanban, 07Easy, 13Patch-For-Review: Standardize logic, names, and null handling across UDFs in refinery-source {hawk} - https://phabricator.wikimedia.org/T120131#2972904 (10Nuria) 05Open>03Resolved [15:55:56] 06Analytics-Kanban, 15User-Elukey: Ongoing: Give me permissions in LDAP - https://phabricator.wikimedia.org/T150790#2972915 (10spatton) Thanks much, @elukey and @Milimetric. I have two additional permissions requests for our fundraising analytics consultants. Could you please add them to the wmf LDAP group, an... [15:58:14] 06Analytics-Kanban, 15User-Elukey: Ongoing: Give me permissions in LDAP - https://phabricator.wikimedia.org/T150790#2972920 (10MoritzMuehlenhoff) @spatton: The "wmf" LDAP group is limited to WMF staff. There's a separate "nda" LDAP group covering access to various web-based services. What in particular do you... [15:58:41] 06Analytics-Kanban, 10Fundraising-Backlog, 13Patch-For-Review: Productionize banner impressions druid/pivot dataset - https://phabricator.wikimedia.org/T155141#2972924 (10mforns) I think the implementation of the jobs is finished now. I've made some changes to the patch: * banner impressions renamed to banne... [15:58:50] 06Analytics-Kanban, 15User-Elukey: Ongoing: Give me permissions in LDAP - https://phabricator.wikimedia.org/T150790#2972926 (10MelodyKramer) Hi @elukey and @Milimetric - I'm still unable to sign in. I believe I am using the correct u and p. Would it be possible to screenshare or evaluate with one of you? [16:01:06] milimetric: standdupppp [16:01:51] 06Analytics-Kanban, 15User-Elukey: Ongoing: Give me permissions in LDAP - https://phabricator.wikimedia.org/T150790#2972961 (10spatton) @MoritzMuehlenhoff, right now we're just looking to get these two users Pivot access, so the NDA group would be just fine. That groups page describes: wmf - for WMF staff/**c... [16:03:12] 06Analytics-Kanban, 10EventBus, 06Operations, 10Traffic, and 2 others: Productionize and deploy Public EventStreams - https://phabricator.wikimedia.org/T143925#2972976 (10BBlack) cache_misc for this are all implemented and live now. The [[ https://github.com/wikimedia/operations-puppet/blob/production/mod... [16:07:43] 06Analytics-Kanban, 10EventBus, 06Operations, 10Traffic, and 2 others: Productionize and deploy Public EventStreams - https://phabricator.wikimedia.org/T143925#2972982 (10Ottomata) YESSSSSSSSSSSSSSSSS awesome! Thank you! [16:11:57] 06Analytics-Kanban, 15User-Elukey: Ongoing: Give me permissions in LDAP - https://phabricator.wikimedia.org/T150790#2972985 (10elukey) @spatton: this is an example of a task that I opened for the WMDE folks that needed access - https://phabricator.wikimedia.org/T148832 - we'll probably need to do it again for... 
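Since the openjdk update only needs daemon restarts rather than a reboot, one generic way to confirm a restarted JVM is no longer running against the old, now-deleted libraries is the lsof +L1 trick; this check is an addition for illustration, not something quoted from the channel.

```bash
# List open files whose on-disk copy has been unlinked (link count < 1);
# a freshly restarted java daemon should not show the old libjvm here.
sudo lsof +L1 2>/dev/null | awk '$1 ~ /java/ && /libjvm/ {print}'
```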
[16:21:03] elukey: sorry, meeting: let me have a look [16:22:47] \o/ [16:25:06] elukey: i'm guessing this might fail on first run, if for no other reason than the ferm rules on other hosts [16:25:36] elukey: i mean, the bootstrap will fail [16:26:15] elukey: which, maybe you want, i dunno? [16:28:00] urandom: ah makes sense.. maybe I can run puppet on aqs100[456] before aqs1007 [16:28:11] elukey: sure [16:28:22] i wasn't sure what you were saying yesterday, now that i think about it [16:28:47] were you looking to be able to kick that off separately from the puppet merge, or you want the bootstrap to start then? [16:29:18] if you want it to start, yeah, maybe disable puppet on 1007, and have it run elsewhere first [16:29:32] urandom: it would be great to set up the hosts and then control when to start the instance bootstrap process, but event coupling it with the first puppet run is ok [16:29:32] if you don't, it'll probably fail like this :) [16:29:42] yah [16:29:49] it might fail regardless [16:30:10] well I don't mind as long as the cluster is not affected :) [16:30:13] because of the scap/trebuchet deployments [16:30:19] ah yes [16:30:20] no, no problem there [16:31:25] elukey: the second instance will be really easy [16:31:38] because all of this chicken-egg stuff is behind you [16:31:54] a pupet run will kick it right off [16:32:00] hello analytics! did you happen to catch https://phabricator.wikimedia.org/T156312 ? didn't want it to get lost in the abyss of Phab tasks [16:32:13] urandom: super :) [16:32:26] just some inconsistency with the new per-article monthly and aggregate [16:33:50] 06Analytics-Kanban, 15User-Elukey: Ongoing: Give me permissions in LDAP - https://phabricator.wikimedia.org/T150790#2973069 (10elukey) Melody is able to access now, just followed up on Hangouts. [16:34:25] elukey: if you do have to restart a failed bootstrap, you might want to rm -r /srv/cassandra-a/{commitlogs,data,saved_caches} before restarting [16:34:39] probably not necessary here, but it won't hurt [16:34:41] musikanimal, in 5 minutes we have our tasking meeting, I'll mention this task in it [16:34:51] ok thanks! :) [16:35:22] urandom: noted [16:36:02] urandom: if you are around and don't mind, it would be great to merge in a bit and try to bootstrap the instance [16:36:05] 10Analytics, 10Pageviews-API: Monthly aggregate endpoint returns unexpected results and invalid timestamp - https://phabricator.wikimedia.org/T156312#2970596 (10Nuria) Thanks for ping, Will look into it, hopefully next week. [16:36:28] elukey: yeah, sure [16:37:03] \o/ [16:37:07] ottomata: you pinged me before, what's up [16:40:37] oh, milimetric wanted to talk about some dir names and the sync solution alex came up with [16:40:45] but i'm listening to prometheus talk now, will ping you later? [16:40:50] actually, quick q [16:40:57] i'm considering a directory named 'adhoc' [16:41:02] for the random stuff from stat1002/stat1003 [16:41:05] that is not reports, etc. [16:41:09] thoughts? 
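Putting urandom's advice above together, the sequence for the new host looks roughly like this; the host loop and puppet invocation are assumptions about tooling, and only the rm -r line is quoted from the conversation.

```bash
# 1) Run puppet on the existing nodes first so their ferm rules learn about
#    the new instance IPs (otherwise the bootstrap can fail on first contact).
for h in aqs1004 aqs1005 aqs1006; do
  ssh "$h.eqiad.wmnet" 'sudo puppet agent --test'
done
# 2) Then let the first puppet run on aqs1007 kick off the bootstrap.
ssh aqs1007.eqiad.wmnet 'sudo puppet agent --test'
# 3) If the bootstrap fails and has to be retried, wipe the half-written
#    instance data first, as suggested above:
ssh aqs1007.eqiad.wmnet 'sudo rm -r /srv/cassandra-a/{commitlogs,data,saved_caches}'
```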
[16:41:15] instead of common/ or legacy/ [16:41:39] 10Analytics: Add Analytics-Wikistats 2.0 phab project tag - https://phabricator.wikimedia.org/T146043#2973092 (10Nuria) [16:42:14] 10Analytics, 10MediaWiki-API: Copy cached API requests from raw webrequests table to ApiAction - https://phabricator.wikimedia.org/T155478#2973093 (10Nuria) [16:43:21] 10Analytics, 06Operations, 10netops, 13Patch-For-Review: Open temporary access from analytics vlan to new-labsdb one - https://phabricator.wikimedia.org/T155487#2973097 (10Nuria) [16:48:55] 10Analytics, 10Analytics-Wikistats: Technical stack for wikistats 2.0 - https://phabricator.wikimedia.org/T156384#2973128 (10Nuria) [16:49:55] 10Analytics: Pull data for edit reconstruction from labs and push it back after reconstruction - https://phabricator.wikimedia.org/T152788#2973155 (10elukey) [16:49:59] 10Analytics, 06Operations, 10netops, 13Patch-For-Review: Open temporary access from analytics vlan to new-labsdb one - https://phabricator.wikimedia.org/T155487#2973154 (10elukey) 05Open>03Resolved [16:52:24] 06Analytics-Kanban, 10Pageviews-API: Monthly aggregate endpoint returns unexpected results and invalid timestamp - https://phabricator.wikimedia.org/T156312#2973160 (10Nuria) [16:53:39] urandom: I think I'd need to deploy https://gerrit.wikimedia.org/r/#/c/333905/5/scap/aqs-prod first right? [16:54:10] ¯\_(ツ)_/¯ [16:54:17] ahahahaha [16:54:17] elukey: i think so :) [16:54:27] not sure if puppet is able to pull regardless [16:54:45] i still struggle with getting all the Is dotted and Ts crossed wrt scap [16:55:24] feels like there are a lot of moving parts there [16:55:24] puppet should create a brand new repo if not present, not sure if it will be able to pull if not whitelisted in the dsh config [16:55:27] mmmm [16:55:39] (w/ scap i mean) [16:56:28] the patch you mean? If so we realized during the last deployment that we were missing tons of things :( [16:57:16] elukey: i think you're right about merging that patch first. i just meant, i'm never quite confident when it comes to scap [16:57:27] ahh yes me too :P [16:57:46] lots of things need to be a certain way, and i don't do it often enough for it to have been committed to my grey matter :) [16:58:15] it is like the old nintendo game cheats [16:58:23] ha! [16:58:28] right sequence of commands, otherwise no luck [16:58:45] up-down-left-right-a-a-a-b-down-b [16:59:05] WRONG, it should have been: up-down-left-right-a-b-a-b-down-b [16:59:22] how do you call the half-circle move? [16:59:38] left-down-right maybe [16:59:41] ¯\_(ツ)_/¯ [16:59:43] hahahaha [16:59:56] it's scap to me [17:00:04] :D [17:00:07] that'll be my new saying [17:00:29] basically, s/greek/scap/ [17:01:01] 06Analytics-Kanban: Populate aqs with legacy pageviews on new endpoint - https://phabricator.wikimedia.org/T156388#2973209 (10Nuria) [17:01:24] 06Analytics-Kanban: Populate aqs with legacy pageviews on new endpoint - https://phabricator.wikimedia.org/T156388#2973209 (10Nuria) [17:02:23] 06Analytics-Kanban: Populate aqs with legacy pageviews on new endpoint - https://phabricator.wikimedia.org/T156388#2973229 (10Nuria) [17:02:26] ottomata: would https://gerrit.wikimedia.org/r/#/c/333905/5 require only a git pull on tin in the aqs-deploy repo or a full build with docker? 
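The scap change elukey needs to deploy first (gerrit 333905) is essentially about listing the new host in the aqs-prod dsh targets file of the deploy repository; the exact contents below are an assumption based on the usual one-hostname-per-line format, not the real diff.

```bash
# In the analytics/aqs/deploy repository (sketch; see the linked gerrit change):
echo 'aqs1007.eqiad.wmnet' >> scap/aqs-prod
git add scap/aqs-prod
git commit -m 'Add aqs1007 to the aqs-prod deploy targets'
```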
[17:02:45] I'd say the former but not really sure with scap [17:04:58] 06Analytics-Kanban: Populate aqs with legacy pageviews on new endpoint - https://phabricator.wikimedia.org/T156388#2973236 (10Nuria) [17:05:16] elukey: should only be git pull [17:05:24] docker build thing doesn't deal with scap, only npm stuff [17:06:50] 06Analytics-Kanban: Populate aqs with legacy pageviews on new endpoint - https://phabricator.wikimedia.org/T156388#2973209 (10Nuria) Likely to involve some hive to transform data format into something that can be loaded on aqs Create an AQS job [17:07:46] ottomata: super [17:07:56] milimetric: did you see my q ^^^^ way above? [17:08:01] 06Analytics-Kanban: Populate aqs with legacy pageviews on new endpoint - https://phabricator.wikimedia.org/T156388#2973266 (10Nuria) [17:08:55] 06Analytics-Kanban: Populate aqs with legacy pageviews on new endpoint - https://phabricator.wikimedia.org/T156388#2973209 (10Nuria) [17:10:42] ottomata: I probably need to hear what Alex's solution is first, I'm not sure why we'd need an adhoc or common in the plan we had yesterday [17:10:52] but in meeting now too, we can chat after metrics maybe? [17:11:03] (gonna grab lunch and prep after tasking) [17:11:41] k after metrics is good, alex came up with a non NFS sync solution to one directory [17:11:44] where --delete will still work [17:12:32] 06Analytics-Kanban: Create AQS endpoint to serve legacy pageviews - https://phabricator.wikimedia.org/T156391#2973277 (10Nuria) [17:13:34] 06Analytics-Kanban: Populate aqs with legacy pageviews - https://phabricator.wikimedia.org/T156388#2973209 (10Nuria) [17:15:14] 06Analytics-Kanban: Populate aqs with legacy pageviews - https://phabricator.wikimedia.org/T156388#2973300 (10Nuria) [17:16:12] 06Analytics-Kanban: Create AQS endpoint to serve legacy pageviews - https://phabricator.wikimedia.org/T156391#2973302 (10Nuria) [17:19:15] 06Analytics-Kanban: Create AQS endpoint to serve legacy pageviews - https://phabricator.wikimedia.org/T156391#2973308 (10Nuria) [17:20:32] 06Analytics-Kanban: Create AQS endpoint to serve legacy pageviews - https://phabricator.wikimedia.org/T156391#2973277 (10Nuria) [17:20:47] 06Analytics-Kanban: Populate reportcard with legacy pageviews - https://phabricator.wikimedia.org/T156388#2973311 (10Nuria) [17:21:49] 06Analytics-Kanban: Populate aqs with legacy pageviews - https://phabricator.wikimedia.org/T156388#2973209 (10Nuria) [17:22:15] 06Analytics-Kanban: Populate aqs with legacy pageviews - https://phabricator.wikimedia.org/T156388#2973209 (10Nuria) [17:24:26] 06Analytics-Kanban: Populate aqs with legacy pageviews - https://phabricator.wikimedia.org/T156388#2973318 (10Nuria) [17:27:20] 06Analytics-Kanban: Populate aqs with legacy pageviews - https://phabricator.wikimedia.org/T156388#2973209 (10Nuria) [17:29:04] nuria,joal - (when you have finished the meeting) anything against me deploying AQS ? [17:29:20] elukey: for aqs1007? [17:29:20] elukey: no [17:29:37] joal: nope https://gerrit.wikimedia.org/r/#/c/333905/5 [17:29:44] it will not deploy anything code-wise [17:29:58] elukey: Ah! 
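As ottomata confirms just below, picking that change up needs only a pull of the deploy repo on tin followed by a scap deploy; the docker build step is only for the npm dependencies. The repo path here is the usual convention and an assumption in this sketch.

```bash
# On tin (deployment host); repo path is an assumption based on convention.
cd /srv/deployment/analytics/aqs/deploy
git pull
scap deploy 'Pick up the new aqs-prod target list (no code change)'
```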
right :) Please go on :) [17:30:03] super :) [17:30:10] elukey: sorry I gorgot :) [17:32:08] == CANARY == [17:32:09] :* aqs1004.eqiad.wmnet [17:32:09] analytics/aqs/deploy: fetch stage(s): 100% (ok: 1; fail: 0; left: 0) [17:32:12] analytics/aqs/deploy: config_deploy stage(s): 100% (ok: 1; fail: 0; left: 0) [17:32:15] analytics/aqs/deploy: promote and restart_service stage(s): 100% (ok: 1; fail: 0; left: 0) [17:32:18] analytics/aqs/deploy: finalize stage(s): 100% (ok: 1; fail: 0; left: 0) [17:32:21] canary deploy successful. Continue? [y]es/[n]o/[c]ontinue all groups: [17:32:24] \o/ [17:32:29] Yay elukey :) [17:34:27] all right it does deploy one host at the time [17:34:39] not sure if it depools/repools [17:34:54] and it failed on 1007 for missing deploy-service credentials [17:34:57] good [17:35:24] now that everything is up to date, joal I'd merge the puppet change to bootstrap aqs1007-a (since Eric is around) [17:35:28] wdyt? [17:36:20] elukey: better in our morning, or better now with Eric - Your call :) [17:36:47] joal: well are you going to be around for a bit? [17:37:04] until end of metrics [17:37:33] ah nice, so I can proceed [17:38:26] 06Analytics-Kanban: Move reportcard to dashiki and new datasources - https://phabricator.wikimedia.org/T130117#2973369 (10Nuria) * Set up a redirect from old report card to new domain http://analytics.wikimedia.org/reportcard/ which points to the tabs layout * Content it needs to include: Pageviews overall, uni... [17:46:37] 06Analytics-Kanban: Move reportcard to dashiki and new datasources - https://phabricator.wikimedia.org/T130117#2973392 (10Nuria) [17:57:57] a-team: aqs1007-a is bootstrapping [17:58:05] awesome elukey [17:58:47] not a big deal (to me at least, my stuff still runs), but i noticed there's a query that's been running on hive for >5 hours now. i would be suspicious something is wrong with it [17:58:50] as predicted from aqs1004 a/b [17:58:54] urandom: --^ [17:59:01] awesome [17:59:25] elukey: it's nice when there are no surprises, eh? :) [17:59:43] ebernhardson: taking care of it [18:00:29] zareen: I'm not the only one anymore to notice your queries are huge - can we plan on doing soemthing? [18:00:55] elukey: if you install cassandra-tools-wmf, then you can do: cassandra-stream -nt nodetool-b to see a nice interactive display [18:00:59] elukey: sort of top-like [18:01:11] errr: cassandra-streams -nt nodetool-b [18:01:25] elukey: ah, ok, deployment done! NICE [18:01:33] elukey: one less thing to worry about [18:01:38] urandom: still didn't check how to install the tool, but it would be nice! I am checking netstats now [18:01:55] elukey: apt-get install cassandra-tools-wmf [18:01:56] nuria: I am not sure if the hosts got depooled though [18:01:57] {{done}} [18:02:03] ah brutally, not via puppet [18:02:17] well I guess I can do a require package [18:02:18] i thought we added the package, but clearly not [18:02:25] to puppet i mean [18:02:34] elukey: i see, i imagine there is not a deppol loog [18:05:18] elukey: i'm probably biased, but there is a lot of handy stuff in that package [18:05:36] elukey: mostly with regard to multi-instance [18:06:13] elukey: like c-foreach-restart and c-foreach-nt [18:06:43] yeah I have still to make myself familiar with the tools [18:06:49] I guess I can install it on 1007 now [18:07:39] urandom: so since I am bootstrapping cassandra-a: cassandra-stream -nt nodetool-a [18:07:42] right? [18:14:16] fdans: testing latest patch on beta [18:14:37] zareen - ping? 
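The bootstrap-monitoring commands mentioned above, collected in one place; all of them are quoted from the conversation, installed by hand because the package was not yet pulled in by puppet at this point.

```bash
sudo apt-get install cassandra-tools-wmf   # not puppetized yet, installed directly
cassandra-streams -nt nodetool-a           # top-like view of the streams into instance "a"
nodetool-a netstats                        # the raw numbers behind the same view
```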
[18:15:16] joal: probably unrelated, but yarn.wikimedia.org just stopped responding, and hue says it can't talk to the resource manager [18:15:31] it might magically fix itself though [18:15:37] jajaja [18:15:40] ebernhardson: it usually does so [18:16:26] ebernhardson: standby RM as taken over [18:16:50] woo for resilient systems [18:18:21] a-team: we just had a network outage in eqiad (row C related) and some analytics* hadoop hosts were taken out for sure. If you see anything else weird in metrics of various services it might be due to that [18:18:36] thanks for letting us know elukey [18:18:39] ebernhardson: --^ [18:18:50] elukey: primary RM gone [18:19:23] mmm it didn't go down [18:19:25] checking [18:21:38] yes 1002 is the master now [18:21:40] woa [18:21:45] pretty fancy! [18:21:58] just the apache stuff doesn't know how to deal with master switch [18:22:02] same thing for hdfs master node [18:22:16] yeah [18:25:17] (03CR) 10AndyRussG: [WIP] Add banner activity jobs (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/331794 (https://phabricator.wikimedia.org/T155141) (owner: 10Mforns) [18:25:59] 2017-01-26 18:09:15,920 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Unable to connect to Zookeeper [18:26:03] 2017-01-26 18:09:15,920 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Ignore watcher event type: None with state:Disconnected for path:null from old session [18:26:07] 2017-01-26 18:09:15,920 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down [18:26:10] 2017-01-26 18:09:15,930 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: RMStateStore has been fenced [18:26:13] 2017-01-26 18:09:15,930 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning RM to Standby mode [18:26:16] 2017-01-26 18:09:15,930 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning to standby state [18:26:19] ottomata,joal --^ [18:26:22] an1001 is in row C, but not behind the same switch that was rebooted [18:26:42] but Brandon saw issues in other caching hosts on RowC too not behind the rebooted switch too [18:26:50] hm [18:30:09] confirmed by Faidon, the whole Row C switch stack got impacted for a bit [18:30:30] k [18:30:46] elukey: solved? [18:30:55] yep now everything is fine [18:31:23] elukey: We should traition back 1002->1001 - yarn UI not accessible anymore [18:31:35] elukey: please :) [18:31:41] PROBLEM - Hadoop NodeManager on analytics1028 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:31:43] PROBLEM - Hadoop NodeManager on analytics1030 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:31:51] PROBLEM - Hadoop NodeManager on analytics1029 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:31:59] mmmmm [18:32:10] hmmM! [18:32:13] they still doing stuff? [18:32:41] PROBLEM - Hadoop NodeManager on analytics1031 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:32:44] ah my fault! [18:32:53] ? [18:33:11] probably the NM didn't come up by itself, or it is waiting for puppet to run.. 
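The log excerpt above shows the active ResourceManager fencing itself after losing Zookeeper and handing over to the standby (an1002). Checking which RM is active, and failing back later, looks roughly like this; the rm-ids ("an1001-eqiad-wmf", "an1002-eqiad-wmf") are placeholders for whatever is configured in yarn-site.xml.

```bash
# Which ResourceManager is currently active? (rm-ids are placeholders)
sudo -u yarn yarn rmadmin -getServiceState an1001-eqiad-wmf
sudo -u yarn yarn rmadmin -getServiceState an1002-eqiad-wmf
# Manual fail-back to the usual primary once it is healthy again:
sudo -u yarn yarn rmadmin -transitionToActive --forcemanual an1001-eqiad-wmf
```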
I increased the retries timeout for the NM alarms [18:33:57] ottomata: I am fixing it [18:34:41] RECOVERY - Hadoop NodeManager on analytics1028 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:36:59] joal: let me fix this and then I'll do the failover [18:37:41] RECOVERY - Hadoop NodeManager on analytics1031 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:37:43] RECOVERY - Hadoop NodeManager on analytics1030 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:37:58] elukey: I don't understand why NM went down :( [18:38:17] probably it lost connectivity and decided to shutdown [18:38:21] checking [18:38:51] RECOVERY - Hadoop NodeManager on analytics1029 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:39:53] elukey: I'll take care of restarting the oozie failed stuff [18:40:53] joal, I can restart the FR druid loading [18:41:03] thx [18:41:17] joal: yeah basically there are some blerghs in the logs and then a shutdown (checked an1028) [18:41:19] jobs failed because of this? [18:41:23] mforns: FR druid is not production, nonexistant to me ;) [18:41:31] yes ottomata [18:41:31] xD [18:41:58] ottomata: applications having their appMaster on crashed NM failed [18:42:14] hm [18:42:17] on nodemanager [18:42:18] sorry [18:42:19] yeah ok [18:42:29] hm [18:43:01] ottomata: there currently is more than 10 jobs running - having a single failure out of a shake like that is not that bad ;) [18:43:25] yep :) [18:48:45] elukey: looks like ETA 1.5 days [18:49:07] the rate seems a little low [18:49:07] 10Analytics, 10ChangeProp, 10Citoid, 10ContentTranslation, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2973625 (10mobrovac) 05Open>03Resolved All of the services but Maps have been upgraded to Node 6, so I'm declaring victory here. Thanks to everyone that helped!... [18:54:11] fdans, ottomata : last patch of EL UA changes on beta looking good. [18:55:36] cool [18:56:10] 10Analytics, 10CirrusSearch, 06Discovery, 06Discovery-Search: Load cirrussearch data into druid - https://phabricator.wikimedia.org/T156037#2973692 (10EBernhardson) I checked a week worth of our data, we are talking about 160 to 180 million lines per day currently. This will increase as we ship out the sis... [18:56:46] /msg fdans on irc you can follow metrics meeting chat on #wikimedia-staff (private channel) [18:57:07] sorry fdans : on irc you can follow metrics meeting chat on #wikimedia-staff (private channel) [18:58:22] nuria: I need to be invited or something to that channel right? [18:58:23] 10Analytics, 10ChangeProp, 10Citoid, 10ContentTranslation, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2973700 (10akosiaris) 05Resolved>03Open Actually, there's etherpad left. I 'll do the upgrade tomorrow though :-). Reopening in the meantime [18:58:45] glad the changes are 👌🏼 [18:59:12] fdans: wait, is #wikimedia-office, please try to join [18:59:21] yeah I'm in :) [19:02:28] urandom: not in a real hurry, 1.5 days seems not that bad [19:02:55] urandom: for the super awesomeness I need to run cassandra-streams -nt nodetool-a right? 
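The NodeManager recoveries above came from restarting the daemons on the affected workers (analytics1028-1031) after they shut themselves down on losing connectivity. A minimal version of that, with a sanity check that the node rejoined; the unit name follows the CDH packaging and is an assumption here.

```bash
sudo systemctl restart hadoop-yarn-nodemanager
sudo systemctl status hadoop-yarn-nodemanager
# Confirm the worker shows up as RUNNING again from the ResourceManager's view:
yarn node -list -all | grep -i analytics1028
```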
[19:03:09] elukey: yeah [19:03:35] sorry joal, was afk, back now [19:04:04] urandom: niceeeeeee [19:04:16] zareen: in meeting, and will be gone after - The point is, your requests are too big [19:04:23] elukey: :) [19:04:40] urandom: it is really coincise and clear [19:05:03] joal: on aqs1007.eqiad.wmnet run cassandra-streams -nt nodetool-a [19:06:32] awesome elukey :) [19:07:02] I am looking forward to test the other commands [19:07:03] zareen: we can help you trim down your selects, looks like you are swaping through too much data [19:07:22] joal: okay, i was planning to test out the sampling method that milimetric sent earlier this week. [19:07:31] zareen: we normally run tests on samller slivers of data [19:07:34] *smaller [19:07:44] joal: all good if we failover an1002 to an1001? [19:07:58] let's go elukey, will be easier to monitor ;) [19:08:03] zareen: you can also reduce the timeperiod you are swapping [19:09:08] zareen: other way is the one we discuss - extract portions (do one big job) then smaller ones [19:09:25] elukey: try: c-any-nt status -r [19:09:54] zareen: One huge job a day is too much for the cluster :) [19:10:03] nuria: right, i've tested the queries to make sure they give the expected results, but am now generating the actual data to explore [19:10:07] for those occasions when you don't care which instance you run against, and/or you need something portable (like for a script or something), that won't be broken by different naming conventions [19:10:16] nice! [19:10:29] joal: proceeding with RM! [19:10:36] elukey: thanks :) [19:10:45] elukey: c-foreach-nt iterates over the instances instead, but that'll be less interesting on 1007 for obvious reasons :) [19:11:16] :D [19:11:31] elukey: https://wikitech.wikimedia.org/wiki/Cassandra#cassandra-tools-wmf [19:11:37] *actual* documentation [19:11:53] not claiming it's good, but actual in the sense that it does exist [19:12:19] zareen: Could we look at select to see how to improve it so it actually runs? [19:13:13] joal: yes, i discussed this with HaeB and we decided it's something we should implement, but right now as i'm still exploring the data (and the fields i'm using may change) we wanted to hold off [19:13:37] zareen: fair, but exploration can't happen on 50Tb :) [19:13:44] zareen: with the amounts of data we handle that might not be an option [19:14:38] zareen: not every select you can think of is "runnable", makes sense? [19:14:45] joal: sure, that's a valid point [19:16:03] joal nuria: i'll discuss this will HaeB today, but will hold off on running any other large queries [19:16:42] thanks zareen :) [19:16:47] aren't you folks attending the metrics meeting? [19:17:04] HaeB: I listen and write at the same time :) [19:17:07] ..milimetric is presenting right now on your team's work [19:17:29] zareen: Thanks ! 
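On the query-size thread: the advice above (explore a small sliver before running over months of webrequest) translates into pruning to a handful of partitions and, if needed, sampling on top of that. A hedged sketch with illustrative field choices, not zareen's actual query:

```bash
# Restrict to one hour's partitions of one webrequest source, and sample ~1%
# of the rows for exploration; widen only once the fields of interest settle.
hive -e "
  SELECT uri_host, agent_type, COUNT(*) AS requests
  FROM wmf.webrequest TABLESAMPLE(BUCKET 1 OUT OF 100 ON rand()) w
  WHERE webrequest_source = 'text'
    AND year = 2017 AND month = 1 AND day = 26 AND hour = 14
  GROUP BY uri_host, agent_type
  ORDER BY requests DESC
  LIMIT 20;
"
```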
Don't hesitate to come to us, we (hopefully) can help [19:18:11] joal: an1001 is now both RM and HDFS master [19:18:17] !log restored an1001 as RM and HDFS master [19:18:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:18:26] yarn.w.o works fine [19:18:32] joal: thanks, will definitely reach out if needed :) [19:18:49] thanks elukey :) [19:20:45] !log Restart webrequest-lood-coord-text 2017-01-26T15:00 after cluster shake [19:20:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:21:38] joal: if everything is ok I'd go home :) [19:22:37] (logging off now and checking in ~30 mins just to be sure :) [19:22:54] bye elukey :) [19:34:15] 06Analytics-Kanban, 13Patch-For-Review: Run a 1-off sqoop over the new labsdb servers - https://phabricator.wikimedia.org/T155658#2973803 (10JAllemandou) It works ! Some takeovers: - Only two differences in one schema: archive table misses ar_content_format and ar_content_model - Some projects not available (... [19:40:53] 06Analytics-Kanban, 10Fundraising-Backlog, 13Patch-For-Review: Productionize banner impressions druid/pivot dataset - https://phabricator.wikimedia.org/T155141#2973811 (10AndyRussG) Hi, all! Thanks so much once again for working on this!!!! :D Here are some notes and questions: - Region data is important. I... [19:48:07] 10Analytics, 10CirrusSearch, 06Discovery, 06Discovery-Search: Load cirrussearch data into druid - https://phabricator.wikimedia.org/T156037#2973872 (10JAllemandou) Thanks for the answers :) Two things I forgot (which kinda have an importance) : - What smaller-time granularity are you willing to be able to... [19:51:41] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 06Services (doing), 15User-mobrovac: Compute the tending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#2973889 (10mobrovac) [19:51:59] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 06Services (doing), 15User-mobrovac: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#2973903 (10mobrovac) [19:53:55] mforns: Just did a quick check of shard size for the new dataset in druid: very different from the first we loaded, and from the real-time ones [19:54:09] mforns: Let's discuss / review that tomorrow :) [19:54:09] joal, ? [19:54:29] joal, I'm loading the monthly job with 2 shards right now, should I interrupt it? [19:54:39] mforns: no, let's wait and see [19:54:42] ok [19:55:01] 10Analytics, 10EventBus, 10Reading-Web-Trending-Service, 06Services (doing), 15User-mobrovac: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411#2973918 (10Jdlrobson) w00t [19:55:14] mforns: daily (2016-12-01, 2, 3 are between 70 and 80Gb [19:56:14] joal, ??????? [19:56:17] yessir [19:56:22] weird, huh ? [19:56:58] joal, we checked those in WMDS and they were 14 MB each day (hourly res) [19:57:21] WMDS? [19:57:28] I imagine having them minutely, multiplies by 100-200? [19:57:42] Dev Summit [19:57:59] but not until 80GB! [19:58:19] mforns: right - seems bizarre - realtime data (basically the same, minute oriented, daily segments) are ~7Mb ! [19:58:35] mforns: we should check the difference of volume though [19:58:48] from december to january? [19:59:02] it's like x4 [19:59:04] mforns: It's probably legitimate that end of January banner are 10 times less than end of deceber :) [19:59:09] but not a lot more... 
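The !log above ("webrequest-lood-coord-text" is presumably a typo for webrequest-load-coord-text) corresponds to re-running the coordinator action for the 15:00 hour that failed during the cluster shake. A sketch of the usual Oozie CLI incantation, with a placeholder coordinator id:

```bash
# Find the coordinator, then re-run the hour that failed.
oozie jobs -jobtype coordinator -filter 'status=RUNNING' | grep webrequest-load-coord-text
oozie job -rerun 0001234-170101000000000-oozie-oozi-C \
  -date 2017-01-26T15:00Z::2017-01-26T15:00Z   # placeholder coordinator id
```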
[19:59:55] mforns: going for diner - let's catch back on that tomrrow :) [19:59:57] 10x OK, 10000x O.O [20:00:03] ok, see you tomorrow! [20:00:34] mforns: OOOH, Just realized: it is 70-80MB, not GB !! [20:00:42] Didin't notice my typo [20:00:43] joal, aaaaaaaaaaah!!!!!!! [20:00:47] sorry for that [20:00:50] OK OK, fiuuuu [20:00:59] yes yes, way better :) [20:01:02] this makes sense [20:01:08] ok cool, you can sleep gently ;) [20:01:11] hehe [20:01:30] tomorrow a-team, and kudos again to milimetric for the awesome talk :) [20:01:37] good night joal :] [20:01:39] nite [20:01:41] thx [20:02:11] people loved it in IRC, I'm reading the backlog, great job everyone (literally everyone) [20:02:25] including alumni [20:02:35] this is years of work [20:07:49] :) [20:24:19] 06Analytics-Kanban, 10Fundraising-Backlog, 13Patch-For-Review: Productionize banner impressions druid/pivot dataset - https://phabricator.wikimedia.org/T155141#2974065 (10AndyRussG) Mmm it seems I may have spoken too quickly about region vs. minutely, apologies... If it turns out that it's necessary to choos... [20:25:08] joal: for context, does the query that updates the unique devices numbers (processsing the same data as in zareen's query) use TABLESAMPLE, and if yes, at which sample rate? [20:28:00] we're definitely aware that it's a lot of data (zareen and i discussed this weeks ago already), and of course don't want to bring down the server. OTOH, as she said, this is only exploratory work so far and there needs to be a balance between optimizing work and the server time saved [20:30:37] say there's a one-off query we can shorten from 8h to 2h wall clock time with some optimizing work - is that worth two hour's worth of work by zareen (or other analysts, for that matter)? i think not [20:31:32] on the other hand, once we productionize a metric with a daily oozie job, say, that's of course worth quite some though on how to minimize resource usage [20:35:03] 10Analytics, 10CirrusSearch, 06Discovery, 06Discovery-Search: Load cirrussearch data into druid - https://phabricator.wikimedia.org/T156037#2974076 (10EBernhardson) For time granularity, i think daily is probably sufficient for all or our use cases. If it makes a big difference in data size we could perhap... [21:11:59] that's me for today, night! (awesome job on the prez milimetric) [21:12:51] prez as in presentation, not as in Roland "Prez" Pryzbylewski from The Wire [21:12:53] night! [22:05:02] nuria: do I need to replace all instances of your username in https://gerrit.wikimedia.org/r/#/c/327845/16/oozie/maps/druid/README.md with mine? [22:07:37] 06Analytics-Kanban, 13Patch-For-Review: Clean up datasets.wikimedia.org - https://phabricator.wikimedia.org/T125854#2974327 (10Ottomata) I talked with @akosiaris this morning, and he suggested I try to do some fanciness with deleting the destination directory and hardlinks. I think I got something! https://g... [22:08:02] milimetric: still around? [22:08:39] yes, hi ottomata [22:09:23] batcave real quick about datasets? [22:10:05] yes, omw [22:36:45] 06Analytics-Kanban, 13Patch-For-Review: Clean up datasets.wikimedia.org - https://phabricator.wikimedia.org/T125854#2974379 (10Ottomata) Talked with dan again, and we decided that since we are doing T132594 anyway, we might as well make things easy on ourselves and create a brand new structure at analytics.wik... 
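On the datasets-sync note above (the "fanciness with deleting the destination directory and hardlinks"): the general pattern being pointed at is building a cheap hardlinked staging copy and letting rsync --delete mirror it. Everything below (paths, rsync module) is a guess at the idea, not the contents of the linked puppet change.

```bash
# Build a hardlinked snapshot of the published data (fast, no extra disk for
# file contents), then mirror it with --delete so removed files disappear too.
src=/srv/published-datasets             # placeholder path
stage=/srv/published-datasets.staging   # placeholder path
rm -rf "$stage" && cp -al "$src" "$stage"
rsync -a --delete "$stage"/ rsync://thorium.eqiad.wmnet/published-datasets/   # placeholder module
```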
[22:44:48] ottomata: this will do fine for a starter README: [22:44:50] https://www.irccloud.com/pastebin/ve82KxB9/ [23:18:18] 10Analytics, 10CirrusSearch, 06Discovery, 06Discovery-Search: Load cirrussearch data into druid - https://phabricator.wikimedia.org/T156037#2974534 (10debt) p:05Triage>03Normal [23:20:33] bearloga: sorry, i missed your ping, still There? [23:20:59] nuria: no worries! and yep, still here :) [23:21:54] 06Analytics-Kanban, 07Easy, 13Patch-For-Review: Add monthly request stats per article title to pageview api - https://phabricator.wikimedia.org/T139934#2974548 (10MusikAnimal) [23:22:19] 06Analytics-Kanban, 10Pageviews-API: Monthly aggregate endpoint returns unexpected results and invalid timestamp - https://phabricator.wikimedia.org/T156312#2970596 (10MusikAnimal) [23:24:24] HaeB: I think it would help to document queries somewhere so we can talk about concrete improvements. Exploratory work doesn't likely need 50tb to be "explored" (cc zareen) as it is likely that user queries that swap so much data might not finish at all. [23:26:44] HaeB: things to look into (maybe helpful, hard to know w/o looking at specifics), https://cwiki.apache.org/confluence/display/Hive/Common+Table+Expression [23:28:25] HaeB: useful when a dataset is reused within the same query (cc zareen ) . See calculations for last access uniques: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/last_access_uniques/daily/last_access_uniques_daily.hql [23:32:16] HaeB: also, two months of data is probably two much to parse at any one time (most likely query would fail, although not sure 100%) so i would not try to go parse more than 1 month ad most and would try to explore smaller time ranges first. [23:38:16] 10Analytics-Tech-community-metrics, 06Developer-Relations (Jan-Mar-2017): Merge detached SCM and SCR identities in korma DB if they have the same email address but not the same uuid - https://phabricator.wikimedia.org/T156283#2974657 (10Aklapper) 05Open>03Resolved Alright. I expected some more here but on...
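The Common Table Expression pattern nuria links to, in miniature: name the expensive slice once and aggregate from it, combined with the "smaller time range" advice. It pays off most when the named slice is referenced more than once in the same query. Column choices below are illustrative, not taken from the queries under discussion.

```bash
hive -e "
  WITH text_hour AS (
    SELECT uri_host, agent_type
    FROM wmf.webrequest
    WHERE webrequest_source = 'text'
      AND year = 2017 AND month = 1 AND day = 25 AND hour = 12
  )
  SELECT uri_host,
         COUNT(*)                           AS all_requests,
         SUM(IF(agent_type = 'user', 1, 0)) AS user_requests
  FROM text_hour
  GROUP BY uri_host
  ORDER BY all_requests DESC
  LIMIT 20;
"
```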