[07:47:05] good morning to you oozie [07:47:12] always a pleasure to read emails from you [07:47:14] :/ [07:48:52] !log created 0038054-160922102909979-oozie-oozi-C to re-run webrequest-load-check_sequence_statistics-wf-upload-2016-10-20-3 (oozie errors) [07:49:40] !log created 0038058-160922102909979-oozie-oozi-C to re-run webrequest-load-check_sequence_statistics-wf-upload-2016-10-20-5 (oozie errors) [08:10:14] joal: morninggg [08:10:28] we need to reboot $everything [08:11:27] except our routers :-) [08:11:41] let's reboot them too! :P [08:12:18] moritzm: I am going to send an email also for the stat boxes [08:12:35] because the last time I broke several people's jobs [08:12:41] and they were not super happy [08:12:47] ok, thanks. currently in the process of installing fixed kernels on those [08:25:20] elukey: Hiiii !b [08:25:45] elukey: please rebott me as well, I'd feel better with some kernel updates ;) [08:26:40] hahahahah [08:26:57] so I am going to start with AQS and EventLogging [08:27:02] K [08:27:10] elukey: Can I help for anything? [08:27:10] then I'll send an email to engineering@ for stat boxes [08:27:37] joal: if you could keep an eye on metrics and let me know if you see fire it would be great [08:27:46] elukey: sure :) [08:27:48] or if you see me rebooting things in a weird order [08:27:49] :D [08:28:02] like "rebooting analytics100[12] [08:28:21] huhuhu [08:28:21] :) [08:28:33] ok elukey, will keep an eye open [08:28:47] elukey: or let's say, as much open as it can be ! [08:29:55] :D [08:37:20] joal: I suspended new-cassandra-bundle just to be sure.. all the coordinators are stopped, but I feel better with it stopped [08:37:38] elukey: cool ! [08:38:07] elukey: I'll also put a reminder to myself to restart this bundle from prod code at the beginning of the month [08:46:58] elukey: I need to trop off for about an hour [08:47:07] elukey: sorry :( [08:48:01] joal: it is super fine! I am going to do aqs and EL during the next 30 mins, the hadoop cluster is battle tested :D [08:48:27] this is the first time that we are rebooting the new aqs cluster so I am a bit paranoid [08:48:50] aqs1004 is up and running after the reboot, nodetool is happy [08:52:20] https://grafana.wikimedia.org/dashboard/db/aqs-elukey?from=now-3h&to=now - it seems that it didn't go super smooth [08:52:57] but yeah only a bit of impact, probably ongoing connections [08:53:08] even if I didn't see them with tcpdump [08:53:14] will wait a bit more for aqs1005 [08:54:18] ok while waiting I am going to reboot eventlog1001 [08:57:34] (CR) Hoo man: WikidataArticlePlaceholderMetrics also send search referral data (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/305989 (https://phabricator.wikimedia.org/T142955) (owner: Addshore) [09:07:34] Eventlogging looks good [09:07:49] Going to proceed with aqs1005 [09:26:35] cassandra up on aqs1005, all good so far [09:31:33] moritzm: how soon should we restart stat100[234] ? I am preparing the email to send and I'd like to set a dealine (I am also trying to contact people owning screen/tmux sessions) [09:37:01] how about tomorrow morning? that should also give the SF staff a chance to react [09:44:57] elukey: Back ! [09:45:14] elukey: from the 24h graph I look at, aqs seems fine :) [09:45:33] moritzm: good to me, I just sent a personal email to all the tmux/screen session owner on stat100[234] [09:45:43] I'll also send an email to engineering [09:46:04] joal: super thanks! 
aqs1006 is the only one left, doing it now [09:46:48] elukey: p99 latency has a bump when you reboot, but that's really a big deal (as you said, probably ongoing connections [09:48:17] yes some timeouts or something similar cause this, we can't really avoid it [09:56:58] aqs1006 done [09:57:00] and pooled [09:57:22] awesome elukey :) Thanks for doing this ! [09:57:31] :) [09:57:39] now the fun begins, hadoop :) [09:57:43] hehe [09:58:07] Analytics: Some recent ExternalLinksChange data lost - https://phabricator.wikimedia.org/T146815#2730905 (Samwalton9) Open>Resolved a:Samwalton9 Just going to close this in favour of discussing ongoing issues at T115119, since it does seem like this is a direct schema issue. [09:58:56] elukey: we can try to prevent failing job by checking application masters [09:59:18] yep yep good idea! [09:59:23] I have suspended the bundles [09:59:23] cool [09:59:32] and will wait a bit before starting [09:59:40] going to prepare the email to announce the restarts [09:59:40] elukey: Have you stopped camus? [09:59:52] nope, will do it in a sec [09:59:56] ok :) [10:00:31] it is running atm, but I'll disable the cron in the meantime [10:00:44] elukey: That's the way to go :) [10:00:57] elukey: Recall we should do it via puppet to prevent the bug we had last time ;) [10:02:02] Analytics, MediaWiki-extensions-WikimediaEvents, The-Wikipedia-Library, Wikimedia-General-or-Unknown, and 2 others: Implement Schema:ExternalLinksChange - https://phabricator.wikimedia.org/T115119#2730912 (Samwalton9) Looks like we're running into some problems, but it's hard to pinpoint why. I'v... [10:02:09] so there are two ways [10:02:21] you can comment and then uncomment paying attention on the next puppet run [10:02:28] or just comment, wipe the crontab, run puppet [10:02:53] I am extra careful after that issue [10:02:57] :) [10:03:23] elukey: I completely trust you :) [10:04:48] !log stopped camus on an1027 and all the oozie bundles as prep step for the reboots of analytics* [10:22:39] mmm isn't wiki-research-l@lists.wikimedia.org the right email for research? [10:22:51] yes [10:23:00] hello! [10:23:02] hi :) [10:23:05] I got bounced [10:23:13] you need to be on the list [10:23:23] ahhh makes sense [10:23:30] https://lists.wikimedia.org/mailman/listinfo/wiki-research-l [10:23:53] (I think I just got added this year) [10:25:34] ok email sent [10:25:49] this time I am alerting people with screen sessions on stat boxes, engineering, analytics, research [10:26:03] should be enough :D [10:26:17] yes [10:26:50] I don't know if you really need to take the trouble to look for screen sessions, but that's very nice of you [10:27:29] elukey: 6 jobs runnin [10:27:41] yep I am watching Yarn [10:27:48] for the moment I have only sent emails :P [10:27:49] Do you want us to move or to wait a bit more ? (4 prod) [10:27:53] :) [10:28:09] I think that we can wait a bit [10:28:17] there's no real hurry [10:28:24] I need also to plan kafka reboots [10:28:45] that implies also the main clusters [10:28:48] mobrovac: --^ [10:30:05] kafka reboots for the kernel updates? [10:32:01] yeppa [10:34:34] when is that happening elukey? 
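    [note] The per-node sequence behind the AQS reboots above (aqs1004-1006) is roughly the following. This is a sketch rather than the exact commands used: the depool/pool step is assumed to be whatever conftool/LVS wrapper is standard on the host, and timing/ordering is simplified.
        # on each aqs100x node, one at a time
        depool                      # assumption: standard depool wrapper on the host
        nodetool drain              # flush memtables and stop accepting traffic cleanly
        sudo reboot
        # once the node is back up:
        nodetool status             # all nodes should report UN (Up/Normal)
        pool                        # repool only after Cassandra and AQS look healthy
    As noted in the chat, a small p99 latency bump from dropped in-flight connections is expected even with a clean drain.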
[10:34:52] mobrovac: whenever you prefer but asap [10:35:04] I pinged you to ask how/when you want to do it [10:35:09] ah i see [10:35:18] elukey: let's do it now, starting with codfw [10:35:25] sure [10:36:09] you restart both boxes, and i'll restart changeprop after that to be on the safe side [10:36:35] elukey: after you restart and make sure that kafka is up, we'll also need to ensure the eventbus proxy service is up and kicking [10:44:42] yep! [10:50:54] elukey: eta? [10:51:33] now :) [10:52:04] doing kafka2001 in 2 mins [11:01:21] done, proceeding with kafka2002 [11:06:20] mobrovac: main-codfw done [11:06:29] ok, restarting [11:16:06] starting Hadoop restarts! [11:16:11] elukey: cool [11:16:22] elukey: let me tell you which nodes not to touch yet please ;) [11:16:34] already checked app masters :) [11:16:44] ok, you're faster than I am :) [11:16:53] 1034/1037 and 104somethingthatIdon'tremember [11:19:39] elukey: oh you're not continuing with the main kafka eqiad cluster straightaway ? [11:19:58] if that's the case, i'm going to have lunch [11:20:22] mobrovac: let's do it this afternoon ok? [11:20:34] w4m [11:20:43] super, enjoy lunch :) [11:21:07] grazie! [11:25:01] elukey: 1034 and 1037 have finished, you can restart them :) [11:25:59] gooooood [11:26:25] TestEditHistoryRunner seems on 1034 though [11:26:29] elukey: but we're gonna have trouble with 1047: ellery's job is huge and not yet finsh [11:27:09] aouch, really ? [11:27:19] elukey: Thanks for having double Checked ! [11:27:38] I'll skip it, we can leave a couple of nodes behind [11:27:39] :) [11:27:41] elukey: Sparj tellws me an IP, not a name [11:32:15] super paranoid with journalnodes, the last time analytics1001 stopped :P [11:32:59] (CR) Joal: "One comment" (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/316845 (https://phabricator.wikimedia.org/T130249) (owner: Nuria) [11:37:00] elukey: my job dies, you can restart 1034 ;) [11:39:28] hi a-team, are you firefighting? can I help? [11:40:06] mforns: hola! We are rebooting all the things for a kernel upgrade, nothing on fire [11:40:12] only boring :( [11:40:29] elukey is on fire, yes ! He restarts nodes since teh early morning ! [11:40:33] elukey, I though it was tomorrow [11:40:38] xD [11:40:39] Give the man a break ;) [11:41:06] (CR) Joal: "One comment in file." (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/305989 (https://phabricator.wikimedia.org/T142955) (owner: Addshore) [11:41:32] mforns: :P :P :P only the stat nodes! [11:41:41] otherwise people will kill me [11:41:45] ah ok ok [11:41:58] (CR) Joal: [C: -1] "Same problem as usual: spaces instead of tabs." [analytics/refinery] - https://gerrit.wikimedia.org/r/316838 (https://phabricator.wikimedia.org/T130249) (owner: Nuria) [11:43:11] (CR) Joal: "@nuria: Waiting for the java code to be deployed in a new jar before being able to dpeloy that patch (with potential changes in jar number" [analytics/refinery] - https://gerrit.wikimedia.org/r/315241 (https://phabricator.wikimedia.org/T147841) (owner: Joal) [11:44:24] elukey: have you restarted 1034? [11:44:28] Can I relaunch my spark? [11:45:04] it is in progress, finishing in a bit :) [11:45:14] elukey: ok great, thanks :) [11:46:16] joal: done! [11:46:35] Thanks elukey :) [11:49:08] joal: Is ellery using Hive right? 
I was wondering if I could have rebooted an1003 but probably it will need to be postponed [11:49:28] elukey: You're right [11:50:11] maybe I can restart an1027 in the meantime [11:51:47] elukey: an1027 is camu machine, right? [11:52:04] yep! [11:54:55] ouch joal sorry, I didn't see that the spark job was on an1043 :( [11:55:24] elukey: that is life :) [11:55:41] elukey: test jobs, no big deal [11:56:53] new one is on 1043 so I will not destroy it anymore :) [12:10:59] ok except 100[12], 1047 and 1003 we are good [12:11:27] I'd be super happy to do 1003 and 1047 together before re-enabling the whole thing [12:14:38] ok so I am going to grab a bite very quickly and then we can decide what to do [12:15:05] we could think about re-enabling everything and then do 1003/1047 tomorrow [12:17:48] * elukey lunch! [12:24:25] mforns: Heya, are you somewhere around? [12:30:24] joal, hey! [12:30:28] what's up? [12:30:46] mforns: would you have a minute exchanging on some candidate? [12:30:54] joal, sure! [12:30:59] batcave [12:31:03] OMW [12:33:46] mforns: I'm fighting with the test setup, awful stuff, but I have a meeting with Erik in 30, sorry to leave you hanging with that refactor [12:35:05] milimetric, np, I've been doing task reviews [12:37:26] goooood ellery's job has finished [12:37:45] spawned another one on 1057 [12:38:12] proceeding with 1047 and 1003 then [12:43:42] milimetric, will be afk for lunch, see you in a bit [12:44:34] elukey: when back, ellery job is finished ! [12:44:48] you can restart :) [12:45:15] joal: already done 1003 and 1047 :) [12:45:26] everything seems good to me now [12:45:28] mind to double check? [12:45:32] 1057 is ok [12:45:43] elukey: I'm running spark, and things look ok [12:46:02] even hue is no complaining, now mysql on 1003 comes up nicely [12:47:03] elukey: Yay ! this is a complete success full reboot :) [12:47:22] elukey: less than a day, and nothing complains :) [12:48:55] elukey: And I can see that other jobs have not been restarted yet, full cluster for my own little tests ;) [12:51:22] \o/ [12:51:30] joal: shall we reboot also 100[12]? [12:51:37] I can surely do 1002 now [12:51:45] elukey: please go ahead with 2 [12:53:31] doing it now [12:57:09] completed [12:57:30] we can do 1001 tomorrow or now [12:57:43] joal: --^ [12:57:53] please go ahead [12:57:59] elukey: Better with everything done [12:58:14] elukey: jobs are not running, no risk , let's do it [12:58:31] yeah same thought [13:03:49] 1002 is the new master now [13:04:55] rebooting 1001 [13:09:23] Analytics-Visualization: Bug with sorting in some pages due to string sorting instead of numerical sorting - https://phabricator.wikimedia.org/T147749#2731345 (Samwalton9) Seems unrelated to TWL. [13:10:32] elukey: not yet rebooted I guess [13:10:33] all right forcing the failover again to 1001 [13:10:40] elukey: same timing: ) [13:13:18] elukey: Thanks mate, back on track :) [13:13:37] completed! [13:13:48] !log re-enabling oozie and camus after cluster reboot [13:14:20] elukey: arf, I can't keep all the computing resources for only myself ? :( [13:14:46] elukey: I really need to share with damn oozie boy? [13:14:48] :D :D :D [13:14:53] you know oozie [13:14:57] he complains otherwise! 
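    [note] The "check application masters" step mentioned above, i.e. finding which workers are hosting ApplicationMasters before rebooting them, can be approximated with the stock YARN CLI. A sketch; the awk/grep parsing is illustrative and the exact report fields depend on the Hadoop version:
        # list running applications, then find which worker hosts each ApplicationMaster
        yarn application -list -appStates RUNNING
        for app in $(yarn application -list -appStates RUNNING 2>/dev/null \
                     | awk '/^application_/ {print $1}'); do
            yarn application -status "$app" | grep -E 'Application-Id|AM Host'
        done
        # workers not hosting an AM are the safer ones to reboot first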
[13:15:06] He has already started [13:15:08] ;) [13:21:58] elukey: I guess you see my spark job not willing to release resources ;) [13:26:02] :P [14:03:24] joal: https://yarn.wikimedia.org/proxy/application_1476969128131_0063/ :) [14:03:42] elukey: This makes my day :) [14:04:03] it is still live hacking but I found the issue [14:04:14] * joal sends a huge cookie to elukey :) [14:07:27] elukey: what was it? [14:08:08] ottomata: o/ https://httpd.apache.org/docs/2.4/mod/mod_proxy_html.html#comment_3329 [14:08:34] I tried to make a tunnel from my localhost:8088 to stat1001 port 80 [14:08:43] and chrome failed for a decode error [14:09:05] and then I remembered about the comment [14:10:38] hM! [14:10:51] elukey: also, thanks for reboots, lemme know if you need any help with those [14:11:51] ottomata: first time that a Hadoop reboot goes smoothly, I am writing it in my calendar :) [14:12:08] I left out kafka's main-eqiad and analytics [14:12:26] but I can do main-eqiad tomorrow with Marko [14:12:43] ok [14:13:08] so if you have time to do kafka analytics during your daytime I'll complete the work tomorrow morning CEST [14:13:19] ok, i should have time today [14:13:22] ah also you may want to check kafka2001 [14:13:24] i can start now even [14:13:25] oh? [14:13:31] because mirror maker stopped again.. [14:13:38] hm on 2001 [14:13:38] hm [14:13:48] did you just restart it? [14:13:51] (or puppet?) [14:13:52] yep! [14:13:54] hm ok [14:13:56] I did manually [14:13:59] ottomata: new kernels already installed on kafka*, only needs the rolling reboot [14:14:08] ok great moritzm thanks [14:20:24] !log starting rolling restart of analytics-eqiad kafka brokers to apply kernel update [14:21:25] elukey: ? do these graphs look weird to you? [14:21:26] https://grafana.wikimedia.org/dashboard/db/kafka?from=1476851936873&to=1476973229471 [14:22:21] woa [14:22:24] yes [14:22:29] jmxtrans weirdness? [14:22:32] i guess so [14:22:33] but [14:22:35] i dunno [14:22:49] beacuse if you hover over, you see brokers in the tooltip that aren't in the graph [14:22:55] so probably grafana weirdness? [14:23:13] but kafka1018 is not there [14:23:16] mmmmm [14:23:37] I recall that Riccardo showed an icinga alerts yesterday about missing datapoints for kafka1018 [14:23:38] true [14:23:44] but I forgot to follow up [14:24:03] hm [14:24:04] [20 Oct 2016 14:23:51] [ServerScheduler_Worker-8] 761746130 ERROR (com.googlecode.jmxtrans.jobs.ServerJob:41) - Error [14:24:04] java.nio.BufferOverflowException [14:24:09] kicking jmxtrans [14:25:26] ahhhh snap [14:25:40] joal: fix is now permanent, enjoy spark shells :) [14:25:54] YAY ! [14:26:00] :) [14:26:20] Analytics-Kanban: Make yarn.wikimedia.org correctly proxy to Spark UI - https://phabricator.wikimedia.org/T147927#2731520 (elukey) [14:27:37] ok well, elukey i'm proceeding with broker restarts, starting at 1012 [14:27:41] btw, the quarterly review meeting is in 30 minutes, the meeting is on the WMF staff calendar and I have the bluejeans link if you need (not sure it's ok to paste) [14:27:45] jmxtrans looks ok after restart [14:27:47] on 1018 [14:27:49] OH [14:27:57] ya [14:27:59] ok [14:28:04] so no standup today? [14:28:53] yes, no standup [14:29:25] nuria: do you want a pivot screenshot of the Chrome 41 bug? 
I find that one's fun to explain [14:32:33] just in case you do: [14:32:33] https://pivot.wikimedia.org/#pageviews-daily/line-chart/2/EQUQLgxg9AqgKgYWAGgN7APYAdgC5gQAWAhgJYB2KwApgB5YBO1Azs6RpbutnsEwGZVyxALbVeAfQlhSY4AF9kwYhBkdmeANroVazsApU6jFmw55uOfABtSYag2LWqANycBXcV2DMwxBmC8AEwADACMAGwAtCHRYQAscCEhuMmpIQB0ySEAWkbkACbB4dEhAJwxYUkpaclZyXmKwGAAnlhewHAAkgCyIBIASgCCAHIA4iAKijqq7PrEhUb0TKxzFphWBCSGSsYrZpyWvAJCoh3uxBIARgwYAO7MDhL8oqTWLQp [14:32:33] Kumve3+b4GBcDmsxBwu2Wph+RxsdgcTlcHi86EeYDgbQ6AGU4AMuuMjNZqGJyGANLhNMAEIRbnIALpNVrtXgYkBwKbyeR05DaGgQ1b/aF8aiCJTCOT4C7XW4PJ4iYgAKwwDE+PkVYCGs35yg1+mYqqWJj5hw2xyFpzFwBcpGodwkEAw7mJyoKpCY2t4BRYEGohQoAHMpmgeQaDusePgTiKzpJpLJxE0ru4IABrahqt2/dPNWMAIUTKcCSgK7kcel4AAUwgARZW6gLq0sZhsqgL6/ZQ43h02R81SGRyJrO11N4jML0+8j+9k05DkdzWaxKS3W232x1ci1Wm12h2Baeaaez+eB51EkPiy43e6P [14:32:34] BgSWUKpVKWwiOx4ACs8iAA [14:32:37] ew!!! [14:32:38] lol [14:32:56] https://goo.gl/8rA9to [14:37:06] thanks milimetric [14:37:30] milimetric, oh! I though this was the blue jeans link, hehehe [14:37:36] *thought [14:37:54] https://bluejeans.com/569999548/ [14:38:02] thx! :] [14:43:28] milimetric: what about tasking ? Are we skipping it as well? [14:43:33] or only standup? [14:47:08] elukey, I guess when qr ends, we'll fallback to tasking. nuria had the idea of using the first part of it to talk about how to improve the tasking meeting, but I may be wrong? [14:51:42] there is no wrong :) [14:51:59] we can resume tasking after the qr if everyone's ok with it [15:00:56] a-team: I am in batcave ... standup? [15:01:09] nuria: QR? [15:01:54] joal: our QR is at 10 , ops one right? [15:02:16] Oooh, right [15:02:36] ohhh [15:02:37] so standup yes? [15:02:43] ok joining xstandup [15:02:52] ok [15:03:21] milimetric: seems that current QR meeting is not ours [15:03:31] er? [15:03:38] nuria: are you still in batcave? [15:03:42] a-team: I sent you invites to our quaterly, did you get them, it is at 10 am PST [15:03:45] oooh yeah!! [15:03:48] it's at 1pm [15:04:00] cc ottomata elukey joal mforns [15:04:16] nuria I'm in it, thanks [15:04:37] mforns: it's the next one https://bluejeans.com/411326803/ [15:04:42] a-team: Let's do standup [15:04:50] nuria, oh! [15:04:52] ok [15:05:21] trying to join [15:05:58] need to reboot chrome, argh [15:06:13] mforns: you comin to standup? [15:09:46] Analytics, Commons, Multimedia, Tabular-Data, and 4 others: Review shared data namespace (tabular data) implementation - https://phabricator.wikimedia.org/T134426#2731622 (Yurik) [15:15:50] gah, elukey looks like disks got swapped aroudn on 1020 after reboot [15:15:51] doh! [15:16:00] argh [15:16:05] will do the uuid thing [15:39:03] milimetric, nuria, I see some kafka-related errors in eventloging:/srv/log/upstart/eventlogging_processor-client-side.00.log http://pastebin.com/fbL1Cy3W [15:39:38] cc ottomata thus the low throughput, maybe eventlogging is going to need a re-start too? [15:42:01] I can do that, won't hurt [15:42:30] hm, those are normal errors though [15:42:40] it should do that during every restart for a sec [15:42:46] can't hurt though [15:42:49] to restart el [15:42:53] i've still got a few brokers to reboot [15:42:56] ottomata: no leader for partition shouldn't happen no? 
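    [note] A quick way to confirm the analytics brokers are healthy again after each restart (before moving on to the next one) is to look for under-replicated partitions. This is a sketch using the stock Kafka CLI tools with a placeholder ZooKeeper connect string, not necessarily the wrapper used in production:
        # should print nothing once every partition has its full replica set in sync
        kafka-topics.sh --describe --zookeeper <zookeeper-connect-string> \
            --under-replicated-partitions

        # the consumer-side errors seen in the eventlogging processor log should also stop
        tail -f /srv/log/upstart/eventlogging_processor-client-side.00.log | grep -i NotLeaderForPartition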
[15:43:06] 'NotLeaderForPartitionError' [15:43:08] not 'no' [15:43:08] :) [15:43:13] ahhhhh sorry [15:43:15] :) [15:43:40] !log restarted EventLogging after throughput drop [16:00:19] Analytics-Kanban, Operations, Performance-Team, Reading-Admin, Traffic: Preliminary Design document for A/B testing - https://phabricator.wikimedia.org/T143694#2731863 (dr0ptp4kt) [16:06:18] lzia: what do I need to do to make the tool labs survey form live before I send out the emails? [16:10:26] Analytics, Analytics-Cluster: Audit fstabs on Kafka and Hadoop nodes to use UUIDs instead of /dev paths - https://phabricator.wikimedia.org/T147879#2731893 (Ottomata) Did some Kafka reboots today and ran into the issue where /dev numbers are rearranged after reboot. I had to manually use UUIDs in fstab... [16:15:32] Analytics, Commons, Multimedia, Tabular-Data, and 3 others: Allow structured datasets on a central repository (CSV, TSV, JSON, GeoJSON, XML, ...) - https://phabricator.wikimedia.org/T120452#2731897 (Yurik) [16:17:17] done with reboots, gonna kcik el one last time [16:17:37] !log restarting eventlogging after rebooting kafka brokers [16:28:47] (CR) Nuria: Enhancing regex to support pageviews to non-knowledge wikis (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/316845 (https://phabricator.wikimedia.org/T130249) (owner: Nuria) [16:47:50] ottomata, EL looks good in general, but it seems that the kafka restarts affected the sending of the metrics to grafana [16:49:08] and restarting EL didn't fix it [16:53:12] mfyeah something is weird with the kafka dash metrics too [16:53:13] checking jmxtrans [16:57:09] hm yeah jmxtrans is having problems [17:01:18] vk seems fine now [17:05:47] hi bd808. let me check it. [17:06:30] lzia: thanks. I'm waiting on advice from ops-l on sending the emails. [17:06:39] but I hope to get them out today [17:06:49] !log created 0000294-161020124223818-oozie-oozi-C to re-run webrequest-load-check_sequence_statistics-wf-upload-2016-10-20-13 (oozie errors) [17:08:13] bd808: if you're using old Google Forms, you should o to Responses tab and then click on Not accepting responses [17:08:24] I just clicked on that and now it says Accepting responses, bd808 [17:08:56] bd808: if you want, you can also submit a test response just to be sure, but from what I can see, it's on now. :) [17:29:46] lzia: I filled in a test response (well actually a real response but from an insider). I don't have access to the speadsheet to see if it looks right there. [17:32:25] gave you access bd808. sorry. [17:32:36] your response is registered bd808. [17:32:53] lzia: no worries on the access. I never needed it before :) [17:33:57] * bd808 updates the link in the email template to point to the right form [17:35:44] a-team: we might just had an outage with kafka, connections were dropped [17:35:52] ok [17:35:55] :( [17:35:58] we are investigating it but if you see jobs failing this might be the issue [17:35:59] need help? 
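    [note] The fstab audit described in T147879 above comes down to replacing /dev paths with filesystem UUIDs so entries survive device renumbering across reboots. A minimal sketch for one entry; the device name and pattern are illustrative:
        # look up the UUID of the partition currently mounted from /dev/sdd1
        UUID=$(sudo blkid -s UUID -o value /dev/sdd1)
        # swap the /dev path for the UUID in /etc/fstab, keeping a backup
        sudo sed -i.bak "s|^/dev/sdd1[[:space:]]|UUID=${UUID} |" /etc/fstab
        # sanity-check before the next reboot
        grep "$UUID" /etc/fstab && sudo mount -a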
[17:36:07] yea [17:36:58] no no now it is recovered [17:37:03] or at least it seems [17:37:12] oh cool [17:39:40] * milimetric appreciates the kafka whisperers very much [17:48:20] team going afk for a bit, I am a bit tired from today :D [17:48:34] will re-join in an hour to check if everything is ok [17:57:51] Analytics-Kanban: Tech Talk: Pivot - https://phabricator.wikimedia.org/T148776#2732347 (Milimetric) [17:58:31] Analytics, MediaWiki-API, Reading-Infrastructure-Team: Add pageview stats to the action API - https://phabricator.wikimedia.org/T144865#2612941 (Tgr) [18:18:17] mforns: I'm taking care of the 2 oozie jobs not yet relaunched [18:18:41] joal, oh can I watch? [18:18:49] mforns: you pay ? [18:18:51] :-P [18:18:59] mforns: sorru , too easy [18:19:06] mforns: sure, let's do it together [18:19:17] xDDD [18:19:17] batcave [18:19:20] omw [18:19:51] mforns: I'm very much puzzled by the karma setup, I'm going to take some time off now, will be back to it later but you might be gone by then [18:19:58] we can catch up tomorrow if so, I'll ping here when I'm back. [18:20:27] milimetric, ok I'll be here for a while, but if we do not meet afterwards, lets do it tomorrow [18:32:02] mforns: select * from webrequest_sequence_stats_hourly WHERE webrequest_source = 'upload' and year = 2016 and month = 10 and day = 20 and hour IN (13, 14, 15, 16) ; [18:32:50] (CR) Jonas Kress (WMDE): [C: 1] Use ^ and $ while spliting metric value and type [analytics/statsv] - https://gerrit.wikimedia.org/r/308959 (owner: Addshore) [18:38:31] !log created 0000390-161020124223818-oozie-oozi-C to re-run webrequest-load-check_sequence_statistics-wf-upload-2016-10-20-14&15 (oozie errors) [19:01:35] Analytics-Kanban, EventBus, Wikimedia-Stream, Services (watching): Prepare eventstreams (with KafkaSSE) for deployment - https://phabricator.wikimedia.org/T148779#2732491 (Ottomata) [19:04:37] ottomata: all good? [19:04:57] Hi elukey, we restarted a couple job with mforns :) [19:05:11] Logging off for tonight a-team :) [19:05:20] bye joal! [19:05:47] nice thank you guys! I was about to do them :) [19:08:00] all seems good (kafka, vk, etc..) [19:08:02] so logging off [19:08:06] elukey: yeah [19:08:07] all seems good [19:08:09] thanks [19:08:10] ttyt [19:08:14] hola! [19:08:20] I thought you were akf [19:08:22] *afk [19:08:32] super, ttyt :) [19:08:54] got back! [19:08:58] no longer afk! [19:09:00] I am [19:09:07] otk [19:09:10] (on the keyboard?) [19:09:11] hehe [19:09:13] ok laters! [19:19:13] (PS3) Nuria: Adding several wikis to Pageview whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/316838 (https://phabricator.wikimedia.org/T130249) [19:20:09] (CR) Nuria: "Sorry, again, corrected spaces to be tabs." [analytics/refinery] - https://gerrit.wikimedia.org/r/316838 (https://phabricator.wikimedia.org/T130249) (owner: Nuria) [19:28:42] (PS4) Nuria: Adding several wikis to Pageview whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/316838 (https://phabricator.wikimedia.org/T130249) [19:44:18] Analytics-Kanban, Patch-For-Review: Count pageviews for all wikis/systems behind varnish - https://phabricator.wikimedia.org/T130249#2732600 (Nuria) a:Nuria [19:54:07] Analytics-Kanban, EventBus, Wikimedia-Stream, Services (watching): Prepare eventstreams (with KafkaSSE) for deployment - https://phabricator.wikimedia.org/T148779#2732491 (mobrovac) > - create mediawiki/services/eventstreams/deploy repo with scap config and node_modules (figure out how services b... 
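    [note] The query joal pasted above is the usual check that the flagged hours are really affected before re-running the coordinator. Together with the oozie CLI it looks roughly like this, assuming the CLI is pointed at the right server (OOZIE_URL or -oozie) and that the table lives in the wmf database; the coordinator id is the one from the !log line:
        # status of the re-run coordinator created above
        oozie job -info 0000390-161020124223818-oozie-oozi-C

        # sequence statistics for the affected hours (same query as above)
        hive --database wmf -e "
          SELECT * FROM webrequest_sequence_stats_hourly
          WHERE webrequest_source = 'upload'
            AND year = 2016 AND month = 10 AND day = 20
            AND hour IN (13, 14, 15, 16);"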
[19:56:11] Analytics-Kanban, EventBus, Wikimedia-Stream, Services (watching): Prepare eventstreams (with KafkaSSE) for deployment - https://phabricator.wikimedia.org/T148779#2732491 (Pchelolo) >>! In T148779#2732642, @mobrovac wrote: >> - create mediawiki/services/eventstreams/deploy repo with scap config a... [19:57:57] Analytics-Kanban, EventBus, Wikimedia-Stream, Services (watching), User-mobrovac: Public Event Streams - https://phabricator.wikimedia.org/T130651#2732665 (Ottomata) Thanks for the feedback everyone. At this time, we are moving forward with SSE. We can always revisit possible websocket supp... [20:01:23] (PS2) Nuria: Enhancing regex to support pageviews to non-knowledge wikis [analytics/refinery/source] - https://gerrit.wikimedia.org/r/316845 (https://phabricator.wikimedia.org/T130249) [20:02:12] (CR) Nuria: Enhancing regex to support pageviews to non-knowledge wikis (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/316845 (https://phabricator.wikimedia.org/T130249) (owner: Nuria) [20:08:06] hi, what's the command to do dynamic listening of the hive-destined event pipe? e.g. if i want to see requests to the scb cluster / graphoid? [20:08:20] Analytics-Kanban, Mobile-Content-Service, Wikipedia-Android-App-Backlog, Spike: [Feed] Establish criteria for blacklisting likely bot-inflated most-read articles - https://phabricator.wikimedia.org/T143990#2585510 (JMinor) Has anyone actually looked at using the community developed criteria I hav... [20:08:26] milimetric or ottomata ^^ ? [20:09:09] yurik: not sure what you are asking [20:09:18] you want to see live logs? [20:09:22] ottomata, yep [20:09:24] webrequest logs as they come in? [20:09:28] live events as they come [20:09:33] yep, to the scb cluster [20:09:46] hmm, there's nothing to get you just scb cluster [20:09:47] i'm trying to figure out what is causing all the extra traffic to the graphoid [20:09:48] but you can grep [20:09:52] yep [20:09:56] what cache cluster is it? [20:09:59] misc? [20:10:01] yep [20:10:36] i know there is a magic command, but i couldn't find it [20:10:39] kafkcat -C -b kafka1012.eqiad.wmnet:9092 -t webrequest_text [20:10:45] ottomata, on stat1002? [20:10:46] oh [20:10:47] do misc [20:10:48] yes [20:10:54] do misc what? [20:10:54] kafkcat -C -b kafka1012.eqiad.wmnet:9092 -t webrequest_misc [20:10:57] not text [20:10:58] ah [20:11:01] excellent, thanks! [20:11:14] ack [20:11:15] kafkacat [20:11:19] kafkacat -C -b kafka1012.eqiad.wmnet:9092 -t webrequest_misc [20:11:23] typos, sorry [20:12:11] thanks :) [20:12:33] sho thang [20:15:42] ottomata, sorry to bug again - i'm trying to catch these items, and i don't see them --- https://www.mediawiki.org/api/rest_v1/page/graph/png/Extension%3AGraph%2FDemo/2221877/f6ba41370b2fd4e44cece2d4b5b6720795f8dbdb.png [20:15:58] yurik: are you grepping for just that? [20:16:08] the uri parts are split up into different fields [20:16:18] ottomata, i'm grepping for -E '[0-9]\.png' [20:16:25] aye [20:16:30] hm [20:16:35] i dunno which cache cluster rest api goes to [20:16:36] Analytics-Kanban, Mobile-Content-Service, Wikipedia-Android-App-Backlog, Spike: [Feed] Establish criteria for blacklisting likely bot-inflated most-read articles - https://phabricator.wikimedia.org/T143990#2732755 (Mholloway) @JMinor Where are you finding info on mobile vs. desktop views? Are yo... [20:16:38] i doubt its misc [20:16:45] i thought it was ... [20:16:50] maybe it is text now, let me try [20:16:50] mobrovac: ^ do you know? 
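    [note] To narrow the kafkacat stream down to the graphoid requests being chased below, the consumer command above can be piped through grep, or through jq if it is available on the host. A sketch, assuming webrequest messages are JSON with dt/uri_host/uri_path fields:
        # tail the text topic from the current end and keep only graphoid PNG requests
        kafkacat -C -b kafka1012.eqiad.wmnet:9092 -t webrequest_text -o end -q \
          | grep -E '/graph/png/.*[a-f0-9]\.png'

        # or, with structured filtering:
        kafkacat -C -b kafka1012.eqiad.wmnet:9092 -t webrequest_text -o end -q \
          | jq -r 'select(.uri_path | test("/graph/png/")) | .dt + " " + .uri_host + .uri_path'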
[20:17:50] yep, i'm getting it [20:17:53] thanks, it is text [20:17:55] should be in text [20:18:24] thanks [20:20:56] Analytics-Kanban, Mobile-Content-Service, Wikipedia-Android-App-Backlog, Spike: [Feed] Establish criteria for blacklisting likely bot-inflated most-read articles - https://phabricator.wikimedia.org/T143990#2732814 (JMinor) I've been using the pageviews web tool to check things that look suspcious... [20:28:41] yurik: not sure if you saw this as well but [0-9]\.png wouldn't match the ".....b.png" uri you had above [20:29:07] probably want like [a-z0-9]\.png or something [20:29:10] milimetric, yeah, already fixed that, but it was the text cluster that was the culprit :) [20:29:17] k [20:29:18] and no, its a-f0-9 :) [20:29:57] Analytics-Kanban, Mobile-Content-Service, Wikipedia-Android-App-Backlog, Spike: [Feed] Establish criteria for blacklisting likely bot-inflated most-read articles - https://phabricator.wikimedia.org/T143990#2732861 (Nuria) >This is a simple metric which looks at the ratio of mobile vs. desktop vie... [20:29:58] milimetric, a question for you though - do we have any good way of copying hive -> grafana (summarized of course) ? [20:30:10] e.g. where we would provide queries, and all the plumbing is already done? [20:30:12] yurik: like, preiodically? [20:30:17] our analytics team is looking at it [20:30:17] yep [20:30:21] yurik: we do that with spark [20:30:23] yep [20:30:25] yurik: not hive [20:30:38] the request data - whatever the storage :) [20:30:39] nuria: it's ok with hive, and probably easier with reportupdater [20:31:01] it's not real time you want, right yurik? You'd want like daily... weekly? [20:31:10] milimetric: with report updater you can easily get the results from hive, sure, send them to graphana is an additional step [20:31:35] milimetric, well, my interest is somewhat in between - e.g. minute/hourly [20:31:48] also, it would be great to be able to backfill stuff [20:31:48] oh ok, hm [20:32:05] yurik: maybe the best thing would be an hourly oozie job? [20:32:13] i'm adding this to the list: https://etherpad.wikimedia.org/p/stream-processing [20:32:16] because request data is only available hourly, so you won't get lower than that [20:32:17] yurik: what type of data, you know that we do not have real time anything [20:32:20] ottomata: yep, +1 [20:32:36] yurik: and there is no guarantee the last hour is processed [20:32:39] yurik: what kind of thing do you want to graph? [20:32:40] Analytics-Kanban, Mobile-Content-Service, Wikipedia-Android-App-Backlog, Spike: [Feed] Establish criteria for blacklisting likely bot-inflated most-read articles - https://phabricator.wikimedia.org/T143990#2732864 (Mholloway) Thanks, @JMinor. Our main concern behind this task is that, practicall... [20:32:47] milimetric, nuria, we have a script that runs hourly on stat1002 to query tons of tables for select count() ..., and upload it to grafana [20:33:01] but it would be great to combine it with some good data from webrequests [20:33:06] as well as eventlogging [20:33:10] What type of data? [20:33:15] requests [20:33:21] webreqs [20:33:32] and eventlogging data [20:33:47] ok, it might help if only one of us walks you through this [20:33:49] :) [20:33:50] basically - custom queries that our analytics would come up with, going into grafana [20:34:03] hehe, actually is there a doc page somewhere? 
[20:34:05] yep [20:34:13] i would point our analysts at it [20:34:22] not explaining this specific thing [20:34:29] but yes, there's lots of docs [20:34:47] I think the main message to get across is that webrequest is not made for real time, it is at least one hour behind and possibly more [20:35:02] bearloga, ^ [20:35:04] so the key is that data right now is ingested into the wmf.webrequest table "when an hour is ready" [20:35:11] nuria: yep, I got it [20:35:21] milimetric: jaja i know you know [20:35:43] so yurik / bearloga: that hour may be delayed if there are problems processing [20:35:45] milimetric, its ok for the data to be updated with a delay, as long as its resolution is an hour or more [20:35:52] so if you set up a cron with a hive script, you're not guaranteed to have proper data [20:36:07] yurik: ok, so then the best tool for this, since it's batch and not real-time is oozie [20:36:15] chelsyx, ^^ also [20:36:16] oozie will trigger jobs based on when hours are available [20:36:37] yurik / chelsyx / bearloga: so the question is, do you need data from the webrequest table or will pageview_hourly do? [20:36:48] milimetric, right, but do you have all the plumbing to push that data to grafana storage? [20:36:51] https://wikitech.wikimedia.org/wiki/Analytics/Data explains the difference [20:37:05] yurik: we have pieces, and oozie is the major piece [20:37:16] Yurik: right, I would be more interested to understand what is the end goal? What do we hope to learn from the data? [20:37:27] yurik: and based on that advice on best solution [20:37:37] also, we do a little bit of spark -> graphite stuff and use that to set data timestamp in graphite [20:37:40] oozie will execute arbitrary scripts and we have scripts that write stats to graphite [20:37:41] milimetric, pageviews wouldn't have the needed data - i want things like requests to Graphoid or kartotherian service, etc [20:37:43] so graphs that are delayed will be properly backfilled [20:37:49] but we don't do a lot of that, so it isn't very streamlined [20:37:54] i did this dashboard - https://grafana.wikimedia.org/dashboard/db/interactive-team-kpi [20:37:57] its code [20:38:33] it has some useful data for us, but not as much as we would like. So we need to add some more magical webrequest and eventlogging queries to feed into grafana [20:38:41] milimetric yurik nuria chelsyx: yeah, maps is interested in computing usage via webrequest and having that data available in grafana [20:39:02] bearloga: nothing thus far implies real time or even less resolution than daily [20:39:06] bearloga, its interactive, thank you very much :)))))) [20:39:27] right, yurik, why hourly? [20:39:35] yurik: https://github.com/wikimedia/analytics-refinery-source/blob/699614fabaf0d19f219c5b594a184422110ae8a3/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/RESTBaseMetrics.scala#L80 [20:39:42] bearloga: usage is affected by daily patterns thus daily resolution seems that would be the lowest you want to go [20:39:51] nuria, it would be good to have higher res than daily, but hourly allows us to track service load times through the day [20:40:05] bearloga: load times is not a metrics of usage [20:40:12] yurik: but rather performance [20:40:17] yurik: but how do you get load time from webrequest? 
[20:40:20] yurik: those are two different use cases [20:40:29] i never said load time - i want server load [20:40:36] how many requests server gets through the day [20:40:41] hourly resolution is good enough [20:40:47] nuria: yurik I think he's talking about "time during the day that have higher usage" type of load [20:40:52] yep [20:40:54] thx :) [20:40:56] I see, so you match request count against server performance [20:41:03] that too [20:41:32] the idea here is not the specifics of the actual data, but the overall approach of how our analytics team can extract the data we need and pipe it into grafana, without reinventing the wheel :) [20:41:49] yurik: but actually teh specifics of teh case matter a lot [20:41:59] yurik: cause we have like 3 different ways to do that [20:42:06] ok, so there's no built-in for this, but it is something we talked about. I don't really want to make a one-off for everyone doing this kind of filtering. Because everyone would then touch every webrequest (and that's a lot of processing) [20:42:26] two basic data sources at this point being webrequests (but not page reqs - subpage resources might be needed), as well as eventlogging mysql tables [20:42:27] instead, I'd like a central place where people register a filter and an action to take with the matching data [20:42:31] nuria milimetric: yurik basically wants to have http://discovery.wmflabs.org/maps/#tiles_total_by_zoom (and other metrics that we calculate from our hive queries of webrequest data on a daily basis) but in grafana [20:43:30] right, but it might be "show me per-country distribution of the map usage data) [20:43:39] yurik: we'll figure out the general part, but you can pick which way you prefer: spark -> grafana or oozie -> custom script -> grafana [20:43:40] or - give me unique user count per country [20:44:27] milimetric, not sure of the difference. As long as we can provide a custom filter / query / where clause with aggregation, its good enough :) [20:44:27] yurik / bearloga: the spark way would be something like this: https://github.com/wikimedia/analytics-refinery-source/blob/699614fabaf0d19f219c5b594a184422110ae8a3/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/RESTBaseMetrics.scala#L80 [20:44:32] yurik: most of that type of data can be calculated daily though, i see little value in that type of fine grained data being at a lower resolution. I can see # of requests but that's it [20:44:47] well oozie would probably be in there either way [20:44:53] oozie -> spark ->grafana [20:44:59] oozie -> custom thing -> grafana [20:45:00] right , i was going to say that [20:45:00] and the oozie way would be a workflow like this: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/pageview/hourly/workflow.xml [20:45:03] s/grafana/graphite/ [20:45:19] ottomata: but isn't that filter already ooziefied? [20:45:20] milimetric ottomata: thanks for the links! [20:45:25] milimetric: no [20:45:31] how's it work? [20:45:49] the RESTBaseMetrics one [20:45:49] milimetric, yurik, bearloga an example that uses oozie to send graphana wikidatametrics: [20:46:08] yurik: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/WikidataArticlePlaceholderMetrics.scala [20:46:24] but again, that type of data sounds daily to me [20:46:28] bearloga, are you taking notes? 
i just hope my irc logs won't be lost :)( [20:46:39] it's all documented [20:46:51] yurik: IRC Cloud :) [20:47:01] yurik: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Spark [20:47:12] yurik: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Oozie [20:47:27] nuria, grafana would auto-aggregate data as it gets older, so if the original data comes from hourly slobs, it should be ok. What about eventlogging/ [20:47:39] nuria / ottomata: but shouldn't we combine all these filters? [20:47:41] is there a good solution for the eventlogging stream/mysql tables source? [20:47:51] yurik: but hitting teh cluster at that resolution when it is not needed should not be necessary [20:47:55] and have a class with a list of the ones the oozie job is applying? [20:48:10] yurik: , and oozie job that runs the wikidata article placeholder spark job [20:48:10] https://github.com/wikimedia/analytics-refinery/tree/master/oozie/wikidata/articleplaceholder_metrics [20:48:23] milimetric: whatchamean? [20:48:32] * yurik feels lost with the tech names :( [20:48:36] hahah [20:48:46] so, we have oozie set up to run each of these filters individually [20:48:54] yurik: https://wikitech.wikimedia.org/wiki/Analytics/Cluster#Glossary [20:49:06] and each one pulls all the data for an hour, filters it with a where clause, and sends it to grafana [20:49:08] ah, milimetric you mean you want to consolodate the oozie jobs [20:49:11] yep [20:49:14] and the scala [20:49:19] milimetric: i don't know much about the scala code [20:49:22] yurik: but also be aware of graphana's limitations. Showing per project data is not one of its strengths, on my opinion wikidata's graphanas dashboards are not readable: https://grafana.wikimedia.org/dashboard/db/article-placeholder [20:49:24] if it is that simple, then taht would make a lot of sense [20:49:42] ottomata: yeah, it should be pretty simple and then adding one of these would just be a copy/paste of another filter and only modify scala code [20:49:47] well, plus a -source deploy [20:50:12] nuria, grafana shows per project stuff just fine imo - its all about how you set it up. See the bottom graphs at https://grafana.wikimedia.org/dashboard/db/interactive-team-kpi [20:50:17] milimetric: sounds like you want a generic spark job that is paramaterizable: data source, fitlers, etc. and then graphite output keys? soethign like that [20:50:29] yep [20:50:40] (heheh sounds a little like reportupdater :p ) [20:50:44] yurik: if you have 10 project s sure, if you have 100, no , not really [20:51:01] yeah, but we agreed to not do webrequest or realtime stuff with reportupdater [20:51:07] nuria, i do have 100s, i just show top 10 :) [20:51:09] yeah [20:51:11] makes sense [20:51:45] and of course the aggregation of "top 10 + others" can be done by the query as well [20:52:05] yurik: ya, i meant if you want to see all is too much on that ui [20:52:13] yurik: but nevermind that [20:52:24] nuria: what do you think? I can make a task to clean this up and make a single place where we do these kinds of filters [20:52:28] nuria, sure, but again - its all about how you set up your dashboard (plus with an option to drill down works well) [20:52:30] oozie + spark -> grafana [20:53:05] all i care about is a place where i can give you a big sql/hive/... query to get extractly the timeseries i want :) [20:53:14] yurik: but so you'd have to still collect all data and filter out after it gets to grafana... 
seems somewhat wasteful if 95% is never looked at [20:53:32] that's why i need your guidelines :) [20:53:32] wow that glossary text is old [20:53:34] updating some [20:54:21] hm milimetric it soudns lke yurik's use cases might be more complicated than just applying a filter for an hour and counting [20:54:26] for many data series, we do want at least hourly resolution. And we can totally process it AFTER the hour or even AFTER the day - but the resolution of the grafana data should be at least per hour [20:54:28] milimetric: i agree [20:54:40] milimetric: with otto [20:54:43] yurik: are you talking about joining eventlogging + webrequest ? [20:54:51] ottomata, no [20:54:55] oh ok [20:54:57] he wants those separately [20:54:59] then maybe ok [20:55:02] the joining might happen on the dashboard [20:55:04] from webrequest he only wants a count [20:55:07] e.g. show two timeseries on the same graph [20:55:12] milimetric: i would let yurik try a bit at spark and consolidate once use cases become more clear [20:55:12] if you just want aggregation with filter and counting [20:55:24] that should be easy to make a generic spark job and manage it all in one oozie bundle [20:55:34] nuria: seems like a bad user experience though, we should be the ones consolidating that [20:55:35] nuria: not a bad idea [20:55:47] I give you that priority on this is lower than the other stuff this quarter [20:55:53] ottomata, example resulting timeseries: show me the per-country unique user count map usage [20:56:03] well, milimetric it might be good to get yurik into spark so he knows what he can do there [20:56:18] sure, I'm cool with that [20:56:31] ottomata, i think bearloga knows much more about data querying than i do by now :) [20:56:35] as well as chelsyx [20:56:36] milimetric: agreed, but doing this is significant work that gets in the way of other work we are set up to do and there is a not-sohard workarround [20:56:47] yurik: if you start banging your head into your desk, let us know and we'll attempt to prioritize cleaning this up [20:56:52] I would also expect our analysts to be able to write spark [20:57:11] nuria: maybe python-spark, but I think scala is unfair [20:57:19] nawwwwwwww [20:57:22] milimetric: scala is fun! [20:57:25] no [20:57:27] but ja, python is a better place to start [20:57:33] :) [20:57:38] bearloga: how do you fill about scala/spark? [20:57:39] scala is a bad language [20:57:41] my biggest concern - I do NOT want our analysts or devs to spend time writing custom grafana upload code, or cron job management, or job monitoring :) [20:57:52] hha [20:58:02] yurik: they do not have to do that [20:58:06] yurik: nope, we take care of that stuff, and the grafana code is super easy [20:58:08] yeah yurik we can help with all of that. 
at the very least you can copy what is in analyics/refinery [20:58:08] yurik: that is what oozie is for [20:58:13] so far seems like we have been doing that :) [20:58:23] see our dashboards - it has all that AFAIK [20:58:25] yurik: that's your fault for not asking us :) [20:58:28] and we can deploy another job [20:58:29] :-P [20:58:43] yurik: agree with milimetric , you brought it up on yourselves [20:58:43] milimetric: wants to do some abstraction to make adding more jobs like this easier in the future [20:58:45] which would be nice [20:58:48] milimetric, users not using your stuff is always your fault - its a usability issue :)))) [20:58:48] but we might not have it for now [20:59:09] yurik: agree with that , but our usability had improved 100% [20:59:15] yeah, oozie is not the easiest to use.... [20:59:16] haha [20:59:19] yurik: by the time those dashboards came to exist [20:59:21] yurik: it would be true if you tried to use it, failed, and told us [20:59:30] yurik: but nevermind that [20:59:33] haha [20:59:33] anyway, now we're talking :) [20:59:44] and the customer's always right [20:59:53] so like I say, try what andrew suggested and if it's too hard let us know [20:59:55] milimetric, not exactly - if our analysts weren't even aware of its existance, we have failed cross-team information sharing :) [21:00:02] oh yeah, +100 [21:00:10] cross-team info sharing is 'da worst [21:00:23] I think I did like 3 tech talks on dashiki and 0 people know it exists [21:01:03] yurik: but that is a wmf problem yurik, our team spends tons of time evangelizing so we hope that anyone interested in data access comes to talk to us before rolling out its own idea [21:01:20] yep. bearloga, chelsyx, ottomata, milimetric, nuria - how about a sync up half an hour hangout to go over the reqs and implementation and general presentations? this way discovery would present what we created, you show what you have (a short demo), and we get to be on the same page [21:01:29] yurik: it could be that our answer is that is real easy what you wnat to do or taht is real hard [21:01:31] possibly an hour [21:01:40] :) [21:02:00] i feel that without a proper demo of these techs, 5-way talking on IRC is only getting a bit more confusing :) [21:02:03] yurik: sure, anytime [21:02:13] esp when we also discuss how some of these techs should be improved [21:02:28] ok, need to logoff now. please set up something for next week [21:02:43] I'm happy to show you this existing workflow now if you want, or next week [21:02:48] +1 [21:02:52] you prob don't need me though :) [21:02:53] bearloga, chelsyx, i feel like it would be ideal for you two to drive it, as you would stand the most to gain from it? [21:02:57] or maybe you do [21:03:02] so i can evangelize stream processing [21:03:14] ottomata, exactly, we would love everyone to present their stuff [21:03:18] yurik chelsyx ottomata milimetric nuria: +1 to a healthy hangout of discussions and demos. -1 to blaming each other and saying "you failed" and "it's your fault" [21:03:25] i also think we might want a youtube recording of sorts [21:03:37] ottomata: do you mean you'd wanna talk about using flink here? Or the spark example you pasted? 
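    [note] Of the two paths sketched above (oozie -> spark -> graphite, or oozie -> custom script -> graphite), the custom-script path can be as small as the following hourly job. This is a hypothetical sketch, not existing refinery code: the metric key, the graphite host (placeholder) and the exact filter are assumptions to be adapted, while the table, partition fields and the graphite plaintext protocol are as normally used on the cluster.
        #!/bin/bash
        # hypothetical hourly job: count webrequests matching a filter, push to graphite
        YEAR=$1; MONTH=$2; DAY=$3; HOUR=$4

        COUNT=$(hive -S -e "
          SELECT COUNT(*) FROM wmf.webrequest
          WHERE webrequest_source = 'text'
            AND year=${YEAR} AND month=${MONTH} AND day=${DAY} AND hour=${HOUR}
            AND uri_path LIKE '/api/rest_v1/page/graph/png/%';")

        # graphite plaintext protocol: '<metric.key> <value> <unix timestamp>'
        TS=$(date -u -d "${YEAR}-${MONTH}-${DAY} ${HOUR}:00:00" +%s)
        echo "maps.graphoid.requests.hourly ${COUNT} ${TS}" | nc -w 3 <graphite-host> 2003
    Driving this from an oozie coordinator triggered when the hour's webrequest partition is marked done, rather than from a plain cron, is what guards against the "last hour may not be processed yet" problem discussed above.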
[21:03:44] bearloga, it wasn't blaming (other than myself :))) [21:04:01] bearloga: sorry if you lost context, I'm smiling and kidding all the time, that wasn't meant to be harsh in any way [21:04:11] I just assumed yurik knew I was messing around [21:04:16] i think we haven't done enough to cross-team communicate, but that's why we are talking here :) [21:04:35] yeah, i think i have known you all long enough to take anyone seriously :))) [21:04:48] haha [21:04:52] good, good [21:05:07] yep, bearloga / yurik: feel free to put something on my calendar next week, you can make ottomata and nuria optional [21:05:12] oki, how about this - next week, everyone does their own demo, we get in sync to what's needed, and reduce tech dups :) [21:05:21] yep [21:05:30] bearloga, chelsyx, how does that sound on your side? [21:06:02] * yurik still blames US politics for messing up this discussion [21:06:20] ottomata: where's the oozie job for RESTBaseMetrics? [21:06:27] lol [21:06:30] hahaah nonono [21:06:39] not time to evangelize flink [21:06:44] ok [21:06:46] phew [21:06:46] yurik: you do'nt want to here waht I say about stream processing [21:06:57] because we won't have it for like a year maybe [21:07:04] HM, but you can do spark streaming now [21:07:07] for analytics stuff [21:07:09] maybe! :D [21:07:09] haha [21:07:57] hmmmmm you know, we could, as a first deploy of flink, do this general filtering thing [21:08:30] consume webrequest from kafka, filter, produce a topic that can be consumed and sent straight to grafana [21:08:47] seems like a good base case [21:09:20] (and everyone would push different grafana metrics to the same topic) [21:09:39] milimetric: kinda like statsv :p [21:10:00] i mean...eventlogging can do that (just not at webrequest scale) [21:10:04] buut yayayay [21:10:35] milimetric: that sounds actually a little like statsd [21:10:40] using kafka as a buffer, but ja [21:10:41] something [21:11:00] consuming and filtering webrequests isn't cheap, but def doable [21:11:07] yurik: sounds good [21:11:27] ottomata, i would love to hear about it as part of your demo, so that everyone knows what to expect in a years time :) [21:12:23] bearloga, want to schedule it, or should i? [21:14:50] ottomata: I mean, not cheap sure, but we're doing it in it looks like 3 separate places already for this filtering thing and about to add a 4th. So replacing 4 with 1 is a win [21:17:35] ja [21:17:40] agree [23:04:23] Analytics-Kanban, Mobile-Content-Service, Wikipedia-Android-App-Backlog, Spike: [Feed] Establish criteria for blacklisting likely bot-inflated most-read articles - https://phabricator.wikimedia.org/T143990#2733377 (JMinor) Thanks. Yes, my whole push here is exactly to have a heuristic that is not... [23:12:28] Analytics-Kanban, Mobile-Content-Service, Wikipedia-Android-App-Backlog, Spike: [Feed] Establish criteria for blacklisting likely bot-inflated most-read articles - https://phabricator.wikimedia.org/T143990#2733395 (JMinor) >>! In T143990#2732861, @Nuria wrote: >>This is a simple metric which look... [23:36:21] Analytics, Reading-Web-Backlog: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2733451 (Aklapper)