[00:01:56] (PS1) Yurik: https by default, bug fix, saferun tile output [analytics/zero-sms] - https://gerrit.wikimedia.org/r/247763 [00:02:17] (CR) Yurik: [C: 2 V: 2] https by default, bug fix, saferun tile output [analytics/zero-sms] - https://gerrit.wikimedia.org/r/247763 (owner: Yurik) [00:18:51] Analytics, Continuous-Integration-Config, WMDE-Analytics-Engineering, Wikidata: Add basic jenkins linting to analytics-limn-wikidata-data - https://phabricator.wikimedia.org/T116007#1740759 (Dzahn) [00:19:05] Analytics, Continuous-Integration-Config, WMDE-Analytics-Engineering, Wikidata: Add basic jenkins linting to analytics-limn-wikidata-data - https://phabricator.wikimedia.org/T116007#1738262 (Dzahn) [00:39:56] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [30.0] [00:41:37] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0] [01:24:31] Analytics-Backlog: Create deb package for Burrow - https://phabricator.wikimedia.org/T116084#1740907 (madhuvishy) [01:33:44] Analytics-Backlog, Analytics-Kanban: Projections of cost and scaling for pageview API. - https://phabricator.wikimedia.org/T116097#1740916 (kevinator) I think we should look at a 5 year horizon. How much will we need to spend when the hardware goes out of warranty and needs to be replaced. Keep in mind t... [01:49:03] Analytics-Backlog: Make beeline easier to use as a Hive client {hawk} - https://phabricator.wikimedia.org/T116123#1740922 (madhuvishy) NEW [10:05:38] hi a-team :] [10:07:00] (PS1) Christopher Johnson (WMDE): adds internal_reference function fixes column width for datatables on home fixes seeAlso missing param [wikidata/analytics/dashboard] - https://gerrit.wikimedia.org/r/247799 [10:13:41] (PS2) Christopher Johnson (WMDE): adds internal_reference function fixes column width for datatables on home fixes seeAlso missing param fixes config path in bulk_sparql script [wikidata/analytics/dashboard] - https://gerrit.wikimedia.org/r/247799 [10:15:03] (CR) Christopher Johnson (WMDE): [C: 2 V: 2] adds internal_reference function fixes column width for datatables on home fixes seeAlso missing param fixes config path in bulk_sparql scri [wikidata/analytics/dashboard] - https://gerrit.wikimedia.org/r/247799 (owner: Christopher Johnson (WMDE)) [10:43:55] joal, yt? [11:09:50] (PS1) Christopher Johnson (WMDE): adds Wikimedia Categories to home view [wikidata/analytics/dashboard] - https://gerrit.wikimedia.org/r/247808 [11:09:52] (PS1) Christopher Johnson (WMDE): fixed untracked files [wikidata/analytics/dashboard] - https://gerrit.wikimedia.org/r/247809 [11:11:50] (CR) Christopher Johnson (WMDE): [C: 2 V: 2] adds Wikimedia Categories to home view [wikidata/analytics/dashboard] - https://gerrit.wikimedia.org/r/247808 (owner: Christopher Johnson (WMDE)) [11:12:14] (CR) Christopher Johnson (WMDE): [C: 2 V: 2] fixed untracked files [wikidata/analytics/dashboard] - https://gerrit.wikimedia.org/r/247809 (owner: Christopher Johnson (WMDE)) [11:34:15] (PS3) Mforns: Add oozie job to compute browser usage reports [analytics/refinery] - https://gerrit.wikimedia.org/r/246851 (https://phabricator.wikimedia.org/T88504) [11:37:45] hey mforns [11:37:49] Went to lunch :) [11:37:52] hi joal! [11:38:02] wassup? [11:38:31] I was going to try and test the eventlogging patch in beta cluster [11:38:44] okeyyy [11:39:01] do you want to this too? 
[11:39:08] sure ! [11:39:20] Do you give me 5 mins before starting ? [11:39:26] Then fully up with you :) [11:39:34] of course [11:39:42] k brb [11:48:50] mforns: to batcave :) [11:48:54] joal, omw [12:40:13] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [30.0] [12:41:39] (PS1) Addshore: Improve README [wikidata/analytics/dashboard] - https://gerrit.wikimedia.org/r/247824 [12:43:44] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0] [13:03:55] (CR) Addshore: [C: 2 V: 2] Improve README [wikidata/analytics/dashboard] - https://gerrit.wikimedia.org/r/247824 (owner: Addshore) [13:04:20] ottomata, yt? [13:08:13] hiya yup [13:08:22] milimetric: sorry i missed your text yesterday, [13:08:39] sok! [13:08:46] after hours, didn't expect you to be around [13:08:55] just checking in case you were like staring at a wall and would rather deploy instead [13:10:05] Quick query again all! How longs does data stay in graphite? Is there a way to make it stay longer? and if I understand correctly anyone with cluster access can just throw more data into graphite? [13:10:33] addshore: it stays for a while but starts losing resolution [13:10:40] ahh okay [13:10:41] I'm not sure when it loses what resolution level [13:10:51] thats good to know! :) [13:10:53] mforns_lunch: hiya [13:10:58] and yeah, we use statsd to send data to graphite, here's an example for event logging, one sec [13:11:33] milimetric any idea if at some point the resolution becomes 0? or just less and less but never 0? [13:11:36] this is where we create the client: https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/server/eventlogging/handlers.py#L250 [13:12:06] and here's the puppet that configures that host: https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/eventlogging.pp#L38 [13:12:34] addshore: I don't think it ever becomes zero... I think yearly at worst, but let me check back, I'm curious [13:12:42] awesome! [13:13:27] for a bit of background, im working on tracking wikidata metrics, and currently I have a bunch of scripts getting data, and storing in mysql and outputing as tsv. I'm wondering if it would make more sense to kill the sql and chuck it all in graphite [13:13:59] though, I feel I need to know more about graphite before I di that ;) [13:14:33] addshore: as long as you're just counting things, then I think graphite's great. By the way, did you know you can use grafana to hit the same data and make your graphs look much prettier? [13:14:45] here's the example for EL: https://grafana.wikimedia.org/dashboard/db/eventlogging [13:14:51] milimetric: yup! and basically everything is counting! [13:15:00] welll :) [13:15:10] not quite, but a lot of things are [13:15:12] Currently we have wdm.wmflabs.org/wdm/ but personally I would prefer to use grafana [13:15:15] http://wdm.wmflabs.org/wdm/ [13:15:20] using shiney [13:15:46] yeah, I like shiney, but if you're just counting it's easier to use grafana [13:15:58] of course, it's also easier to just keep what you have since you already did all the work [13:16:05] well, I would say most of it is just counting, and the some other calculating [13:16:08] but I get it if it makes you feel dirty :) [13:16:33] can you add legacy data / data with old timestamps into graphite? [13:16:42] or just live / now data? 
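The statsd pattern milimetric links to above boils down to firing small UDP packets in statsd's text format ("<metric>:<value>|<type>") at the statsd host. A minimal sketch of that idea — the host, port and metric name here are illustrative placeholders, not the production configuration:

```python
import socket

# statsd's wire format is "<metric>:<value>|<type>", where the type is
# "c" for counters, "g" for gauges and "ms" for timers.
STATSD_HOST = "statsd.eqiad.wmnet"  # assumption: the sending host can reach statsd
STATSD_PORT = 8125

def incr(metric, value=1):
    """Increment a counter; UDP is fire-and-forget, so a dropped packet is just a lost point."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(("%s:%d|c" % (metric, value)).encode("utf-8"),
                    (STATSD_HOST, STATSD_PORT))
    finally:
        sock.close()

incr("eventlogging.example_schema.inserted")
```

Whether a given machine can reach statsd at all is a network/firewall question, which is exactly what addshore bumps into later in the log.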
[13:19:05] (CR) Joal: [C: -1] "Comments inline :)" (5 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/246851 (https://phabricator.wikimedia.org/T88504) (owner: Mforns) [13:19:39] addshore: I've never done that, I briefly looked through the docs and couldn't find a way [13:19:49] okay! [13:19:59] right, time for some more reading I think :) [13:22:07] addshore: Madhu used Spark to send data to Graphite, and that data was historical :) [13:22:16] ooooohhhh [13:22:34] right, that's good, because that means I don't have to try and rush into a decision ;) [13:23:04] How about a way to easily export all data stored in graphite for a metric? say to a tsv? [13:23:08] I think the message sent to graphite contains time identification [13:23:36] hm, data export from graphite I don't really know [13:23:48] I think graphite provides an API to get data [13:24:02] okay, yep, going to go and trawl through the docs then! :) [13:24:13] thanks all! :) as always! [13:24:13] Meaning, the charts are displayed from getting data through an API, so there must be a way to get your hands on the raw data [13:24:34] np addshore [13:24:52] milimetric: Do you want me to test your oozie stuff ? [13:25:12] It looks good, but if it makes you feel better, I'll test it :) [13:25:16] joal: did you see the oozie tests I talked about in the comments? [13:25:23] or did you mean you want to re-test because of the changes? [13:26:08] The data sent must be in the following format: <metric path> <value> <timestamp>, Cool! [13:26:31] re-reading oozie test [13:27:51] seems fine to me milimetric, I assume you have double checked the exported files in hdfs ? [13:27:59] milimetric: I found the retention info! https://github.com/wikimedia/operations-puppet/blob/2e1a45c0f89082564e3e201ebfc0cdd0a1db5871/manifests/role/graphite.pp#L38 [13:28:40] joal: yes, I head and tail-ed both of them [13:28:44] and they looked legit [13:28:44] so it looks like after a year it will drop out [13:29:08] addshore: aha, oh! that makes the EL data make sense now [13:29:24] I wasn't seeing anything beyond a year ago and was sure we had collected before that [13:29:37] but in theory graphite can keep as much as you have storage [13:30:13] how much storage do we have? :P [13:30:26] we could easily add 1d:5y and 1w:100y if ops is ok with it [13:30:34] or something like that [13:30:56] I might go and propose a patch for that (or something like that). [13:31:25] as if we have that (or something similar) then graphite would be perfect for 99% of the stuff people want me to track.. [13:31:30] makes sense to me except i wish it could be customizable [13:31:38] it sucks that if you do that it'll start keeping data for everything [13:32:13] yeh, well, it can match regexes [13:32:59] so in theory I could just have longer retention for my stuff [13:34:40] (CR) Joal: [C: 2] "Leaving as is for merge synchro, good to go !" [analytics/refinery] - https://gerrit.wikimedia.org/r/246149 (https://phabricator.wikimedia.org/T114379) (owner: Milimetric) [13:35:04] joal: that can be merged, it's next on the deployment plan [13:35:10] as long as the deployment plan makes senes [13:35:13] *sense [13:35:49] ottomata: ping me when you're ready to deploy AQS [13:35:56] I made this etherpad: https://etherpad.wikimedia.org/p/analytics-deploy-aqs [13:36:11] I did all the steps up to the deploy section [13:36:20] milimetric: about deployment, have the aggregator changes been merged ?
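The format joal is quoting is carbon's plaintext protocol: one datapoint per line with an explicit unix timestamp, which is why historical backfills (like the Spark job he mentions) are possible — unlike statsd, which always stamps data with "now". A sketch, with a placeholder carbon host and metric name:

```python
import socket
import time

# Carbon's plaintext listener (port 2003 by default) takes lines of the form
#   <metric path> <value> <unix timestamp>
# The explicit timestamp is what makes backfilling old data possible, provided
# the retention window still covers that time.
CARBON_HOST = "graphite-in.example.org"  # assumption: whatever carbon relay is reachable
CARBON_PORT = 2003

def send_points(points):
    """points: iterable of (metric, value, unix_ts) tuples."""
    sock = socket.create_connection((CARBON_HOST, CARBON_PORT))
    try:
        payload = "".join("%s %s %d\n" % (m, v, ts) for m, v, ts in points)
        sock.sendall(payload.encode("utf-8"))
    finally:
        sock.close()

week_ago = int(time.time()) - 7 * 24 * 3600
send_points([("wikidata.example.item_count", 19000000, week_ago)])
```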
[13:36:42] joal: yeah, you merged them yesterday when you +2-ed it [13:36:48] milimetric: lets do it [13:36:52] I left you a comment about it, some repos automatically merge when you +2 [13:36:54] got some updated docuemntations? [13:37:01] (CR) Ottomata: [C: 1] Archive hourly pageviews in legacy format [analytics/refinery] - https://gerrit.wikimedia.org/r/246149 (https://phabricator.wikimedia.org/T114379) (owner: Milimetric) [13:37:04] that etherpad I linked to ^ [13:37:09] oh thanks [13:37:13] (CR) Joal: [V: 2] "Good for me, merging." [analytics/refinery] - https://gerrit.wikimedia.org/r/246149 (https://phabricator.wikimedia.org/T114379) (owner: Milimetric) [13:37:20] hmmm, milimetric, really ? [13:37:24] ottomata: I did all the steps, we just have to do the "deploy" section [13:37:28] joal: yes, really :) [13:37:35] arf, memory [13:37:47] milimetric: i think I don't need to do the update steps, right? [13:37:50] k [13:38:03] hahaha, if I was worried about missing things I'd be under a desk with a tinfoil hat rocking back and forth [13:38:18] ottomata: right, so just the --check and the real one if the check's ok [13:38:24] k [13:39:11] ok all 3 hosts at check look ok [13:39:18] !log deploying aqs [13:39:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [13:39:53] dan, shoudl we put this at https://wikitech.wikimedia.org/wiki/Analytics/AQS [13:39:54] ? [13:39:58] your etherpad stuff? [13:40:04] milimetric: deploy done. [13:40:30] milimetric: I left a comment saying, not merging for synchro, and it merged nonetheless ! [13:40:30] so FYI you can get raw data of graphs as json, https://graphite.wikimedia.org/render/?target=jobrunner.pop.wikibase-addUsagesForPage.ok.mw1004.count&format=json , still hunting for just the data [13:40:38] I assumed jenkins to read my comments, pfff [13:41:58] stupid jenkins [13:42:18] ottomata: hm... I wonder we probably have to restart restbase now, right? [13:42:27] milimetric: shall we deploy and restart the coordinators ? [13:42:31] I was just checking and it doesn't seem to be serving at the new address [13:42:39] joal: yes, after aqs [13:42:41] milimetric: it should restart it as part of deploy [13:42:53] did it say it did that properly in the output? [13:43:01] ha, ithink so, buuuut i have closed the window [13:43:04] doesn't hurt to deploy again, ja? [13:43:05] milimetric: hm, thinking of that, there is a big bunch of changes to deploy, let's wait for nuria [13:43:35] milimetric: i'm running deploy again to see [13:43:51] hm... something's weird, it doesn't have my change in the source [13:43:52] oh, now it says skipping restart because nothing changed [13:44:04] mmhm.... [13:44:07] hm.... [13:44:08] oh! [13:44:17] hm, there is a restart task, shall I run it? [13:44:25] wait, no, it's not pushing my changes [13:44:42] ah, ottomata, my bad, I missed a step: [13:44:42] https://gerrit.wikimedia.org/r/#/c/247726/ [13:44:45] that needs to be merged [13:47:21] ok so deploy again? [13:47:24] it is merged now, ja? [13:47:27] ok, ottomata, sorry, I updated the etherpad, yes, ansible again [13:47:33] yes, merged now [13:47:41] we should get that repo to ping us here in chat [13:48:05] i think you should get deploy perms [13:48:07] :) [13:48:51] yes, that would be nice [13:48:58] especially since this has so many moving parts [13:49:01] make a ticket? 
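addshore's render URL above is already most of the answer to "export a metric, say to a tsv": ask the render API for format=json (format=csv also exists) and flatten the datapoints. A small sketch using requests, reusing the metric from that URL:

```python
import requests

# For format=json the render API returns a list of series:
#   [{"target": "<series name>", "datapoints": [[value_or_null, unix_ts], ...]}, ...]
resp = requests.get("https://graphite.wikimedia.org/render/", params={
    "target": "jobrunner.pop.wikibase-addUsagesForPage.ok.mw1004.count",
    "from": "-7d",
    "format": "json",
})
resp.raise_for_status()

for series in resp.json():
    for value, ts in series["datapoints"]:
        if value is not None:
            # tab-separated: metric, timestamp, value
            print("%s\t%d\t%s" % (series["target"], ts, value))
```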
[13:49:08] you and joseph [13:49:11] good idea, doing [13:49:16] thx guys [13:49:18] it might mean i need sudo, which is scary [13:49:21] i think you need separate tickets for that [13:49:28] hopefully you only need sudo to a certain user [13:49:38] ok, looks good milimetric, it did restart restbase [13:51:53] * milimetric checks [13:55:37] guys I'm having this problem that's driving me crazy [13:56:13] if I press the up arrow in my shell to see my history, it's coming up distorted, chopped in half with part of the command mangled [13:56:27] anyone get that and know how to fix it? [13:57:54] haha nope [13:58:18] happens to me sometimes milimetric [13:58:25] never managed to have it fixed :( [14:00:12] it's happening every time and it's like ... maddening [14:00:26] I have to type out the whole aqs url every time [14:01:49] milimetric: so retaining everything in graphite at a daily scale for 50 years (after the current retention) would result in 170% of the current storage being used :P [14:02:01] just to put a random number out there ;) [14:02:20] yeah, but after a year is daily resolution even worth it? [14:02:41] I'd go to weekly resolution personally, maybe after 1 year or maybe after 3 years [14:02:49] depends on what data you're recording :) [14:04:24] well, daily for 10 years and weekly for 50 would just be 124% of the current amount [14:05:35] but, again, it still might be better to put the metrics I want in something that can easily be matched by a regex and then have another resolution applied [14:06:15] that probably has the best chance of being merged [14:06:51] right, I'll think about that then! and try and think of a place / name start that others could potentially use too [14:08:01] ottomata: thanks, all looks good [14:08:09] just took me like 30 minutes to check as I fought with my CLI [14:08:10] ughhh [14:08:11] :) [14:08:11] ottomata: there was loss in today's report for raw webrequests [14:08:20] Anything coming to mind ? [14:09:08] oooo, interesting, this retention is set at the time the first metric is sent, so changing the config in puppet won't alter the retention of any metrics already stored [14:10:14] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [30.0] [14:11:15] joal: wanna do the coordinators now? [14:12:18] joal i saw that, the 30% in the one mobile hour! not sure, no. [14:12:48] milimetric: if we wait for nuria there'll be the cassandra changes merged and deployed in the mean time [14:12:52] milimetric: is that ok ? [14:13:03] ottomata: will have a deeper look [14:13:06] mmmmmmmmmmm okkkk :) [14:13:20] mforns_lunch: lemme know when you're back, we can do the EL stuff? [14:13:27] milimetric: you are too nice to me, remember we were planning to be better at saying no ;) [14:13:52] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0] [14:13:54] oh [14:14:03] * milimetric kicks joal in the shin [14:14:11] there [14:14:16] :D [14:32:37] milimetric, hi! [14:32:44] I'm back :] [14:33:22] cool, I was gonna try to restore the missing data using your changes [14:33:38] milimetric, we tried earlier today with Joseph [14:33:41] oh! [14:33:45] sweet [14:33:49] how'd that go? [14:33:51] but we had no sudo permits in deployment-eventlogging03 [14:34:01] since the change to 03 I haven't asked for them [14:34:31] I didn't know I should...
I thought permits would be transmitted [14:34:55] hm, weird [14:34:59] but we can test on an1004 [14:35:07] mmm ok [14:35:39] hm, but i suppose that's a pretty beefy machine and it'd be hard to replicate it going OOM [14:36:12] milimetric, I wasn't planning to test it for OOM [14:36:20] just to test it for normal functioning [14:36:27] k, then an04 should be great [14:36:45] I'll checkout the change there and consume a file into sql [14:36:54] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [30.0] [14:37:03] is it analytics1004.eqiad.wmnet? [14:37:37] ok [14:37:40] batcave? [14:38:15] in which folder are you checking out? [14:38:19] there's nothing in srv [14:38:46] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0] [14:44:11] milimetric, can I help, batcave? [14:44:20] pair? [14:44:25] mforns: ah, sorry :) [14:44:26] sure! [14:44:30] cool [14:49:22] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [30.0] [14:51:03] (PS1) Joal: Correct camus-partition-checker to use hdfs conf [analytics/refinery/source] - https://gerrit.wikimedia.org/r/247847 [14:51:04] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0] [14:58:39] In theory, should sending something to statsd with a one liner like "echo "test.foo:100|c" | nc -w 1 -u statsd.eqiad.wmnet 8125" work? :/ [14:59:53] addshore: can you see statsd from the machine you are doing those commands? [15:00:04] addshore: it is not accessible from all places [15:00:11] ahhhh [15:00:29] I can't ping it ;) [15:01:36] addshore: right, cause -if i remember this right- there are several statsd hosts you can reach from certain others but not from anywhere, the deatils you have to ask in ops channel [15:01:40] *details [15:01:49] okay! [15:05:32] addshore: btw, handy tcpdump to see what is going on: tcpdump host statsd.eqiad.wmnet -nnXSs 0 [15:06:43] https://usercontent.irccloud-cdn.com/file/19tZllz5/ [15:06:56] FYI I was looking at sending stuff to statsd from stat1002 [15:17:43] addshore: ah, not possible probably [15:18:03] sad times, wonder if I can find the thing in puppet I would need to poke [15:18:07] mforns: i think we should merge this one: https://gerrit.wikimedia.org/r/#/c/246796/4 [15:18:43] addshore: i do not think 1002 is meant to connect to statsd at all but cc ottomata in case i am wrong [15:19:09] addshore: our reporting through statsd is done in the EL nodes [15:19:21] EL nodes? ;) [15:19:27] addshore: eventlogging [15:19:37] ahh, okay. [15:21:27] hmmm [15:21:41] ottomata: regular loss on hour 10 yesterday for every mobile and maps varnish :( [15:21:44] i think stat1002 should be able to reach statsd [15:21:59] joal: in all clusters? [15:22:06] yessir [15:22:06] esams, etc.? [15:22:30] maps only served from eqiad, but mobile, all clusters [15:22:35] i mean, that sounds like a restart of some kind, buuuuuut, we are ignoring where seq = 0, right? [15:22:37] ottomata: --^ [15:22:45] ottomata: no seq=0 [15:25:40] ottomata: i'll keep poking and see [15:26:53] joal this is 2015-10-20T10 ja? [15:27:08] yup [15:27:59] mforns: i think we can merge the EL sleep patch right?
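Coming back to the retention-cost estimates addshore throws out above (the "170%" and "124%" figures): whisper archives are fixed-size — roughly 12 bytes per stored point plus small headers — so comparing retention schemas is essentially comparing total point counts per metric. A quick sketch; the "current" schema below is an assumption for illustration, not necessarily what the graphite.pp role actually sets:

```python
SECONDS = {"m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}

def points(retention):
    """'1d:5y' -> number of datapoints that archive holds."""
    precision, duration = retention.split(":")
    to_secs = lambda s: int(s[:-1]) * SECONDS[s[-1]]
    return to_secs(duration) // to_secs(precision)

def total_points(schema):
    return sum(points(r) for r in schema)

# Assumed current per-metric schema, purely for illustration:
current = ["1m:7d", "5m:14d", "15m:30d", "1h:1y"]
# milimetric's suggestion from the log: add 1d:5y and 1w:100y on top.
proposed = current + ["1d:5y", "1w:100y"]

ratio = 100.0 * total_points(proposed) / total_points(current)
print(total_points(current), total_points(proposed), "%.0f%% of current size" % ratio)
```

This also lines up with the note above that retention is baked into each whisper file when its first datapoint arrives, so a schema change only affects newly created metrics unless existing files are resized (e.g. with whisper-resize).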
[15:28:32] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [30.0] [15:29:53] addshore: you need to run tcpdump cpommand as root [15:30:03] addshore: but again you probably do not have root on 1002 [15:30:54] addshore: you CAN ping statsd from 1002: ping statsd.eqiad.wmnet [15:31:22] addshore: ah , no, wait [15:31:51] :P [15:32:10] I tried tel-netting on the port too, but nothing [15:33:22] oh nuria apparently it is working! https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1445441534.023&target=test.bash.foo.rate&target=test.bash.foo.lower [15:34:17] hooray! [15:34:56] very strange joal, i don't see any varnishkafka delivery errors... [15:35:06] rhmmm [15:35:06] i do see that camus seemed to have a short lag (i think) at about that time [15:35:14] the number of bytes out of kafka was flat for a bit then [15:35:43] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [30.0] [15:36:15] ottomata: let's rerun the load for this guy, and see if there is any diff [15:36:18] right ? [15:36:27] I'll keep the old values [15:37:25] ottomata: if it's kafka/camus delay, it's a good example of how useful will be the CamusPartitionChecker [15:37:36] i'm running seq stat check for one host now [15:37:42] to see if it is currently not missing anything [15:37:55] ok ottomata, let me know if we need to backfill [15:37:58] k [15:40:17] yeah, joal, 0 loss now. [15:42:44] rham [15:42:53] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0] [15:43:13] milimetric: today we have a good eample of why backfilling is needed [15:43:22] :( [15:43:24] what happened [15:43:45] milimetric: delay in camus [15:43:51] data partially imported [15:43:55] but actually there [15:44:05] if job had started later [15:44:49] addshore: nice but ahem... how come we cannot ping the machine? [15:45:19] no idea :P [15:53:28] ottomata: do you mind checking upload on hour 16? 
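The "seq stat check" ottomata runs is, in essence: for each varnish host in an hour, compare how many requests actually arrived against how many the sequence numbers say should have (max − min + 1). A rough sketch of that idea against the raw webrequest table — this is not the contents of the gist linked just below, and the table/partition values are illustrative (the hour under discussion):

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

# Assumes a pyspark environment with access to the cluster's Hive metastore.
sc = SparkContext(appName="webrequest-seq-check")
hive = HiveContext(sc)

stats = hive.sql("""
    SELECT hostname,
           COUNT(*)                          AS actual,
           MAX(sequence) - MIN(sequence) + 1 AS expected
    FROM wmf_raw.webrequest
    WHERE webrequest_source = 'upload'
      AND year = 2015 AND month = 10 AND day = 20 AND hour = 16
    GROUP BY hostname
""")

for row in stats.collect():
    missing = row.expected - row.actual
    if missing != 0:
        # missing > 0 means lost requests; < 0 usually means duplicates or a
        # sequence reset (restarted hosts start again at seq = 0).
        print("%s: expected %d, got %d, diff %d"
              % (row.hostname, row.expected, row.actual, missing))
```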
[15:58:44] joal: ahhh, can't got an interview now [15:58:48] will miss standup a-team [15:58:49] joal [15:58:58] ok ottomata [15:59:01] ok ottomata, will do myself :) [15:59:08] this is what I ran to tcheck [15:59:09] https://gist.github.com/ottomata/b8debf5346c73556a156 [15:59:12] just edit as you need [15:59:24] sure, will also the on_call doc [16:02:22] is kevin in a room somewhere [16:03:50] madhuvishy: he's on the 5th floor [16:05:51] ottomata: coming to standup [16:07:20] no in interview [16:22:31] milimetric: https://gerrit.wikimedia.org/r/#/c/247866/ FYI [16:22:59] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [30.0] [16:25:29] Analytics-Backlog, Analytics-Kanban: Add help page in wikitech on what the analytics team can do for you similar to release engineering page - https://phabricator.wikimedia.org/T116188#1742698 (Nuria) NEW [16:25:43] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0] [16:32:18] Analytics-Tech-community-metrics, DevRel-October-2015: Correct affiliation for code review contributors of the past 30 days - https://phabricator.wikimedia.org/T112527#1742714 (Aklapper) * This is time consuming. * I gave priority on merging accounts of people active in last 30 days. * I gave priority on... [16:49:30] Analytics-Backlog, Analytics-Kanban: Projections of cost and scaling for pageview API. - https://phabricator.wikimedia.org/T116097#1742766 (Milimetric) p:Triage>Normal [16:51:09] ottomata: can you come to tasking for a bit? [16:55:55] Analytics-Backlog, Analytics-Kanban: Projections of cost and scaling for pageview API. {hawk} [8 pts] - https://phabricator.wikimedia.org/T116097#1742788 (Milimetric) [16:57:02] Analytics-Backlog, Analytics-Cluster, Analytics-Kanban: https://yarn.wikimedia.org/cluster/scheduler should be behind ldap - https://phabricator.wikimedia.org/T116192#1742793 (Nuria) NEW [16:57:10] Analytics-Backlog, Analytics-Cluster, Analytics-Kanban: https://yarn.wikimedia.org/cluster/scheduler should be behind ldap - https://phabricator.wikimedia.org/T116192#1742801 (Nuria) p:Triage>High [17:06:19] just finished interview [17:23:43] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [30.0] [17:24:52] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0] [17:39:30] ottomata: do you know who created https://hub.docker.com/u/wikimedia/ ? 
[17:39:39] i know you guys were looking into docker which is why i'm asking [17:47:34] Analytics-Backlog: Sanitize pageview_hourly - https://phabricator.wikimedia.org/T114675#1742995 (JAllemandou) a:JAllemandou [17:51:45] PROBLEM - Analytics Cassanda CQL query interface on aqs1003 is CRITICAL: Connection timed out [17:52:10] Analytics-Backlog, Analytics-Cluster, Analytics-Kanban: Add MAPJOIN configuration so in memory mapjoins happen by default in hive [1] - https://phabricator.wikimedia.org/T116202#1743009 (Nuria) NEW a:Nuria [17:52:28] ori: may ask yuvi as well they have done a lot of container stuff for kubernetes trial [17:56:59] ori nm, no i don't [17:57:27] mforns: https://github.com/wikimedia/restbase/blob/master/specs/analytics/v1/pageviews.yaml [17:57:49] RECOVERY - Analytics Cassanda CQL query interface on aqs1003 is OK: TCP OK - 0.000 second response time on port 9042 [18:02:00] joal: ok, what were those storage numbers?! 2 TB for 1/2 month?! [18:02:38] we are the cave [18:02:42] milimetric: --^ [18:02:47] ottomata: around? [18:02:58] i wanna chat about burrow puppet stuff [18:07:36] madhuvishy: yes! [18:07:42] lets chat [18:08:03] okay [18:08:03] so [18:08:17] firstly, we need to figure out how to package burrow [18:08:51] ottomata: https://phabricator.wikimedia.org/T116084 [18:08:56] Analytics, Beta-Cluster-Infrastructure, Services: Set up AQS in Beta - https://phabricator.wikimedia.org/T116206#1743135 (mobrovac) NEW [18:10:27] hehe, yeah! [18:10:37] ottomata: the part I don't know is what would happen when it's packaged and installed - would the burrow script be available somewhere like /usr/bin/burrow? [18:11:36] madhuvishy: i haven't looked at runnig burrow much yet, reading... [18:11:55] madhuvishy: ideally [18:12:02] the burrow script would be somewhere like that, yes [18:12:12] and then a systemd service unit file would configure it to be started as a daemon [18:12:24] ottomata: oh cool, ya that was my next question [18:12:35] madhuvishy: let's ask in #ops, i think moritz is still around, maybe. [18:12:36] because there is no way to specify how to kill it now [18:12:42] or restart [18:12:50] ottomata: okay [18:17:21] Analytics-Backlog, Analytics-Kanban: Improve record size on cassandra storage for pageview API data - https://phabricator.wikimedia.org/T116209#1743188 (Nuria) NEW a:Milimetric [18:24:55] !log Stopped per article loading in cassandra [18:24:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [18:31:12] madhuvishy: aye ok, let's wait for moritz' package, i'm trying to make it work on a VM but not getting very far [18:31:22] ottomata: alright [18:31:33] madhuvishy: for the p uppetization, i think you can assume that you will have a package called golang-burrow or whatever [18:31:36] anyone know about what % of our users have Do Not Track enabled, by chance? [18:31:37] and a service called burrow [18:31:43] ottomata: yeah alright [18:31:46] probably config files in /etc/burrow [18:31:55] cool [18:31:56] so, you can probably go about making config file templates and parameterizing whatever is needed [18:32:05] ottomata: yeah erb sounds like fun :P [18:32:15] ottomata: other question [18:32:22] yes [18:32:26] do we only want to be notified by email? [18:32:44] about? [18:32:45] burrow will post to a http endpoint if we set that up [18:32:48] consumer lag? 
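On the "burrow will post to a http endpoint" option: Burrow also exposes its consumer evaluations over HTTP, so if the team later wants the graphite/icinga route instead of email, a tiny poller could bridge it. A sketch only — the endpoint path and response fields follow Burrow's v2 HTTP API as I understand it and should be treated as assumptions, as should the host, cluster and group names:

```python
import socket
import time

import requests

# All of these values are placeholders / assumptions:
BURROW = "http://localhost:8000"          # burrow's HTTP listener
STATSD = ("statsd.eqiad.wmnet", 8125)
CLUSTER, GROUP = "eqiad", "eventlogging"

def statsd_gauge(metric, value):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(("%s:%d|g" % (metric, value)).encode("utf-8"), STATSD)
    finally:
        sock.close()

while True:
    url = "%s/v2/kafka/%s/consumer/%s/lag" % (BURROW, CLUSTER, GROUP)
    status = requests.get(url).json().get("status", {})
    # Sum per-partition end lag defensively, since the exact field names may differ.
    lag = sum(p.get("end", {}).get("lag", 0) for p in status.get("partitions", []))
    statsd_gauge("kafka.burrow.%s.%s.totallag" % (CLUSTER, GROUP), lag)
    time.sleep(60)
```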
[18:32:49] hm [18:32:53] and email [18:33:06] hm, i guess email is fine for now, not sure what else we owuld have it post to [18:33:09] but i dont know what http endpoint [18:33:11] yeah [18:33:15] ideally we'd get it into icinga [18:33:19] ottomata: yup [18:33:30] but we should run a bot of some kind on a port [18:33:47] that would read the data and send it to graphite/icinga or sth [18:34:22] ja that is possible, sounds a little hacky though. would be nice if we could just have it as an icinga check somehow, dunno [18:34:23] we'll ahve to see [18:34:27] email is fine to start with though [18:34:30] ottomata: okay yeah [18:34:56] i dont know what the smtp server should be though [18:36:00] ottomata: also, where are the hiera config files for our stuff - i don't know how to find them [18:43:20] hieradata [18:43:20] hehe [18:43:32] all over the place [18:46:27] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 46.67% of data above the critical threshold [30.0] [18:48:17] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0] [19:00:11] AndyRussG: did you get an answer to DNT? [19:00:25] nuria: not yet :) [19:01:03] I guess it varies by region, since it'd mainly be IE users, no? [19:01:07] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [30.0] [19:01:38] AndyRussG: answer is we do not know, not sure why would be be IE users... [19:01:54] nuria: 'cause for a while it was enabled by default on IE [19:02:02] At least IE 10 [19:02:15] AndyRussG: you can set it in chrome which is our most popular browser [19:02:26] https://en.wikipedia.org/wiki/Do_Not_Track#Internet_Explorer_10_default_setting_controversy [19:02:54] yeah... But it seems most people won't set it themselves [19:02:54] Analytics-Backlog, Developer-Relations, MediaWiki-API, Reading-Admin, and 5 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1743346 (dr0ptp4kt) [19:02:57] AndyRussG: ah , sorry, but look at browser stats [19:03:06] Yeah [19:03:07] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0] [19:03:25] AndyRussG: IE10 % last time i checked was less than other IE browsers [19:05:22] I was just thinking someone might have a DNT percentage readily available. So far it looks like we may be getting 30% less EL events than we should be for banner history, but that may also be due to some other issues [19:08:53] AndyRussG: have you looked at the EventErrors for validation errors? [19:09:21] madhuvishy: good point! I don't think so [19:11:57] AndyRussG: they are now available here - https://logstash.wikimedia.org/#/dashboard/elasticsearch/eventlogging-errors [19:12:20] madhuvishy: amazing, thanks! [19:12:31] thanks also nuria! [19:14:51] AndyRussG: 30% cannot be dnt, this is a far fetched comparation [19:15:06] AndyRussG: but think that less people will have dnt than addblock [19:15:22] AndyRussG: and add block i think last time i checked was < 10% global [19:15:46] AndyRussG: and with addblock EL will not work either [19:15:55] what if it's 5% most browsers + 90% IE10 or Windows 8.1? [19:16:11] AndyRussG: did you looked at browser stats for IE10? [19:16:29] not yet... 
forgot where I put that link :/ [19:18:13] AndyRussG: https://docs.google.com/a/wikimedia.org/spreadsheets/d/1n9FhSqcBGM9iKXrlHsP0EZI0gU89Rmz5m51uglUGVjs/edit?usp=drive_web [19:18:32] AndyRussG: as i said ie10 is not very popular [19:24:05] nuria: thx... Yeah... Checking it seems it's enabled for IE10 and IE11, but it's still a much smaller percentage [19:24:25] AndyRussG: also for both it can be disabled by user [19:24:41] True [19:28:17] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [30.0] [19:29:39] nuria, can I ask an incredibly dumb question I swear I used to know the answer to? [19:29:49] nuria: madhuvishy: K so our lower-than-expected EL result is surely not due to DNT nor EL validation errors (of which there are just a few, due to just a minor bug in our schema def I see) [19:29:52] Ironholds:of course [19:30:15] Analytics-Backlog: Sanitize pageview_hourly - https://phabricator.wikimedia.org/T114675#1743499 (Tbayer) Question: Does this involve removing values of country data and/or access_method as well, or just the more fine-grained user_agent_map and city? (It's not clear to from the linked Phabricator task and medi... [19:30:38] nuria, when I run mvn compile on refinery-source, where does the built JAR end up ;p [19:31:00] you have to run mvn package [19:31:08] for the jar to be build [19:31:15] under artifacts [19:31:25] ah, that's the problem. Danke! [19:32:09] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0] [19:32:13] AndyRussG: from prior experience when that has happen before [19:33:03] AndyRussG: it was client issues, did you looked whether your messages might be too big (we already mentioned that schema was sending a lot of info instead of events) [19:33:11] AndyRussG: cause too big messages will get dropped [19:34:16] nuria: it could be client issues. Also could be a miscalculation, this was just from the firs dive into the data. We are actually checking on the client that the EL payload isn't too big, though that doesn't mean the check is necessarily perfect [19:35:05] Wouldn't we see EventErrors due to a validation error if the EL event was too big? [19:36:51] AndyRussG: you can look at your schema here and it looks like there rae no clientside errors: https://grafana.wikimedia.org/dashboard/db/eventlogging-schema [19:37:12] AndyRussG: no, you wouldn't , if event is too big it doesn't get sent [19:37:39] AndyRussG: you will see errors reported from client if you are using mediawiki eventlogging code to log events [19:38:14] nuria: aaah OK that's good to know. I thought I saw somewhere that it caused a parsing error. [19:38:31] Sorry what do you mean by the last point? [19:39:08] nuria: BTW Ellery just re-calculated the expected numbers vs. actual and it's much closer than in the previous calculation [19:39:17] AndyRussG: that "too big" events do not get reported as validation errors, they get reported by mediawiki client side javascript code [19:39:28] as "plain" errors [19:39:57] nuria: i.e., just visible in the JS console like other JS errors, correct? [19:40:07] AndyRussG: yes [19:40:17] nuria: cool! [19:40:45] Analytics-Backlog: Sanitize pageview_hourly - https://phabricator.wikimedia.org/T114675#1743534 (Nuria) For now we are concerned just with user agent map. 
[19:42:06] PROBLEM - Analytics Cassanda CQL query interface on aqs1003 is CRITICAL: Connection timed out [19:43:30] (Abandoned) Nuria: [WIP] Pageview Hourly quality check whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/246118 (https://phabricator.wikimedia.org/T110061) (owner: Nuria) [19:47:28] RECOVERY - Analytics Cassanda CQL query interface on aqs1003 is OK: TCP OK - 0.020 second response time on port 9042 [19:48:24] hey [19:48:25] nuria: madhuvishy: K I'll just summarize the above for FR folks and we'll see where we go from here... The dropoff now seems to be about 16% in one case and 5% in another, which is more reasonable in any case [19:48:28] thx again! [19:49:09] In EventLogging, we're providing nulls for an integer type field and getting errors because of it. Is that expected? [19:49:33] Analytics, Beta-Cluster-Infrastructure, Services: Set up AQS in Beta - https://phabricator.wikimedia.org/T116206#1743578 (hashar) [19:50:07] Analytics, Beta-Cluster-Infrastructure, Services, WorkType-NewFunctionality: Set up AQS in Beta - https://phabricator.wikimedia.org/T116206#1743135 (hashar) [19:51:03] hey yall, is AQS flapping because of cassandra loading? [19:53:06] PROBLEM - Analytics Cassanda CQL query interface on aqs1003 is CRITICAL: Connection timed out [19:53:20] milimetric: ^ [19:53:20] ? [19:53:23] i'm getting texted! [19:53:38] gr [19:53:51] Krenair: i think so. null is not an integer [19:54:22] not sure if it's related ottomata, but we found we're using waaay more space on that machine than we thought [19:54:33] I'm currently working on a patch to reduce the size of the data we load [19:54:41] but I've no idea why it's timing out [19:54:49] maybe there's just too much data [19:55:05] you'd think that wouldn't be a problem... [19:55:45] hm, 2.6T used on aqs1001, still 7.4T avail [19:56:04] milimetric: can we disable alerts on AQS for a while? until we know its ready? [19:57:07] oh that is cassandra! [19:57:09] geez [19:58:04] ottomata: yeppp [19:58:16] madhuvishy: ? :) [19:58:24] cassandra and data sizes [19:58:30] we ranted about it all morning [19:59:52] milimetric: madhuvishy: https://gerrit.wikimedia.org/r/#/c/247910/ [20:03:57] RECOVERY - Analytics Cassanda CQL query interface on aqs1003 is OK: TCP OK - 0.998 second response time on port 9042 [20:04:12] ottomata: cool [20:04:50] ottomata: I'm very confused [20:04:55] so nobody was querying it but it died? [20:05:12] how the hell is it going to respond to any kind of load then? [20:05:15] milimetric: its a TCP connection checker [20:05:17] and it can timeout [20:05:27] i dunno! [20:05:37] but, wth, if it can't even respond to a connection checker, how the hell... [20:05:41] ughhhh [20:06:00] :| [20:07:40] (i'm asking services people why this thing's timing out) [20:07:48] is it because of the inserts we're doing? [20:07:58] i mean, i thought that was the whole point, you can insert a lot and it's all good [20:08:08] i thought joseph turned off some inserts [20:08:17] yeah, i thought he turned off the per-article ones [20:08:23] even if he didn't!!! [20:08:26] those are biggest ones no [20:08:38] right, but we need it to be able to insert, wth! [20:08:48] yeah of course [20:09:07] like - this database is great. wait... you need to INSERT into it?!! never mind, gotta find something eles [20:14:20] ottomata: can you specify default values for things in config files in erb? 
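Back on Krenair's null-for-an-integer-field question above: JSON Schema treats null as its own type, so a property declared plain "integer" rejects null, which is the validation error being seen. A minimal illustration with the jsonschema library (the schema here is made up, not a real EventLogging schema):

```python
import jsonschema

# A property typed plain "integer" rejects null: null is its own JSON type.
schema = {
    "type": "object",
    "properties": {
        "revisionId": {"type": "integer"},
    },
}

try:
    jsonschema.validate({"revisionId": None}, schema)
except jsonschema.ValidationError as e:
    print("rejected: %s" % e.message)

# Allowing null explicitly (or omitting the property when there is no value)
# is the usual fix on the schema side:
nullable = {"type": "object",
            "properties": {"revisionId": {"type": ["integer", "null"]}}}
jsonschema.validate({"revisionId": None}, nullable)  # passes
```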
[20:15:59] madhuvishy: hm, yes, although i'd specify them in the puppet manifests, not in the erb template [20:16:12] ottomata: aah [20:16:13] okay [20:19:11] Analytics-Kanban, operations, Monitoring, Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1743691 (Ottomata) Rats, I see some segfaults: Oct 20 03:40:14 cp4014 kernel: [13259932.218753] diamond[10119]: segfault at 7fc4953d4620 ip 00007fc4955de52d sp 00007fffc99561d0... [20:57:47] milimetric, yt? [20:57:53] ype [20:57:57] hi mforns [20:57:58] hey :] [20:58:06] I saw nuria merged the EL change [20:58:07] how long you workin today? [20:58:14] yes [20:58:15] for an hour and a half more [20:58:21] I have a question for milimetric [20:58:29] do you want to backfill? [20:58:36] ok, mforns, so you wanna deploy and start backfilling? [20:58:37] sure [20:58:41] nuria: what's up [20:58:51] ok [20:58:57] (i'll be in the batcave [20:58:58] ) [20:59:03] the whitelist check touches teh same files you just merged for pageview_hourly [20:59:43] milimetric: the check was happening before as part of "mark dataset done" [20:59:59] milimetric: https://gerrit.wikimedia.org/r/#/c/240099/10/oozie/pageview/hourly/workflow.xml [21:00:30] ok [21:00:42] milimetric: i should put it there still right? [21:01:25] sorry so what's this whitelist thing? [21:01:29] nuria: wanna come to the batcave? [21:01:36] milimetric: yes [21:03:33] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [30.0] [21:08:54] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0] [22:05:55] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1744036 (GWicke) We are having a hangout meeting tomorrow (Thursday, 22nd) between 11&12am SF time. Please let us know if you'd like to join. [22:28:08] halfak: yt? [22:28:45] o/ Ori [22:28:47] What's up? [22:29:31] halfak: apart from edits explicitly marked "minor", is there anything like a formal definition of what a small, "wikignome" edit is? [22:29:38] something like a character count [22:30:02] ori, I don't have anything like that, but I am in the middle of working on an edit type classifier. [22:30:16] oh, cool! [22:30:34] It's pretty easy to detect simple mechanical things like "Adds a wikilink" or even "Disambiguates a wikilink" [22:30:50] but it is hard to know if an edit rephrased a sentence or changed its meaning. [22:30:59] right [22:31:22] this is not urgent, so i'll just try to keep up with your work [22:33:28] Maybe you could tell me about the use-case and I could make some proposals. [22:33:30] ori, ^ [22:34:32] halfak: we think performance may correlate with wikignome productivity [22:34:46] we (=performance team) don't have the bandwidth to conduct a full experiment for now, but it's something that we'd like to have instrumented [22:35:07] so i just wanted some simple counter metric in graphite that tracks the rate at which such edits are being made [22:36:32] ori, I'd look for tool-based edits more than general wikignoming. [22:36:37] That we're good at detecting [22:36:54] halfak: how do we detect those? user-agent, bot flag? [22:36:59] https://wikitech.wikimedia.org/wiki/Help:MySQL_queries#Automated_tool_and_bot_edits [22:37:06] We use regexes on the edit comments. 
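A sketch of the approach halfak describes — classify an edit as tool-assisted by matching its comment against per-tool signatures. The patterns below are illustrative stand-ins; the canonical set lives on the wikitech page linked above:

```python
import re

# Illustrative stand-ins for the kind of comment regexes used to spot tool edits.
TOOL_COMMENT_PATTERNS = {
    "huggle":  re.compile(r"\(\[\[WP:HG\|HG\]\]\)", re.IGNORECASE),
    "twinkle": re.compile(r"\[\[WP:TW\|TW\]\]", re.IGNORECASE),
    "awb":     re.compile(r"using \[\[Project:AWB\|AWB\]\]", re.IGNORECASE),
    "popups":  re.compile(r"using \[\[[^\]]*popups", re.IGNORECASE),
}

def tool_for(comment):
    """Return the first tool whose signature matches the edit comment, else None."""
    for tool, pattern in TOOL_COMMENT_PATTERNS.items():
        if pattern.search(comment):
            return tool
    return None

print(tool_for("Reverted edits by Example ([[WP:HG|HG]])"))                             # huggle
print(tool_for("clean up, typos fixed: recieve -> receive using [[Project:AWB|AWB]]"))  # awb
```

Counting the matched edits per tool over time would then give the kind of "wikignome productivity" counter ori is after, which could be pushed to graphite with the statsd pattern shown earlier in the log.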
[22:37:16] It's not brilliant, but it is effective [22:37:34] That query is about as inefficient as they come. [22:37:50] But we can use the regexes and build a better one. [22:39:19] nod [22:39:24] thanks, that's very helpful! [22:39:46] I'll see if I can use the labeled data we have about edit types to build some good proxies too. In case I'm wrong. [22:39:51] We should be publishing that data anyway. [22:43:54] good night a-team! [22:44:55] Analytics-Backlog, operations, Patch-For-Review: erbium (logging) - useradd: group '30001' does not exist - https://phabricator.wikimedia.org/T115943#1744196 (chasemp) Open>Resolved This has been resolved with an exception allowing the file_mover UID [22:45:11] good night mforns! [22:51:30] Analytics-Kanban: EventLogging mysql consumer can be killed by a bad event - https://phabricator.wikimedia.org/T116241#1744214 (Milimetric) NEW a:Milimetric [22:52:07] Analytics-Kanban, Patch-For-Review, WMF-deploy-2015-10-27_(1.27.0-wmf.4): Incident: EventLogging mysql consumer stopped consuming from kafka {oryx} - https://phabricator.wikimedia.org/T115667#1729697 (Milimetric) [22:52:08] Analytics-Kanban: EventLogging mysql consumer can be killed by a bad event - https://phabricator.wikimedia.org/T116241#1744214 (Milimetric) [22:52:19] blocked on backfilling, details in those two updates above ^ [22:52:27] abandoning for tonight [22:53:14] !log deployed EventLogging and tried to backfill data lost on 2015.10.14 but failed [22:53:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [22:53:45] nite all [23:01:30] anybody here could answer a couple of questions abut spark & analytics cluster? [23:12:33] Analytics: Update reportcard.wmflabs.org with September data - https://phabricator.wikimedia.org/T116244#1744291 (Tbayer) NEW [23:18:09] Analytics-Wikistats: Discrepancies in historical total active editor numbers - https://phabricator.wikimedia.org/T87738#1744306 (Tbayer) The historical numbers for the same month continue to jump up and down merrily, by very implausible amounts, e.g.: | |given [[https://web.archive.org/web/20151021210417/ht... 
[23:21:17] Analytics, Services, operations: Set up LVS for AQS - https://phabricator.wikimedia.org/T116245#1744311 (mobrovac) NEW [23:22:42] Analytics, Services, operations: Set up LVS for AQS - https://phabricator.wikimedia.org/T116245#1744321 (mobrovac) [23:22:45] Analytics-Kanban, RESTBase, Services, Patch-For-Review: configure RESTBase pageview proxy to Analytics' cluster {slug} [3 pts] - https://phabricator.wikimedia.org/T114830#1744320 (mobrovac) [23:24:34] Analytics-Kanban, RESTBase, Services, Patch-For-Review, RESTBase-API: configure RESTBase pageview proxy to Analytics' cluster {slug} [3 pts] - https://phabricator.wikimedia.org/T114830#1707320 (mobrovac) [23:46:08] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1744407 (GWicke) NEW [23:46:17] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1744407 (GWicke) p:Normal>High [23:48:12] Analytics-Kanban, operations, Monitoring, Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1744422 (chasemp) @ottomata ping me tomorrow if you can, I'd like to give helping a whirl [23:49:54] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1744425 (GWicke) [23:50:51] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1744407 (GWicke) [23:52:17] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1744440 (GWicke) [23:52:44] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#1744446 (GWicke) [23:53:36] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#933968 (GWicke)