[00:01:56] (PS1) Yurik: https by default, bug fix, saferun tile output [analytics/zero-sms] - https://gerrit.wikimedia.org/r/247763 [00:02:17] (CR) Yurik: [C: 2 V: 2] https by default, bug fix, saferun tile output [analytics/zero-sms] - https://gerrit.wikimedia.org/r/247763 (owner: Yurik) [00:18:51] Analytics, Continuous-Integration-Config, WMDE-Analytics-Engineering, Wikidata: Add basic jenkins linting to analytics-limn-wikidata-data - https://phabricator.wikimedia.org/T116007#1740759 (Dzahn) [00:19:05] Analytics, Continuous-Integration-Config, WMDE-Analytics-Engineering, Wikidata: Add basic jenkins linting to analytics-limn-wikidata-data - https://phabricator.wikimedia.org/T116007#1738262 (Dzahn) [00:39:56] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [30.0] [00:41:37] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0] [01:24:31] Analytics-Backlog: Create deb package for Burrow - https://phabricator.wikimedia.org/T116084#1740907 (madhuvishy) [01:33:44] Analytics-Backlog, Analytics-Kanban: Projections of cost and scaling for pageview API. - https://phabricator.wikimedia.org/T116097#1740916 (kevinator) I think we should look at a 5 year horizon. How much will we need to spend when the hardware goes out of warranty and needs to be replaced. Keep in mind t... [01:49:03] Analytics-Backlog: Make beeline easier to use as a Hive client {hawk} - https://phabricator.wikimedia.org/T116123#1740922 (madhuvishy) NEW [10:05:38] hi a-team :] [10:07:00] (PS1) Christopher Johnson (WMDE): adds internal_reference function fixes column width for datatables on home fixes seeAlso missing param [wikidata/analytics/dashboard] - https://gerrit.wikimedia.org/r/247799 [10:13:41] (PS2) Christopher Johnson (WMDE): adds internal_reference function fixes column width for datatables on home fixes seeAlso missing param fixes config path in bulk_sparql script [wikidata/analytics/dashboard] - https://gerrit.wikimedia.org/r/247799 [10:15:03] (CR) Christopher Johnson (WMDE): [C: 2 V: 2] adds internal_reference function fixes column width for datatables on home fixes seeAlso missing param fixes config path in bulk_sparql scri [wikidata/analytics/dashboard] - https://gerrit.wikimedia.org/r/247799 (owner: Christopher Johnson (WMDE)) [10:43:55] joal, yt? [11:09:50] (PS1) Christopher Johnson (WMDE): adds Wikimedia Categories to home view [wikidata/analytics/dashboard] - https://gerrit.wikimedia.org/r/247808 [11:09:52] (PS1) Christopher Johnson (WMDE): fixed untracked files [wikidata/analytics/dashboard] - https://gerrit.wikimedia.org/r/247809 [11:11:50] (CR) Christopher Johnson (WMDE): [C: 2 V: 2] adds Wikimedia Categories to home view [wikidata/analytics/dashboard] - https://gerrit.wikimedia.org/r/247808 (owner: Christopher Johnson (WMDE)) [11:12:14] (CR) Christopher Johnson (WMDE): [C: 2 V: 2] fixed untracked files [wikidata/analytics/dashboard] - https://gerrit.wikimedia.org/r/247809 (owner: Christopher Johnson (WMDE)) [11:34:15] (PS3) Mforns: Add oozie job to compute browser usage reports [analytics/refinery] - https://gerrit.wikimedia.org/r/246851 (https://phabricator.wikimedia.org/T88504) [11:37:45] hey mforns [11:37:49] Went to lunch :) [11:37:52] hi joal! [11:38:02] wassup? [11:38:31] I was going to try and test the eventlogging patch in beta cluster [11:38:44] okeyyy [11:39:01] do you want to this too? 
[11:39:08] sure ! [11:39:20] Do you give me 5 mins before starting ? [11:39:26] Then fully up with you :) [11:39:34] of course [11:39:42] k brb [11:48:50] mforns: to batcave :) [11:48:54] joal, omw [12:40:13] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [30.0] [12:41:39] (PS1) Addshore: Improve README [wikidata/analytics/dashboard] - https://gerrit.wikimedia.org/r/247824 [12:43:44] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0] [13:03:55] (CR) Addshore: [C: 2 V: 2] Improve README [wikidata/analytics/dashboard] - https://gerrit.wikimedia.org/r/247824 (owner: Addshore) [13:04:20] ottomata, yt? [13:08:13] hiya yup [13:08:22] milimetric: sorry i missed your text yesterday, [13:08:39] sok! [13:08:46] after hours, didn't expect you to be around [13:08:55] just checking in case you were like staring at a wall and would rather deploy instead [13:10:05] Quick query again all! How longs does data stay in graphite? Is there a way to make it stay longer? and if I understand correctly anyone with cluster access can just throw more data into graphite? [13:10:33] addshore: it stays for a while but starts losing resolution [13:10:40] ahh okay [13:10:41] I'm not sure when it loses what resolution level [13:10:51] thats good to know! :) [13:10:53] mforns_lunch: hiya [13:10:58] and yeah, we use statsd to send data to graphite, here's an example for event logging, one sec [13:11:33] milimetric any idea if at some point the resolution becomes 0? or just less and less but never 0? [13:11:36] this is where we create the client: https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/server/eventlogging/handlers.py#L250 [13:12:06] and here's the puppet that configures that host: https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/eventlogging.pp#L38 [13:12:34] addshore: I don't think it ever becomes zero... I think yearly at worst, but let me check back, I'm curious [13:12:42] awesome! [13:13:27] for a bit of background, im working on tracking wikidata metrics, and currently I have a bunch of scripts getting data, and storing in mysql and outputing as tsv. I'm wondering if it would make more sense to kill the sql and chuck it all in graphite [13:13:59] though, I feel I need to know more about graphite before I di that ;) [13:14:33] addshore: as long as you're just counting things, then I think graphite's great. By the way, did you know you can use grafana to hit the same data and make your graphs look much prettier? [13:14:45] here's the example for EL: https://grafana.wikimedia.org/dashboard/db/eventlogging [13:14:51] milimetric: yup! and basically everything is counting! [13:15:00] welll :) [13:15:10] not quite, but a lot of things are [13:15:12] Currently we have wdm.wmflabs.org/wdm/ but personally I would prefer to use grafana [13:15:15] http://wdm.wmflabs.org/wdm/ [13:15:20] using shiney [13:15:46] yeah, I like shiney, but if you're just counting it's easier to use grafana [13:15:58] of course, it's also easier to just keep what you have since you already did all the work [13:16:05] well, I would say most of it is just counting, and the some other calculating [13:16:08] but I get it if it makes you feel dirty :) [13:16:33] can you add legacy data / data with old timestamps into graphite? [13:16:42] or just live / now data? 
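The statsd pattern milimetric links to above boils down to firing small UDP packets in statsd's text format ("<metric>:<value>|<type>") at the statsd host. A minimal sketch of that idea — the host, port and metric name here are illustrative placeholders, not the production configuration:

```python
import socket

# statsd's wire format is "<metric>:<value>|<type>", where the type is
# "c" for counters, "g" for gauges and "ms" for timers.
STATSD_HOST = "statsd.eqiad.wmnet"  # assumption: the sending host can reach statsd
STATSD_PORT = 8125

def incr(metric, value=1):
    """Increment a counter; UDP is fire-and-forget, so a dropped packet is just a lost point."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(("%s:%d|c" % (metric, value)).encode("utf-8"),
                    (STATSD_HOST, STATSD_PORT))
    finally:
        sock.close()

incr("eventlogging.example_schema.inserted")
```

Whether a given machine can reach statsd at all is a network/firewall question, which is exactly what addshore bumps into later in the log.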
[13:19:05] (CR) Joal: [C: -1] "Comments inline :)" (5 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/246851 (https://phabricator.wikimedia.org/T88504) (owner: Mforns) [13:19:39] addshore: I've never done that, I briefly looked through the docs and couldn't find a way [13:19:49] okay! [13:19:59] right, time for some more reading I think :) [13:22:07] addshore: Madhu used Spark to send data to Graphite, and that data was historical :) [13:22:16] ooooohhhh [13:22:34] right, that's good, because that means I don't have to try and rush into a decision ;) [13:23:04] How about a way to easily export all data stored in graphite for a metric? say to a tsv? [13:23:08] I think the message sent to graphite contains time identification [13:23:36] hm, data export from graphite I don't really know [13:23:48] I think graphite provides an API to get data [13:24:02] okay, yep, going to go and trawl through the docs then! :) [13:24:13] thanks all! :) as always! [13:24:13] Meaning, the charts are displayed from getting data through an API, so there must be a way to get your hands on the raw data [13:24:34] np addshore [13:24:52] milimetric: Do you want me to test your oozie stuff ? [13:25:12] It looks good, but if it makes you feel better, I'll test it :) [13:25:16] joal: did you see the oozie tests I talked about in the comments? [13:25:23] or did you mean you want to re-test because of the changes? [13:26:08] The data sent must be in the following format: <metric path> <value> <timestamp>, Cool! [13:26:31] re-reading oozie test [13:27:51] seems fine to me milimetric, I assume you have double checked the exported files in hdfs ? [13:27:59] milimetric: I found the retention info! https://github.com/wikimedia/operations-puppet/blob/2e1a45c0f89082564e3e201ebfc0cdd0a1db5871/manifests/role/graphite.pp#L38 [13:28:40] joal: yes, I head and tail-ed both of them [13:28:44] and they looked legit [13:28:44] so it looks like after a year it will drop out [13:29:08] addshore: aha, oh! that makes the EL data make sense now [13:29:24] I wasn't seeing anything beyond a year ago and was sure we had collected before that [13:29:37] but in theory graphite can keep as much as you have storage [13:30:13] how much storage do we have? :P [13:30:26] we could easily add 1d:5y and 1w:100y if ops is ok with it [13:30:34] or something like that [13:30:56] I might go and propose a patch for that (or something like that). [13:31:25] as if we have that (or something similar) then graphite would be perfect for 99% of the stuff people want me to track.. [13:31:30] makes sense to me except i wish it could be customizable [13:31:38] it sucks that if you do that it'll start keeping data for everything [13:32:13] yeh, well, it can match regexes [13:32:59] so in theory I could just have longer retention for my stuff [13:34:40] (CR) Joal: [C: 2] "Leaving as is for merge synchro, good to go !" [analytics/refinery] - https://gerrit.wikimedia.org/r/246149 (https://phabricator.wikimedia.org/T114379) (owner: Milimetric) [13:35:04] joal: that can be merged, it's next on the deployment plan [13:35:10] as long as the deployment plan makes senes [13:35:13] *sense [13:35:49] ottomata: ping me when you're ready to deploy AQS [13:35:56] I made this etherpad: https://etherpad.wikimedia.org/p/analytics-deploy-aqs [13:36:11] I did all the steps up to the deploy section [13:36:20] milimetric: about deployment, have the aggregator changes been merged ?
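The format joal is quoting is carbon's plaintext protocol: one datapoint per line with an explicit unix timestamp, which is why historical backfills (like the Spark job he mentions) are possible — unlike statsd, which always stamps data with "now". A sketch, with a placeholder carbon host and metric name:

```python
import socket
import time

# Carbon's plaintext listener (port 2003 by default) takes lines of the form
#   <metric path> <value> <unix timestamp>
# The explicit timestamp is what makes backfilling old data possible, provided
# the retention window still covers that time.
CARBON_HOST = "graphite-in.example.org"  # assumption: whatever carbon relay is reachable
CARBON_PORT = 2003

def send_points(points):
    """points: iterable of (metric, value, unix_ts) tuples."""
    sock = socket.create_connection((CARBON_HOST, CARBON_PORT))
    try:
        payload = "".join("%s %s %d\n" % (m, v, ts) for m, v, ts in points)
        sock.sendall(payload.encode("utf-8"))
    finally:
        sock.close()

week_ago = int(time.time()) - 7 * 24 * 3600
send_points([("wikidata.example.item_count", 19000000, week_ago)])
```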
[13:36:42] joal: yeah, you merged them yesterday when you +2-ed it [13:36:48] milimetric: lets do it [13:36:52] I left you a comment about it, some repos automatically merge when you +2 [13:36:54] got some updated docuemntations? [13:37:01] (CR) Ottomata: [C: 1] Archive hourly pageviews in legacy format [analytics/refinery] - https://gerrit.wikimedia.org/r/246149 (https://phabricator.wikimedia.org/T114379) (owner: Milimetric) [13:37:04] that etherpad I linked to ^ [13:37:09] oh thanks [13:37:13] (CR) Joal: [V: 2] "Good for me, merging." [analytics/refinery] - https://gerrit.wikimedia.org/r/246149 (https://phabricator.wikimedia.org/T114379) (owner: Milimetric) [13:37:20] hmmm, milimetric, really ? [13:37:24] ottomata: I did all the steps, we just have to do the "deploy" section [13:37:28] joal: yes, really :) [13:37:35] arf, memory [13:37:47] milimetric: i think I don't need to do the update steps, right? [13:37:50] k [13:38:03] hahaha, if I was worried about missing things I'd be under a desk with a tinfoil hat rocking back and forth [13:38:18] ottomata: right, so just the --check and the real one if the check's ok [13:38:24] k [13:39:11] ok all 3 hosts at check look ok [13:39:18] !log deploying aqs [13:39:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [13:39:53] dan, shoudl we put this at https://wikitech.wikimedia.org/wiki/Analytics/AQS [13:39:54] ? [13:39:58] your etherpad stuff? [13:40:04] milimetric: deploy done. [13:40:30] milimetric: I left a comment saying, not merging for synchro, and it merged nonetheless ! [13:40:30] so FYI you can get raw data of graphs as json, https://graphite.wikimedia.org/render/?target=jobrunner.pop.wikibase-addUsagesForPage.ok.mw1004.count&format=json , still hunting for just the data [13:40:38] I assumed jenkins to read my comments, pfff [13:41:58] stupid jenkins [13:42:18] ottomata: hm... I wonder we probably have to restart restbase now, right? [13:42:27] milimetric: shall we deploy and restart the coordinators ? [13:42:31] I was just checking and it doesn't seem to be serving at the new address [13:42:39] joal: yes, after aqs [13:42:41] milimetric: it should restart it as part of deploy [13:42:53] did it say it did that properly in the output? [13:43:01] ha, ithink so, buuuut i have closed the window [13:43:04] doesn't hurt to deploy again, ja? [13:43:05] milimetric: hm, thinking of that, there is a big bunch of changes to deploy, let's wait for nuria [13:43:35] milimetric: i'm running deploy again to see [13:43:51] hm... something's weird, it doesn't have my change in the source [13:43:52] oh, now it says skipping restart because nothing changed [13:44:04] mmhm.... [13:44:07] hm.... [13:44:08] oh! [13:44:17] hm, there is a restart task, shall I run it? [13:44:25] wait, no, it's not pushing my changes [13:44:42] ah, ottomata, my bad, I missed a step: [13:44:42] https://gerrit.wikimedia.org/r/#/c/247726/ [13:44:45] that needs to be merged [13:47:21] ok so deploy again? [13:47:24] it is merged now, ja? [13:47:27] ok, ottomata, sorry, I updated the etherpad, yes, ansible again [13:47:33] yes, merged now [13:47:41] we should get that repo to ping us here in chat [13:48:05] i think you should get deploy perms [13:48:07] :) [13:48:51] yes, that would be nice [13:48:58] especially since this has so many moving parts [13:49:01] make a ticket? 
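addshore's render URL above is already most of the answer to "export a metric, say to a tsv": ask the render API for format=json (format=csv also exists) and flatten the datapoints. A small sketch using requests, reusing the metric from that URL:

```python
import requests

# For format=json the render API returns a list of series:
#   [{"target": "<series name>", "datapoints": [[value_or_null, unix_ts], ...]}, ...]
resp = requests.get("https://graphite.wikimedia.org/render/", params={
    "target": "jobrunner.pop.wikibase-addUsagesForPage.ok.mw1004.count",
    "from": "-7d",
    "format": "json",
})
resp.raise_for_status()

for series in resp.json():
    for value, ts in series["datapoints"]:
        if value is not None:
            # tab-separated: metric, timestamp, value
            print("%s\t%d\t%s" % (series["target"], ts, value))
```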
[13:49:08] you and joseph [13:49:11] good idea, doing [13:49:16] thx guys [13:49:18] it might mean i need sudo, which is scary [13:49:21] i think you need separate tickets for that [13:49:28] hopefully you only need sudo to a certain user [13:49:38] ok, looks good milimetric, it did restart restbase [13:51:53] * milimetric checks [13:55:37] guys I'm having this problem that's driving me crazy [13:56:13] if I press the up arrow in my shell to see my history, it's coming up distorted, chopped in half with part of the command mangled [13:56:27] anyone get that and know how to fix it? [13:57:54] haha nope [13:58:18] happens to me sometimes milimetric [13:58:25] never managed to have it fixed :( [14:00:12] it's happening every time and it's like ... maddening [14:00:26] I have to type out the whole aqs url every time [14:01:49] milimetric: so retaining everything in graphite at a daily scale for 50 years (after the current retention) would result in 170% of the current storage being used :P [14:02:01] just to put a random number out there ;) [14:02:20] yeah, but after a year is daily resolution even worth it? [14:02:41] I'd go to weekly resolution personally, maybe after 1 year or maybe after 3 years [14:02:49] depends on what data you're recording :) [14:04:24] well, daily for 10 years and weekly for 50 would just be 124% of the current amount [14:05:35] but, again, it still might be better to put the metrics I want in something that can easily be matched by a regex and then have another resolution applied [14:06:15] that probably has the best chance of being merged [14:06:51] right, I'll think about that then! and try and think of a place / name start that others could potentially use too [14:08:01] ottomata: thanks, all looks good [14:08:09] just took me like 30 minutes to check as I fought with my CLI [14:08:10] ughhh [14:08:11] :) [14:08:11] ottomata: there was loss in today's report for raw webrequests [14:08:20] Anything coming to mind ? [14:09:08] oooo, interesting, this retention is set at the time the first metric is sent, so changing the config in puppet won't alter the retention of any metrics already stored [14:10:14] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [30.0] [14:11:15] joal: wanna do the coordinators now? [14:12:18] joal i saw that, the 30% in the one mobile hour! not sure, no. [14:12:48] milimetric: if we wait for nuria there'll be the cassandra changes merged and deployed in the mean time [14:12:52] milimetric: is that ok ? [14:13:03] ottomata: will have a deeper look [14:13:06] mmmmmmmmmmm okkkk :) [14:13:20] mforns_lunch: lemme know when you're back, we can do the EL stuff? [14:13:27] milimetric: you are too nice to me, remember we were planning to be better at saying no ;) [14:13:52] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0] [14:13:54] oh [14:14:03] * milimetric kicks joal in the shin [14:14:11] there [14:14:16] :D [14:32:37] milimetric, hi! [14:32:44] I'm back :] [14:33:22] cool, I was gonna try to restore the missing data using your changes [14:33:38] milimetric, we tried earlier today with Joseph [14:33:41] oh! [14:33:45] sweet [14:33:49] how'd that go? [14:33:51] but we had no sudo permits in deployment-eventlogging03 [14:34:01] since the change to 03 I haven't asked for them [14:34:31] I didn't know I should...
I thought permits would be transmitted [14:34:55] hm, weird [14:34:59] but we can test on an1004 [14:35:07] mmm ok [14:35:39] hm, but i suppose that's a pretty beefy machine and it'd be hard to replicate it going OOM [14:36:12] milimetric, I wasn't planning to test it for OOM [14:36:20] just to test it for normal functioning [14:36:27] k, then an04 should be great [14:36:45] I'll checkout the change there and consume a file into sql [14:36:54] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [30.0] [14:37:03] is it analytics1004.eqiad.wmnet? [14:37:37] ok [14:37:40] batcave? [14:38:15] in which folder are you checking out? [14:38:19] there's nothing in srv [14:38:46] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0] [14:44:11] milimetric, can I help, batcave? [14:44:20] pair? [14:44:25] mforns: ah, sorry :) [14:44:26] sure! [14:44:30] cool [14:49:22] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [30.0] [14:51:03] (PS1) Joal: Correct camus-partition-checker to use hdfs conf [analytics/refinery/source] - https://gerrit.wikimedia.org/r/247847 [14:51:04] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0] [14:58:39] In theory, should sending something to statsd with a one liner like "echo "test.foo:100|c" | nc -w 1 -u statsd.eqiad.wmnet 8125" work? :/ [14:59:53] addshore: can you see statsd from the machine you are doing those commands? [15:00:04] addshore: it is not accessible from all places [15:00:11] ahhhh [15:00:29] I can't ping it ;) [15:01:36] addshore: right, cause -if i remember this right- there are several statsd hosts you can reach from certain others but not from anywhere, the deatils you have to ask in ops channel [15:01:40] *details [15:01:49] okay! [15:05:32] addshore: btw, handy tcpdump to see what is going on: tcpdump host statsd.eqiad.wmnet -nnXSs 0 [15:06:43] https://usercontent.irccloud-cdn.com/file/19tZllz5/ [15:06:56] FYI I was looking at sending stuff to statsd from stat1002 [15:17:43] addshore: ah, not possible probably [15:18:03] sad times, wonder if I can find the thing in puppet I would need to poke [15:18:07] mforns: i think we should merge this one: https://gerrit.wikimedia.org/r/#/c/246796/4 [15:18:43] addshore: i do not think 1002 is meant to connect to statsd at all but cc ottomata in case i am wrong [15:19:09] addshore: our reporting through statsd is done in the EL nodes [15:19:21] EL nodes? ;) [15:19:27] addshore: eventlogging [15:19:37] ahh, okay. [15:21:27] hmmm [15:21:41] ottomata: regular loss on hour 10 yesterday for every mobile and maps varnish :( [15:21:44] i think stat1002 should be able to reach statsd [15:21:59] joal: in all clusters? [15:22:06] yessir [15:22:06] esams, etc.? [15:22:30] maps only served from eqiad, but mobile, all clusters [15:22:35] i mean, that sounds like a restart of some kind, buuuuuut, we are ignoring where seq = 0, right? [15:22:37] ottomata: --^ [15:22:45] ottomata: no seq=0 [15:25:40] ottomata: i'll keep poking and see [15:26:53] joal this is 2015-10-20T10 ja? [15:27:08] yup [15:27:59] mforns: i think we can merge the EL sleep patch right?
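Coming back to the retention-cost estimates addshore throws out above (the "170%" and "124%" figures): whisper archives are fixed-size — roughly 12 bytes per stored point plus small headers — so comparing retention schemas is essentially comparing total point counts per metric. A quick sketch; the "current" schema below is an assumption for illustration, not necessarily what the graphite.pp role actually sets:

```python
SECONDS = {"m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}

def points(retention):
    """'1d:5y' -> number of datapoints that archive holds."""
    precision, duration = retention.split(":")
    to_secs = lambda s: int(s[:-1]) * SECONDS[s[-1]]
    return to_secs(duration) // to_secs(precision)

def total_points(schema):
    return sum(points(r) for r in schema)

# Assumed current per-metric schema, purely for illustration:
current = ["1m:7d", "5m:14d", "15m:30d", "1h:1y"]
# milimetric's suggestion from the log: add 1d:5y and 1w:100y on top.
proposed = current + ["1d:5y", "1w:100y"]

ratio = 100.0 * total_points(proposed) / total_points(current)
print(total_points(current), total_points(proposed), "%.0f%% of current size" % ratio)
```

This also lines up with the note above that retention is baked into each whisper file when its first datapoint arrives, so a schema change only affects newly created metrics unless existing files are resized (e.g. with whisper-resize).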
[15:28:32] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [30.0] [15:29:53] addshore: you need to run tcpdump cpommand as root [15:30:03] addshore: but again you probably do not have root on 1002 [15:30:54] addshore: you CAN ping statsd from 1002: ping statsd.eqiad.wmnet [15:31:22] addshore: ah , no, wait [15:31:51] :P [15:32:10] I tried tel-netting on the port too, but nothing [15:33:22] oh nuria apparently it is working! https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1445441534.023&target=test.bash.foo.rate&target=test.bash.foo.lower [15:34:17] hooray! [15:34:56] very strange joal, i don't see any varnishkafka delivery errors... [15:35:06] rhmmm [15:35:06] i do see that camus seemed to have a short lag (i think) at about that time [15:35:14] the number of bytes out of kafka was flat for a bit then [15:35:43] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [30.0] [15:36:15] ottomata: let's rerun the load for this guy, and see if there is any diff [15:36:18] right ? [15:36:27] I'll keep the old values [15:37:25] ottomata: if it's kafka/camus delay, it's a good example of how useful will be the CamusPartitionChecker [15:37:36] i'm running seq stat check for one host now [15:37:42] to see if it is currently not missing anything [15:37:55] ok ottomata, let me know if we need to backfill [15:37:58] k [15:40:17] yeah, joal, 0 loss now. [15:42:44] rham [15:42:53] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0] [15:43:13] milimetric: today we have a good eample of why backfilling is needed [15:43:22] :( [15:43:24] what happened [15:43:45] milimetric: delay in camus [15:43:51] data partially imported [15:43:55] but actually there [15:44:05] if job had started later [15:44:49] addshore: nice but ahem... how come we cannot ping the machine? [15:45:19] no idea :P [15:53:28] ottomata: do you mind checking upload on hour 16? 
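The "seq stat check" ottomata runs is, in essence: for each varnish host in an hour, compare how many requests actually arrived against how many the sequence numbers say should have (max − min + 1). A rough sketch of that idea against the raw webrequest table — this is not the contents of the gist linked just below, and the table/partition values are illustrative (the hour under discussion):

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

# Assumes a pyspark environment with access to the cluster's Hive metastore.
sc = SparkContext(appName="webrequest-seq-check")
hive = HiveContext(sc)

stats = hive.sql("""
    SELECT hostname,
           COUNT(*)                          AS actual,
           MAX(sequence) - MIN(sequence) + 1 AS expected
    FROM wmf_raw.webrequest
    WHERE webrequest_source = 'upload'
      AND year = 2015 AND month = 10 AND day = 20 AND hour = 16
    GROUP BY hostname
""")

for row in stats.collect():
    missing = row.expected - row.actual
    if missing != 0:
        # missing > 0 means lost requests; < 0 usually means duplicates or a
        # sequence reset (restarted hosts start again at seq = 0).
        print("%s: expected %d, got %d, diff %d"
              % (row.hostname, row.expected, row.actual, missing))
```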
[15:58:44] joal: ahhh, can't got an interview now [15:58:48] will miss standup a-team [15:58:49] joal [15:58:58] ok ottomata [15:59:01] ok ottomata, will do myself :) [15:59:08] this is what I ran to tcheck [15:59:09] https://gist.github.com/ottomata/b8debf5346c73556a156 [15:59:12] just edit as you need [15:59:24] sure, will also the on_call doc [16:02:22] is kevin in a room somewhere [16:03:50] madhuvishy: he's on the 5th floor [16:05:51] ottomata: coming to standup [16:07:20] no in interview [16:22:31] milimetric: https://gerrit.wikimedia.org/r/#/c/247866/ FYI [16:22:59] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [30.0] [16:25:29] Analytics-Backlog, Analytics-Kanban: Add help page in wikitech on what the analytics team can do for you similar to release engineering page - https://phabricator.wikimedia.org/T116188#1742698 (Nuria) NEW [16:25:43] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0] [16:32:18] Analytics-Tech-community-metrics, DevRel-October-2015: Correct affiliation for code review contributors of the past 30 days - https://phabricator.wikimedia.org/T112527#1742714 (Aklapper) * This is time consuming. * I gave priority on merging accounts of people active in last 30 days. * I gave priority on... [16:49:30] Analytics-Backlog, Analytics-Kanban: Projections of cost and scaling for pageview API. - https://phabricator.wikimedia.org/T116097#1742766 (Milimetric) p:Triage>Normal [16:51:09] ottomata: can you come to tasking for a bit? [16:55:55] Analytics-Backlog, Analytics-Kanban: Projections of cost and scaling for pageview API. {hawk} [8 pts] - https://phabricator.wikimedia.org/T116097#1742788 (Milimetric) [16:57:02] Analytics-Backlog, Analytics-Cluster, Analytics-Kanban: https://yarn.wikimedia.org/cluster/scheduler should be behind ldap - https://phabricator.wikimedia.org/T116192#1742793 (Nuria) NEW [16:57:10] Analytics-Backlog, Analytics-Cluster, Analytics-Kanban: https://yarn.wikimedia.org/cluster/scheduler should be behind ldap - https://phabricator.wikimedia.org/T116192#1742801 (Nuria) p:Triage>High [17:06:19] just finished interview [17:23:43] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [30.0] [17:24:52] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0] [17:39:30] ottomata: do you know who created https://hub.docker.com/u/wikimedia/ ? 
[17:39:39] i know you guys were looking into docker which is why i'm asking [17:47:34] Analytics-Backlog: Sanitize pageview_hourly - https://phabricator.wikimedia.org/T114675#1742995 (JAllemandou) a:JAllemandou [17:51:45] PROBLEM - Analytics Cassanda CQL query interface on aqs1003 is CRITICAL: Connection timed out [17:52:10] Analytics-Backlog, Analytics-Cluster, Analytics-Kanban: Add MAPJOIN configuration so in memory mapjoins happen by default in hive [1] - https://phabricator.wikimedia.org/T116202#1743009 (Nuria) NEW a:Nuria [17:52:28] ori: may ask yuvi as well they have done a lot of container stuff for kubernetes trial [17:56:59] ori nm, no i don't [17:57:27] mforns: https://github.com/wikimedia/restbase/blob/master/specs/analytics/v1/pageviews.yaml [17:57:49] RECOVERY - Analytics Cassanda CQL query interface on aqs1003 is OK: TCP OK - 0.000 second response time on port 9042 [18:02:00] joal: ok, what were those storage numbers?! 2 TB for 1/2 month?! [18:02:38] we are the cave [18:02:42] milimetric: --^ [18:02:47] ottomata: around? [18:02:58] i wanna chat about burrow puppet stuff [18:07:36] madhuvishy: yes! [18:07:42] lets chat [18:08:03] okay [18:08:03] so [18:08:17] firstly, we need to figure out how to package burrow [18:08:51] ottomata: https://phabricator.wikimedia.org/T116084 [18:08:56] Analytics, Beta-Cluster-Infrastructure, Services: Set up AQS in Beta - https://phabricator.wikimedia.org/T116206#1743135 (mobrovac) NEW [18:10:27] hehe, yeah! [18:10:37] ottomata: the part I don't know is what would happen when it's packaged and installed - would the burrow script be available somewhere like /usr/bin/burrow? [18:11:36] madhuvishy: i haven't looked at runnig burrow much yet, reading... [18:11:55] madhuvishy: ideally [18:12:02] the burrow script would be somewhere like that, yes [18:12:12] and then a systemd service unit file would configure it to be started as a daemon [18:12:24] ottomata: oh cool, ya that was my next question [18:12:35] madhuvishy: let's ask in #ops, i think moritz is still around, maybe. [18:12:36] because there is no way to specify how to kill it now [18:12:42] or restart [18:12:50] ottomata: okay [18:17:21] Analytics-Backlog, Analytics-Kanban: Improve record size on cassandra storage for pageview API data - https://phabricator.wikimedia.org/T116209#1743188 (Nuria) NEW a:Milimetric [18:24:55] !log Stopped per article loading in cassandra [18:24:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [18:31:12] madhuvishy: aye ok, let's wait for moritz' package, i'm trying to make it work on a VM but not getting very far [18:31:22] ottomata: alright [18:31:33] madhuvishy: for the p uppetization, i think you can assume that you will have a package called golang-burrow or whatever [18:31:36] anyone know about what % of our users have Do Not Track enabled, by chance? [18:31:37] and a service called burrow [18:31:43] ottomata: yeah alright [18:31:46] probably config files in /etc/burrow [18:31:55] cool [18:31:56] so, you can probably go about making config file templates and parameterizing whatever is needed [18:32:05] ottomata: yeah erb sounds like fun :P [18:32:15] ottomata: other question [18:32:22] yes [18:32:26] do we only want to be notified by email? [18:32:44] about? [18:32:45] burrow will post to a http endpoint if we set that up [18:32:48] consumer lag? 
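On the "burrow will post to a http endpoint" option: Burrow also exposes its consumer evaluations over HTTP, so if the team later wants the graphite/icinga route instead of email, a tiny poller could bridge it. A sketch only — the endpoint path and response fields follow Burrow's v2 HTTP API as I understand it and should be treated as assumptions, as should the host, cluster and group names:

```python
import socket
import time

import requests

# All of these values are placeholders / assumptions:
BURROW = "http://localhost:8000"          # burrow's HTTP listener
STATSD = ("statsd.eqiad.wmnet", 8125)
CLUSTER, GROUP = "eqiad", "eventlogging"

def statsd_gauge(metric, value):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(("%s:%d|g" % (metric, value)).encode("utf-8"), STATSD)
    finally:
        sock.close()

while True:
    url = "%s/v2/kafka/%s/consumer/%s/lag" % (BURROW, CLUSTER, GROUP)
    status = requests.get(url).json().get("status", {})
    # Sum per-partition end lag defensively, since the exact field names may differ.
    lag = sum(p.get("end", {}).get("lag", 0) for p in status.get("partitions", []))
    statsd_gauge("kafka.burrow.%s.%s.totallag" % (CLUSTER, GROUP), lag)
    time.sleep(60)
```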
[18:32:49] hm [18:32:53] and email [18:33:06] hm, i guess email is fine for now, not sure what else we owuld have it post to [18:33:09] but i dont know what http endpoint [18:33:11] yeah [18:33:15] ideally we'd get it into icinga [18:33:19] ottomata: yup [18:33:30] but we should run a bot of some kind on a port [18:33:47] that would read the data and send it to graphite/icinga or sth [18:34:22] ja that is possible, sounds a little hacky though. would be nice if we could just have it as an icinga check somehow, dunno [18:34:23] we'll ahve to see [18:34:27] email is fine to start with though [18:34:30] ottomata: okay yeah [18:34:56] i dont know what the smtp server should be though [18:36:00] ottomata: also, where are the hiera config files for our stuff - i don't know how to find them [18:43:20] hieradata [18:43:20] hehe [18:43:32] all over the place [18:46:27] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 46.67% of data above the critical threshold [30.0] [18:48:17] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0] [19:00:11] AndyRussG: did you get an answer to DNT? [19:00:25] nuria: not yet :) [19:01:03] I guess it varies by region, since it'd mainly be IE users, no? [19:01:07] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [30.0] [19:01:38] AndyRussG: answer is we do not know, not sure why would be be IE users... [19:01:54] nuria: 'cause for a while it was enabled by default on IE [19:02:02] At least IE 10 [19:02:15] AndyRussG: you can set it in chrome which is our most popular browser [19:02:26] https://en.wikipedia.org/wiki/Do_Not_Track#Internet_Explorer_10_default_setting_controversy [19:02:54] yeah... But it seems most people won't set it themselves [19:02:54] Analytics-Backlog, Developer-Relations, MediaWiki-API, Reading-Admin, and 5 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1743346 (dr0ptp4kt) [19:02:57] AndyRussG: ah , sorry, but look at browser stats [19:03:06] Yeah [19:03:07] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0] [19:03:25] AndyRussG: IE10 % last time i checked was less than other IE browsers [19:05:22] I was just thinking someone might have a DNT percentage readily available. So far it looks like we may be getting 30% less EL events than we should be for banner history, but that may also be due to some other issues [19:08:53] AndyRussG: have you looked at the EventErrors for validation errors? [19:09:21] madhuvishy: good point! I don't think so [19:11:57] AndyRussG: they are now available here - https://logstash.wikimedia.org/#/dashboard/elasticsearch/eventlogging-errors [19:12:20] madhuvishy: amazing, thanks! [19:12:31] thanks also nuria! [19:14:51] AndyRussG: 30% cannot be dnt, this is a far fetched comparation [19:15:06] AndyRussG: but think that less people will have dnt than addblock [19:15:22] AndyRussG: and add block i think last time i checked was < 10% global [19:15:46] AndyRussG: and with addblock EL will not work either [19:15:55] what if it's 5% most browsers + 90% IE10 or Windows 8.1? [19:16:11] AndyRussG: did you looked at browser stats for IE10? [19:16:29] not yet... 
forgot where I put that link :/ [19:18:13] AndyRussG: https://docs.google.com/a/wikimedia.org/spreadsheets/d/1n9FhSqcBGM9iKXrlHsP0EZI0gU89Rmz5m51uglUGVjs/edit?usp=drive_web [19:18:32] AndyRussG: as i said ie10 is not very popular [19:24:05] nuria: thx... Yeah... Checking it seems it's enabled for IE10 and IE11, but it's still a much smaller percentage [19:24:25] AndyRussG: also for both it can be disabled by user [19:24:41] True [19:28:17] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [30.0] [19:29:39] nuria, can I ask an incredibly dumb question I swear I used to know the answer to? [19:29:49] nuria: madhuvishy: K so our lower-than-expected EL result is surely not due to DNT nor EL validation errors (of which there are just a few, due to just a minor bug in our schema def I see) [19:29:52] Ironholds:of course [19:30:15] Analytics-Backlog: Sanitize pageview_hourly - https://phabricator.wikimedia.org/T114675#1743499 (Tbayer) Question: Does this involve removing values of country data and/or access_method as well, or just the more fine-grained user_agent_map and city? (It's not clear to from the linked Phabricator task and medi... [19:30:38] nuria, when I run mvn compile on refinery-source, where does the built JAR end up ;p [19:31:00] you have to run mvn package [19:31:08] for the jar to be build [19:31:15] under artifacts [19:31:25] ah, that's the problem. Danke! [19:32:09] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0] [19:32:13] AndyRussG: from prior experience when that has happen before [19:33:03] AndyRussG: it was client issues, did you looked whether your messages might be too big (we already mentioned that schema was sending a lot of info instead of events) [19:33:11] AndyRussG: cause too big messages will get dropped [19:34:16] nuria: it could be client issues. Also could be a miscalculation, this was just from the firs dive into the data. We are actually checking on the client that the EL payload isn't too big, though that doesn't mean the check is necessarily perfect [19:35:05] Wouldn't we see EventErrors due to a validation error if the EL event was too big? [19:36:51] AndyRussG: you can look at your schema here and it looks like there rae no clientside errors: https://grafana.wikimedia.org/dashboard/db/eventlogging-schema [19:37:12] AndyRussG: no, you wouldn't , if event is too big it doesn't get sent [19:37:39] AndyRussG: you will see errors reported from client if you are using mediawiki eventlogging code to log events [19:38:14] nuria: aaah OK that's good to know. I thought I saw somewhere that it caused a parsing error. [19:38:31] Sorry what do you mean by the last point? [19:39:08] nuria: BTW Ellery just re-calculated the expected numbers vs. actual and it's much closer than in the previous calculation [19:39:17] AndyRussG: that "too big" events do not get reported as validation errors, they get reported by mediawiki client side javascript code [19:39:28] as "plain" errors [19:39:57] nuria: i.e., just visible in the JS console like other JS errors, correct? [19:40:07] AndyRussG: yes [19:40:17] nuria: cool! [19:40:45] Analytics-Backlog: Sanitize pageview_hourly - https://phabricator.wikimedia.org/T114675#1743534 (Nuria) For now we are concerned just with user agent map. 
[19:42:06] PROBLEM - Analytics Cassanda CQL query interface on aqs1003 is CRITICAL: Connection timed out [19:43:30] (Abandoned) Nuria: [WIP] Pageview Hourly quality check whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/246118 (https://phabricator.wikimedia.org/T110061) (owner: Nuria) [19:47:28] RECOVERY - Analytics Cassanda CQL query interface on aqs1003 is OK: TCP OK - 0.020 second response time on port 9042 [19:48:24] hey [19:48:25] nuria: madhuvishy: K I'll just summarize the above for FR folks and we'll see where we go from here... The dropoff now seems to be about 16% in one case and 5% in another, which is more reasonable in any case [19:48:28] thx again! [19:49:09] In EventLogging, we're providing nulls for an integer type field and getting errors because of it. Is that expected? [19:49:33] Analytics, Beta-Cluster-Infrastructure, Services: Set up AQS in Beta - https://phabricator.wikimedia.org/T116206#1743578 (hashar) [19:50:07] Analytics, Beta-Cluster-Infrastructure, Services, WorkType-NewFunctionality: Set up AQS in Beta - https://phabricator.wikimedia.org/T116206#1743135 (hashar) [19:51:03] hey yall, is AQS flapping because of cassandra loading? [19:53:06] PROBLEM - Analytics Cassanda CQL query interface on aqs1003 is CRITICAL: Connection timed out [19:53:20] milimetric: ^ [19:53:20] ? [19:53:23] i'm getting texted! [19:53:38] gr [19:53:51] Krenair: i think so. null is not an integer [19:54:22] not sure if it's related ottomata, but we found we're using waaay more space on that machine than we thought [19:54:33] I'm currently working on a patch to reduce the size of the data we load [19:54:41] but I've no idea why it's timing out [19:54:49] maybe there's just too much data [19:55:05] you'd think that wouldn't be a problem... [19:55:45] hm, 2.6T used on aqs1001, still 7.4T avail [19:56:04] milimetric: can we disable alerts on AQS for a while? until we know its ready? [19:57:07] oh that is cassandra! [19:57:09] geez [19:58:04] ottomata: yeppp [19:58:16] madhuvishy: ? :) [19:58:24] cassandra and data sizes [19:58:30] we ranted about it all morning [19:59:52] milimetric: madhuvishy: https://gerrit.wikimedia.org/r/#/c/247910/ [20:03:57] RECOVERY - Analytics Cassanda CQL query interface on aqs1003 is OK: TCP OK - 0.998 second response time on port 9042 [20:04:12] ottomata: cool [20:04:50] ottomata: I'm very confused [20:04:55] so nobody was querying it but it died? [20:05:12] how the hell is it going to respond to any kind of load then? [20:05:15] milimetric: its a TCP connection checker [20:05:17] and it can timeout [20:05:27] i dunno! [20:05:37] but, wth, if it can't even respond to a connection checker, how the hell... [20:05:41] ughhhh [20:06:00] :| [20:07:40] (i'm asking services people why this thing's timing out) [20:07:48] is it because of the inserts we're doing? [20:07:58] i mean, i thought that was the whole point, you can insert a lot and it's all good [20:08:08] i thought joseph turned off some inserts [20:08:17] yeah, i thought he turned off the per-article ones [20:08:23] even if he didn't!!! [20:08:26] those are biggest ones no [20:08:38] right, but we need it to be able to insert, wth! [20:08:48] yeah of course [20:09:07] like - this database is great. wait... you need to INSERT into it?!! never mind, gotta find something eles [20:14:20] ottomata: can you specify default values for things in config files in erb? 
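Back on Krenair's null-for-an-integer-field question above: JSON Schema treats null as its own type, so a property declared plain "integer" rejects null, which is the validation error being seen. A minimal illustration with the jsonschema library (the schema here is made up, not a real EventLogging schema):

```python
import jsonschema

# A property typed plain "integer" rejects null: null is its own JSON type.
schema = {
    "type": "object",
    "properties": {
        "revisionId": {"type": "integer"},
    },
}

try:
    jsonschema.validate({"revisionId": None}, schema)
except jsonschema.ValidationError as e:
    print("rejected: %s" % e.message)

# Allowing null explicitly (or omitting the property when there is no value)
# is the usual fix on the schema side:
nullable = {"type": "object",
            "properties": {"revisionId": {"type": ["integer", "null"]}}}
jsonschema.validate({"revisionId": None}, nullable)  # passes
```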
[20:15:59] madhuvishy: hm, yes, although i'd specify them in the puppet manifests, not in the erb template [20:16:12] ottomata: aah [20:16:13] okay [20:19:11] Analytics-Kanban, operations, Monitoring, Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1743691 (Ottomata) Rats, I see some segfaults: Oct 20 03:40:14 cp4014 kernel: [13259932.218753] diamond[10119]: segfault at 7fc4953d4620 ip 00007fc4955de52d sp 00007fffc99561d0... [20:57:47] milimetric, yt? [20:57:53] ype [20:57:57] hi mforns [20:57:58] hey :] [20:58:06] I saw nuria merged the EL change [20:58:07] how long you workin today? [20:58:14] yes [20:58:15] for an hour and a half more [20:58:21] I have a question for milimetric [20:58:29] do you want to backfill? [20:58:36] ok, mforns, so you wanna deploy and start backfilling? [20:58:37] sure [20:58:41] nuria: what's up [20:58:51] ok [20:58:57] (i'll be in the batcave [20:58:58] ) [20:59:03] the whitelist check touches teh same files you just merged for pageview_hourly [20:59:43] milimetric: the check was happening before as part of "mark dataset done" [20:59:59] milimetric: https://gerrit.wikimedia.org/r/#/c/240099/10/oozie/pageview/hourly/workflow.xml [21:00:30] ok [21:00:42] milimetric: i should put it there still right? [21:01:25] sorry so what's this whitelist thing? [21:01:29] nuria: wanna come to the batcave? [21:01:36] milimetric: yes [21:03:33] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [30.0] [21:08:54] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0] [22:05:55] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1744036 (GWicke) We are having a hangout meeting tomorrow (Thursday, 22nd) between 11&12am SF time. Please let us know if you'd like to join. [22:28:08] halfak: yt? [22:28:45] o/ Ori [22:28:47] What's up? [22:29:31] halfak: apart from edits explicitly marked "minor", is there anything like a formal definition of what a small, "wikignome" edit is? [22:29:38] something like a character count [22:30:02] ori, I don't have anything like that, but I am in the middle of working on an edit type classifier. [22:30:16] oh, cool! [22:30:34] It's pretty easy to detect simple mechanical things like "Adds a wikilink" or even "Disambiguates a wikilink" [22:30:50] but it is hard to know if an edit rephrased a sentence or changed its meaning. [22:30:59] right [22:31:22] this is not urgent, so i'll just try to keep up with your work [22:33:28] Maybe you could tell me about the use-case and I could make some proposals. [22:33:30] ori, ^ [22:34:32] halfak: we think performance may correlate with wikignome productivity [22:34:46] we (=performance team) don't have the bandwidth to conduct a full experiment for now, but it's something that we'd like to have instrumented [22:35:07] so i just wanted some simple counter metric in graphite that tracks the rate at which such edits are being made [22:36:32] ori, I'd look for tool-based edits more than general wikignoming. [22:36:37] That we're good at detecting [22:36:54] halfak: how do we detect those? user-agent, bot flag? [22:36:59] https://wikitech.wikimedia.org/wiki/Help:MySQL_queries#Automated_tool_and_bot_edits [22:37:06] We use regexes on the edit comments. 
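A sketch of the approach halfak describes — classify an edit as tool-assisted by matching its comment against per-tool signatures. The patterns below are illustrative stand-ins; the canonical set lives on the wikitech page linked above:

```python
import re

# Illustrative stand-ins for the kind of comment regexes used to spot tool edits.
TOOL_COMMENT_PATTERNS = {
    "huggle":  re.compile(r"\(\[\[WP:HG\|HG\]\]\)", re.IGNORECASE),
    "twinkle": re.compile(r"\[\[WP:TW\|TW\]\]", re.IGNORECASE),
    "awb":     re.compile(r"using \[\[Project:AWB\|AWB\]\]", re.IGNORECASE),
    "popups":  re.compile(r"using \[\[[^\]]*popups", re.IGNORECASE),
}

def tool_for(comment):
    """Return the first tool whose signature matches the edit comment, else None."""
    for tool, pattern in TOOL_COMMENT_PATTERNS.items():
        if pattern.search(comment):
            return tool
    return None

print(tool_for("Reverted edits by Example ([[WP:HG|HG]])"))                             # huggle
print(tool_for("clean up, typos fixed: recieve -> receive using [[Project:AWB|AWB]]"))  # awb
```

Counting the matched edits per tool over time would then give the kind of "wikignome productivity" counter ori is after, which could be pushed to graphite with the statsd pattern shown earlier in the log.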
[22:37:16] It's not brilliant, but it is effective [22:37:34] That query is about as inefficient as they come. [22:37:50] But we can use the regexes and build a better one. [22:39:19] nod [22:39:24] thanks, that's very helpful! [22:39:46] I'll see if I can use the labeled data we have about edit types to build some good proxies too. In case I'm wrong. [22:39:51] We should be publishing that data anyway. [22:43:54] good night a-team! [22:44:55] Analytics-Backlog, operations, Patch-For-Review: erbium (logging) - useradd: group '30001' does not exist - https://phabricator.wikimedia.org/T115943#1744196 (chasemp) Open>Resolved This has been resolved with an exception allowing the file_mover UID [22:45:11] good night mforns! [22:51:30] Analytics-Kanban: EventLogging mysql consumer can be killed by a bad event - https://phabricator.wikimedia.org/T116241#1744214 (Milimetric) NEW a:Milimetric [22:52:07] Analytics-Kanban, Patch-For-Review, WMF-deploy-2015-10-27_(1.27.0-wmf.4): Incident: EventLogging mysql consumer stopped consuming from kafka {oryx} - https://phabricator.wikimedia.org/T115667#1729697 (Milimetric) [22:52:08] Analytics-Kanban: EventLogging mysql consumer can be killed by a bad event - https://phabricator.wikimedia.org/T116241#1744214 (Milimetric) [22:52:19] blocked on backfilling, details in those two updates above ^ [22:52:27] abandoning for tonight [22:53:14] !log deployed EventLogging and tried to backfill data lost on 2015.10.14 but failed [22:53:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [22:53:45] nite all [23:01:30] anybody here could answer a couple of questions abut spark & analytics cluster? [23:12:33] Analytics: Update reportcard.wmflabs.org with September data - https://phabricator.wikimedia.org/T116244#1744291 (Tbayer) NEW [23:18:09] Analytics-Wikistats: Discrepancies in historical total active editor numbers - https://phabricator.wikimedia.org/T87738#1744306 (Tbayer) The historical numbers for the same month continue to jump up and down merrily, by very implausible amounts, e.g.: | |given [[https://web.archive.org/web/20151021210417/ht... 
[23:21:17] Analytics, Services, operations: Set up LVS for AQS - https://phabricator.wikimedia.org/T116245#1744311 (mobrovac) NEW [23:22:42] Analytics, Services, operations: Set up LVS for AQS - https://phabricator.wikimedia.org/T116245#1744321 (mobrovac) [23:22:45] Analytics-Kanban, RESTBase, Services, Patch-For-Review: configure RESTBase pageview proxy to Analytics' cluster {slug} [3 pts] - https://phabricator.wikimedia.org/T114830#1744320 (mobrovac) [23:24:34] Analytics-Kanban, RESTBase, Services, Patch-For-Review, RESTBase-API: configure RESTBase pageview proxy to Analytics' cluster {slug} [3 pts] - https://phabricator.wikimedia.org/T114830#1707320 (mobrovac) [23:46:08] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1744407 (GWicke) NEW [23:46:17] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1744407 (GWicke) p:Normal>High [23:48:12] Analytics-Kanban, operations, Monitoring, Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1744422 (chasemp) @ottomata ping me tomorrow if you can, I'd like to give helping a whirl [23:49:54] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1744425 (GWicke) [23:50:51] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1744407 (GWicke) [23:52:17] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1744440 (GWicke) [23:52:44] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#1744446 (GWicke) [23:53:36] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#933968 (GWicke)