[03:00:09] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 20.00% of data above the critical threshold [30.0]
[03:02:01] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK Less than 15.00% above the threshold [20.0]
[07:01:01] (CR) Ricordisamoa: "I'm curious to see how this relates to https://github.com/valhallasw/flask-mwoauth" [analytics/quarry/web] - https://gerrit.wikimedia.org/r/222841 (owner: Yuvipanda)
[11:53:31] Analytics-Tech-community-metrics, Engineering-Community, ECT-July-2015: Check whether it is true that we have lost 40% of code contributors in the past 12 months - https://phabricator.wikimedia.org/T103292#1429316 (Aklapper) >>! In T103292#1421157, @Dicortazar wrote: > Data in the SCR overview page wer...
[12:01:34] Analytics-Tech-community-metrics, Engineering-Community, ECT-July-2015: Automated generation of repositories for Korma - https://phabricator.wikimedia.org/T104845#1429335 (Dicortazar) NEW
[12:02:45] Analytics-Tech-community-metrics, Engineering-Community, ECT-July-2015: Check whether it is true that we have lost 40% of code contributors in the past 12 months - https://phabricator.wikimedia.org/T103292#1429347 (Dicortazar) @Aklapper, I've created {T104845}. I've CC'ed you and @Qgil. Please, feel fr...
[12:03:13] Analytics-Tech-community-metrics, Engineering-Community, ECT-July-2015: Automated generation of repositories for Korma - https://phabricator.wikimedia.org/T104845#1429335 (Dicortazar)
[12:03:16] Analytics-Tech-community-metrics, Engineering-Community, ECT-July-2015: Check whether it is true that we have lost 40% of code contributors in the past 12 months - https://phabricator.wikimedia.org/T103292#1429349 (Dicortazar)
[12:16:33] Analytics-Tech-community-metrics, Engineering-Community, ECT-July-2015: Check whether it is true that we have lost 40% of code contributors in the past 12 months - https://phabricator.wikimedia.org/T103292#1429392 (Qgil)
[12:19:38] Analytics-Tech-community-metrics, Engineering-Community, ECT-July-2015: Automated generation of repositories for Korma - https://phabricator.wikimedia.org/T104845#1429404 (Qgil) > Missing Git repositories in the list of Gerrit projects (some Git repos are out of the current review process) I would rem...
[13:12:17] Good morning ottomata!
[13:12:42] Tell me when your coffee level is up enough ;)
[13:14:26] :) i am about to look at the projectview thing!
[13:14:45] joal: I am here and revving up
[13:16:37] ok, just changed perms, i don't think i can give you perms... it is a gerrit thing
[13:16:45] ottomata: ok, np
[13:16:48] i will run the script manually as stats now
[13:17:03] I think it would be enough to just push the commit
[13:17:16] oh
[13:17:30] Get into /a/aggregator/projectview/data and git push origin master
[13:17:45] Should be enough
[13:17:50] yup!
[13:17:51] as stats of course
[13:17:52] that pushed
[13:17:57] awesome
[13:18:45] I'll check tomorrow for new data in vitalsign, and for the last push in aggregator
[13:18:48] ok cool
[13:18:49] Thx a lot :)
[13:18:58] how often does the vitalsign pull happen?
[13:19:02] should we do that part manually now too?
[13:19:05] I can't remember
[13:19:09] :)
[13:19:13] Would be good yeah
[13:19:17] me neither, maybe dan can do that for us later
[13:20:26] yup
[13:20:44] The puppet code says: ensure => 'latest'
[13:21:00] Does that mean that each puppet run will check for new commits?
[13:24:48] yes
[13:31:37] ottomata: then it should work by itself!
[13:31:49] indeed :)
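The manual fix above boils down to pushing the already-committed projectview data as the stats user; here is a minimal sketch of that step, using only the path, user, and git command mentioned in the conversation (the sudo invocation and Python wrapper are illustrative, not the actual tooling):

```python
# Minimal sketch of the manual push described above: run `git push origin master`
# from the aggregator data directory as the stats user. The path and user come
# from the conversation; everything else here is illustrative.
import subprocess

DATA_DIR = "/a/aggregator/projectview/data"

def push_projectview_data():
    """Push the locally committed projectview data to the remote master branch."""
    subprocess.run(
        ["sudo", "-u", "stats", "git", "push", "origin", "master"],
        cwd=DATA_DIR,
        check=True,  # raise CalledProcessError if the push fails
    )

if __name__ == "__main__":
    push_projectview_data()
```

With ensure => 'latest' on the consuming side, each subsequent puppet run should then pick up the newly pushed commits, which is why no manual pull is needed, as confirmed in the exchange above.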
[13:33:25] Analytics-Cluster, operations, ops-eqiad: analytics1020 down - https://phabricator.wikimedia.org/T104856#1429635 (Ottomata) NEW a:Cmjohnson
[13:33:56] Analytics-Cluster, operations, ops-eqiad: analytics1020 down - https://phabricator.wikimedia.org/T104856#1429646 (Ottomata) Related: T95263
[13:49:06] ottomata: do you remember if puppet agent is disabled on wikimetrics?
[13:49:22] cause the change has not happened in the data yet
[13:49:24] joal: i do not
[13:49:28] k
[13:49:39] I'll wait for milimetric :)
[14:16:30] (PS1) Mforns: Add agrgegation across projects [analytics/aggregator] - https://gerrit.wikimedia.org/r/223031 (https://phabricator.wikimedia.org/T95339)
[14:33:58] Good morning halfak :)
[14:34:05] o/ joal
[14:34:13] Has your diff job finally finished?
[14:36:57] * halfak checks
[14:37:29] Looks like we are still running.
[14:37:54] 805/2623 maps completed.
[14:38:02] Hmm... I thought we went faster last time.
[14:38:38] :(
[14:38:58] compression differences?
[14:39:55] Could be. You'd think snappy would beat BZ2
[14:41:08] decompression-time-wise, correct, but split-wise (and therefore number-of-maps-wise), not at all
[14:41:11] Quarry: Unicode in query results in strange behavior - https://phabricator.wikimedia.org/T71224#1429869 (Halfak) Open>Resolved
[14:41:19] Quarry: Unicode in query results in strange behavior - https://phabricator.wikimedia.org/T71224#742694 (Halfak) Yup. Works for me now.
[14:42:56] joal, interesting. I thought that snappy was splittable.
[14:43:40] sorry, bad explanation: very splittable for sure, but also very much bigger (3 to 5 times) than bz2
[14:44:40] Gotcha. That's a shame. I suppose we're also keeping whole pages.
[14:44:56] Before, we split more evenly.
[14:45:19] I could try recompressing at this rate though.
[14:55:09] halfak: I could also provide a parameterized compression scheme for the json conversion
[14:55:32] But actually, the number of maps would not change
[14:55:53] This number comes from the number of splits processed in the json conversion
[14:56:22] joal, I suppose that a major performance benefit from our last run could be the within-page splitting.
[14:58:06] Not really applicable here I think, since rev extraction needs page info, therefore full-page treatment
[14:58:35] But what I wonder is why only 2500 maps, when the files should be split
[15:02:09] joal, well, it seems that each mapper gets a whole file which corresponds to whole pages.
[15:02:13] This was not true before.
[15:03:23] hm, that was not expected
[15:09:25] halfak: how did you split your result files when generating the json using python?
[15:34:33] joal, I kept the input filenames intact and replaced "xml" with "json". So I had 172 input XML files and got 172 output json bz2 files
[15:37:09] halfak: and you got more than 172 mappers?
[15:47:39] Yes joal. Let's see if I have it in my notes.
[15:47:50] hmmmm
[15:48:03] Nope.
[15:48:07] Didn't record that stat
[15:48:23] Analytics-Kanban, MediaWiki-extensions-ExtensionDistributor, Patch-For-Review: Set up graphs and dumps for ExtensionDistributor download statistics {frog} [3 pts] - https://phabricator.wikimedia.org/T101194#1430178 (Milimetric) If you look at the first column on our main board: https://phabricator.wiki...
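For reference, the output naming halfak describes above (172 XML inputs, 172 JSON bz2 outputs with matching names) is just a one-to-one rename; a minimal sketch of that convention, with hypothetical dump paths:

```python
# Minimal sketch of the naming convention described above: keep each input
# filename and swap "xml" for "json", so 172 XML inputs yield 172 JSON
# (bz2-compressed) outputs. The dump paths here are hypothetical.
import os

def json_output_name(xml_path):
    """Map e.g. 'part-001.xml.bz2' -> 'part-001.json.bz2'."""
    dirname, basename = os.path.split(xml_path)
    return os.path.join(dirname, basename.replace(".xml", ".json"))

inputs = ["dumps/enwiki-part-{:03d}.xml.bz2".format(i) for i in range(1, 173)]
outputs = [json_output_name(p) for p in inputs]
assert len(outputs) == len(set(outputs)) == 172  # one distinct output per input
```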
[16:00:39] halfak: I don't know streaming well enough, and didn't find in the docs or via google whether streaming processes entire files in mappers or splits them
[16:01:32] halfak: Since it uses a FileInputFormat specifying whether files can be split or not, I suspect it can split them, but then, why not in that specific snappy case :(
[16:01:52] joal, it splits them.
[16:02:15] I think the snappy files outnumber the potential mappers.
[16:02:25] ?
[16:03:05] I am not sure I follow you on this one halfak
[16:04:01] I suspect that when the files outnumber the potential mappers, files don't get split.
[16:05:35] halfak: I'll double-check on that, but I don't think it works this way
[16:05:45] halfak: in a meeting right now, but after
[16:11:12] joal, same!
[16:13:59] Analytics, Analytics-Backlog: Reportupdater: put history and pid files inside the project folder [5 pts] {lamb} - https://phabricator.wikimedia.org/T103385#1430311 (Milimetric)
[16:14:01] Analytics-Kanban, Reading-Web: Cron on stat1003 for mobile data is causing an avalanche of queries on dbstore1002 - https://phabricator.wikimedia.org/T103798#1430310 (Milimetric)
[16:14:03] Analytics-Backlog: Clean up mobile-reportcard dashboards {frog} - https://phabricator.wikimedia.org/T104379#1430309 (Milimetric)
[16:14:14] Analytics-Kanban: Spike: gather requirements to implement unique tokens {bull} - https://phabricator.wikimedia.org/T101784#1430323 (kevinator) meeting notes and very very draft doc are here: https://office.wikimedia.org/wiki/Analytics/Unique_Tokens doc is not ready for public scrutiny yet... we'll create mor...
[16:14:21] Analytics-Kanban: Spike: gather requirements to implement unique tokens {bull} - https://phabricator.wikimedia.org/T101784#1430325 (kevinator) Open>Resolved
[16:17:43] Analytics-Backlog: Clean up mobile-reportcard dashboards {frog} - https://phabricator.wikimedia.org/T104379#1430361 (Milimetric)
[16:19:58] Analytics-Backlog: Clean up mobile-reportcard dashboards {frog} [13 pts] - https://phabricator.wikimedia.org/T104379#1430386 (Milimetric)
[16:22:38] Analytics-Backlog, Analytics-EventLogging: Load test parallel eventlogging-processor {stag} - https://phabricator.wikimedia.org/T104229#1430415 (Milimetric)
[16:26:37] Analytics-Backlog, Analytics-EventLogging: Load test parallel eventlogging-processor {stag} [3 pts] - https://phabricator.wikimedia.org/T104229#1430458 (Milimetric)
[16:28:24] Analytics-Backlog: Sanitize aggregated data presented in VitalSign using K-Anonymity - https://phabricator.wikimedia.org/T104485#1430467 (Milimetric)
[16:42:35] Analytics-Backlog: Sanitize aggregated data presented in VitalSign using K-Anonymity {musk} [8 pts] - https://phabricator.wikimedia.org/T104485#1430567 (Milimetric)
[17:15:15] Analytics-Backlog: Enforce policy for each schema: Sanitize {tick} - https://phabricator.wikimedia.org/T104877#1430750 (Milimetric) NEW
[17:21:35] Analytics-Backlog: Enforce policy for each schema: Sanitize {tick} [8 pts] - https://phabricator.wikimedia.org/T104877#1430795 (Milimetric)
[17:22:52] Analytics-Backlog, Analytics-Dashiki: Improve the edit analysis dashboard {lion} - https://phabricator.wikimedia.org/T104261#1430801 (Milimetric)
[17:41:08] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 20.00% of data above the critical threshold [30.0]
[17:43:00] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK Less than 15.00% above the threshold [20.0]
[17:48:23] milimetric: quick brain bounce?
[17:49:09] mforns: you might be able to help me too :)
[17:49:10] ottomata: finishing meeting
[17:49:16] ah k
[17:49:19] k lemme know
[17:49:46] ottomata, milimetric, I need to go in 11 mins, gym. is this urgent?
[17:49:51] nope
[17:50:05] go!
[17:50:06] :)
[18:00:35] Analytics-Backlog, Wikimania-Hackathon-2015: Dockerize Hadoop Cluster, Druid, and Samza + Load Test - https://phabricator.wikimedia.org/T102980#1431042 (Milimetric)
[18:02:29] hey ottomata, cave?
[18:02:42] k
[18:02:45] coming
[18:03:05] halfak: I don't get why the files are not split by hadoop :(
[18:03:07] hmm, on phone internet
[18:03:17] milimetric: i am super close to home, walking, be there in 5
[18:03:23] I'll have a more detailed look tomorrow, but it doesn't make sense to me
[18:03:28] np
[18:03:30] i'm around
[18:03:34] I'm off for today lads
[18:03:38] kk thanks joal
[18:03:40] See you tomorrow!
[18:03:43] o/
[18:03:45] halfak: np ;
[18:36:55] since http://pentaho.wmflabs.org/ is down, what's currently the best place to get new def pageview data?
[19:58:29] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 20.00% of data above the critical threshold [30.0]
[20:00:39] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK Less than 15.00% above the threshold [20.0]
[20:10:48] Analytics-Backlog, Deployment-Systems, Performance-Team, operations, Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1431847 (Krinkle)
[20:14:05] Analytics-Backlog, Deployment-Systems, Performance-Team, operations, Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1431863 (Krinkle) Presumably by adding a tail subscriber to the varnish stream. Basically we'd collect...
[20:17:12] mforns: you still around?
[20:17:17] milimetric, yea
[20:17:20] something's really weird with these 2-minute EL events
[20:17:37] milimetric, ok, gonna look into it
[20:17:38] ottomata: any clue about that?
[20:17:48] I don't know what we'd look into...
[20:17:55] mmm
[20:17:56] the raw rate drops
[20:18:10] so the alerts get their averages messed up
[20:18:20] milimetric, are you talking about these exact last 2 minutes?
[20:18:26] the validation keeps up fine, so it looks like something on the varnish side
[20:18:43] there have been a dozen or so events like this since Friday, when I first saw it
[20:20:08] Analytics, Analytics-Kanban: Reportupdater: put history and pid files inside the project folder [5 pts] {lamb} - https://phabricator.wikimedia.org/T103385#1431870 (mforns) a:mforns
[20:20:32] i'm looking at graphs
[20:20:42] i don't see a drop in raw rate (maybe i'm looking wrong)
[20:21:22] am looking here
[20:21:23] http://grafana.wikimedia.org/#/dashboard/db/eventlogging
[20:21:24] :)
[20:23:21] also, the alert is specifically about the raw/valid proportion
[20:23:25] you don't get spikes of invalid events?
[20:23:30] ottomata, I don't see any drop, I'm looking in graphite>eventlogging>overall>raw>sum
[20:24:41] I see some differences between raw.sum and valid.sum, concentrated in short spikes
[20:27:27] Analytics-Cluster, operations, ops-eqiad: analytics1020 down - https://phabricator.wikimedia.org/T104856#1431916 (Cmjohnson) Open>Resolved Fixed. Idrac license was missing
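The raw and valid series being compared above can be pulled directly from Graphite's render API for a closer look at the drops; a minimal sketch follows, assuming the eventlogging.overall.raw.rate and eventlogging.overall.valid.rate metric paths, which are inferred from the graphite paths mentioned in the chat rather than confirmed:

```python
# Minimal sketch of fetching the raw and valid rate series from Graphite's
# render API to inspect the drops discussed above. The metric paths are
# assumptions based on the graphite paths mentioned in the chat.
import requests

GRAPHITE_RENDER = "http://graphite.wikimedia.org/render"

def fetch_series(target, window="-6hours"):
    """Return the (timestamp, value) datapoints for one Graphite target."""
    resp = requests.get(
        GRAPHITE_RENDER,
        params={"target": target, "from": window, "format": "json"},
    )
    resp.raise_for_status()
    datapoints = resp.json()[0]["datapoints"]  # list of [value, timestamp] pairs
    return [(ts, val) for val, ts in datapoints if val is not None]

raw = fetch_series("eventlogging.overall.raw.rate")
valid = fetch_series("eventlogging.overall.valid.rate")
```

Lining the two series up by timestamp makes it easy to see whether raw and valid drop together (pointing at varnishncsa or the forwarder) or only valid dips (pointing at validation).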
[20:36:58] mforns / ottomata: sorry it took me so long, graphite's UI is a little confusing
[20:37:11] You can see these violent downspikes here: http://graphite.wikimedia.org/dashboard/#eventlogging.raw-vs-valid
[20:37:32] if you zoom out and look at the data from last Thursday to today you can see there's a pattern
[20:37:39] they seem to happen between 18:00 and 19:00 UTC
[20:37:44] at least in a couple of instances
[20:37:49] and then there are a bunch of random ones
[20:38:19] milimetric, aha
[20:38:33] milimetric, what exactly are raw.rate and valid.rate?
[20:39:01] if raw drops, that means that events inserted into the raw tcp queue drop
[20:39:02] that's what we've always looked at, I think it's events per second
[20:39:09] valid would drop if raw drops too
[20:39:17] since raw is the source of valid events
[20:39:22] ottomata: that graph shows both dropping
[20:39:24] yes
[20:39:32] raw comes originally from varnishncsa via udp
[20:39:44] so a drop could be caused by varnishncsa restarting everywhere
[20:39:49] right, so either there's a problem with that or with the reporter...
[20:39:50] milimetric, aha, makes sense, events per second
[20:39:53] or, it also comes from the forwarder
[20:40:00] so it could be caused by the forwarder restarting
[20:40:13] puppet runs... maybe?
[20:40:14] or ja, the reporter too i guess, could be restarted and cause that
[20:40:16] maybe?
[20:40:34] i think puppet tries to restart eventlogging on config file changes... not sure though
[20:43:09] ottomata, milimetric, I don't think it's related to puppet, otherwise it would also happen during low traffic, but it happens only during high traffic
[20:44:29] maybe, unless there is some weird puppet thing that is making it happen periodically
[20:44:34] but i also doubt puppet
[20:44:59] I recall having spotted some particular invalid events during those validation spikes in the past: very long events containing special characters that, once URI-encoded, became a very long log line that got truncated by varnishncsa and then failed validation because of the truncation
[20:45:13] will try to look for those again
[20:46:14] hm... these are massive drops that make me suspect a more systemic problem like varnish restarts
[20:46:45] also, the raw rate wouldn't drop if it was a truncated event problem
[21:58:48] Analytics, Pywikibot-compat-to-core: Measure current usage of pywikibot-compat vs pywikibot-core - https://phabricator.wikimedia.org/T99373#1432298 (Aklapper) > With the July 1 API breakage about to happen, This has been [[ https://www.mediawiki.org/wiki/Phabricator/Project_management#Setting_task_prior...
[22:03:04] Analytics, Analytics-Backlog, Performance-Team, Patch-For-Review: Collect HTTP statistics about load.php requests - https://phabricator.wikimedia.org/T104277#1432304 (Krinkle) Resolved>Open We ran into some issues with the metrics. Re-opening as reminder to investigate and address.
[22:17:41] bye, team, see ya tomorrow!
[22:57:32] byaaa
[22:57:35] heya milimetric,
[22:57:46] this is ready for preliminary review.
[22:57:46] https://gerrit.wikimedia.org/r/#/c/222064/
[22:57:59] something is broken with it though. somehow, between that patch and a previous one, some of the kafka writer stuff isn't working
[22:58:14] and i didn't catch that while developing it because i wasn't testing the multiprocessing stuff with kafka until I got this far
[22:58:19] soooo, hm. will have to figure that out
[22:58:28] but, the overall structure is ready for review
[23:38:07] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 20.00% of data above the critical threshold [30.0]
[23:38:58] i still don't know how to respond to those alerts :\
[23:39:07] if someone wants to clue me in that would be great
[23:39:34] especially since they happen nearly every day
[23:39:58] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK Less than 15.00% above the threshold [20.0]
[23:40:16] alternatively, if you think the thresholds should be adjusted to avoid false positives, please tell me that
[23:40:33] (i get paged for these)
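For whoever gets paged: reading the bot messages above, the check appears to go CRITICAL when at least 20% of recent datapoints of the raw-vs-valid difference metric exceed 30.0, and to recover once fewer than 15% exceed 20.0. A minimal sketch of how those thresholds combine follows, assuming that reading is right; it is an illustration, not the actual Icinga/Graphite check:

```python
# Minimal sketch of the alert logic as described by the bot messages above:
# CRITICAL when >= 20% of recent datapoints exceed 30.0, OK when < 15% exceed
# 20.0. Illustration only; not the actual Icinga/Graphite check.
def classify(datapoints,
             warn_threshold=20.0, warn_fraction=0.15,
             crit_threshold=30.0, crit_fraction=0.20):
    """Return 'CRITICAL', 'WARNING' or 'OK' for a list of metric values."""
    if not datapoints:
        return "OK"  # assumption: the real check likely treats missing data differently
    above_crit = sum(1 for v in datapoints if v > crit_threshold) / len(datapoints)
    above_warn = sum(1 for v in datapoints if v > warn_threshold) / len(datapoints)
    if above_crit >= crit_fraction:
        return "CRITICAL"
    if above_warn >= warn_fraction:
        return "WARNING"
    return "OK"

# Example: one spike out of ten datapoints is not enough to page.
print(classify([5, 6, 4, 35, 5, 6, 5, 4, 6, 5]))  # -> 'OK' (only 10% above 30.0)
```

If that reading is right, raising the required fraction or requiring the condition to hold for more than one check interval would cut down the two-minute PROBLEM/RECOVERY flaps without hiding sustained validation problems.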