[03:00:09] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 20.00% of data above the critical threshold [30.0]
[03:02:01] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK Less than 15.00% above the threshold [20.0]
[07:01:01] (CR) Ricordisamoa: "I'm curious to see how this relates to https://github.com/valhallasw/flask-mwoauth" [analytics/quarry/web] - https://gerrit.wikimedia.org/r/222841 (owner: Yuvipanda)
[11:53:31] Analytics-Tech-community-metrics, Engineering-Community, ECT-July-2015: Check whether it is true that we have lost 40% of code contributors in the past 12 months - https://phabricator.wikimedia.org/T103292#1429316 (Aklapper) >>! In T103292#1421157, @Dicortazar wrote: > Data in the SCR overview page wer...
[12:01:34] Analytics-Tech-community-metrics, Engineering-Community, ECT-July-2015: Automated generation of repositories for Korma - https://phabricator.wikimedia.org/T104845#1429335 (Dicortazar) NEW
[12:02:45] Analytics-Tech-community-metrics, Engineering-Community, ECT-July-2015: Check whether it is true that we have lost 40% of code contributors in the past 12 months - https://phabricator.wikimedia.org/T103292#1429347 (Dicortazar) @Aklapper, I've created {T104845}. I've CC'ed you and @Qgil. Please, feel fr...
[12:03:13] Analytics-Tech-community-metrics, Engineering-Community, ECT-July-2015: Automated generation of repositories for Korma - https://phabricator.wikimedia.org/T104845#1429335 (Dicortazar)
[12:03:16] Analytics-Tech-community-metrics, Engineering-Community, ECT-July-2015: Check whether it is true that we have lost 40% of code contributors in the past 12 months - https://phabricator.wikimedia.org/T103292#1429349 (Dicortazar)
[12:16:33] Analytics-Tech-community-metrics, Engineering-Community, ECT-July-2015: Check whether it is true that we have lost 40% of code contributors in the past 12 months - https://phabricator.wikimedia.org/T103292#1429392 (Qgil)
[12:19:38] Analytics-Tech-community-metrics, Engineering-Community, ECT-July-2015: Automated generation of repositories for Korma - https://phabricator.wikimedia.org/T104845#1429404 (Qgil) > Missing Git repositories in the list of Gerrit projects (some Git repos are out of the current review process) I would rem...
[13:12:17] Good morning ottomata!
[13:12:42] Tell me when your coffee level is up enough ;)
[13:14:26] :) i am about to look at the projectview thing!
[13:14:45] joal: I am here and revving up
[13:16:37] ok, just changed perms, i don't think i can give you perms... it is a gerrit thing
[13:16:45] ottomata: ok, np
[13:16:48] i will run the script manually as stats now
[13:17:03] I think it would be enough to just push the commit
[13:17:16] oh
[13:17:30] Get into /a/aggregator/projectview/data and git push origin master
[13:17:45] Should be enough
[13:17:50] yup!
[13:17:51] as stats of course
[13:17:52] that pushed
[13:17:57] awesome
[13:18:45] I'll check tomorrow for new data in vitalsign, and for the last push in aggregator
[13:18:48] ok cool
[13:18:49] Thx a lot :)
[13:18:58] how often does the vitalsign pull happen?
[13:19:02] should we do that part manually now too?
[13:19:05] I can't remember
[13:19:09] :)
[13:19:13] Would be good yeah
[13:19:17] me neither, maybe dan can do that for us later
[13:20:26] yup
[13:20:44] The puppet code says: ensure => 'latest'
[13:21:00] Does that mean that each puppet run will check for new commits?
[13:24:48] yes
[13:31:37] ottomata: then it should work by itself!
[13:31:49] indeed :)
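The manual fix above boils down to pushing the already-committed projectview data as the stats user; here is a minimal sketch of that step, using only the path, user, and git command mentioned in the conversation (the sudo invocation and Python wrapper are illustrative, not the actual tooling):

```python
# Minimal sketch of the manual push described above: run `git push origin master`
# from the aggregator data directory as the stats user. The path and user come
# from the conversation; everything else here is illustrative.
import subprocess

DATA_DIR = "/a/aggregator/projectview/data"

def push_projectview_data():
    """Push the locally committed projectview data to the remote master branch."""
    subprocess.run(
        ["sudo", "-u", "stats", "git", "push", "origin", "master"],
        cwd=DATA_DIR,
        check=True,  # raise CalledProcessError if the push fails
    )

if __name__ == "__main__":
    push_projectview_data()
```

With ensure => 'latest' on the consuming side, each subsequent puppet run should then pick up the newly pushed commits, which is why no manual pull is needed, as confirmed in the exchange above.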
[13:33:25] Analytics-Cluster, operations, ops-eqiad: analytics1020 down - https://phabricator.wikimedia.org/T104856#1429635 (Ottomata) NEW a:Cmjohnson
[13:33:56] Analytics-Cluster, operations, ops-eqiad: analytics1020 down - https://phabricator.wikimedia.org/T104856#1429646 (Ottomata) Related: T95263
[13:49:06] ottomata: do you remember if puppet agent is disabled on wikimetrics?
[13:49:22] cause the change has not happened in the data yet
[13:49:24] joal: i do not
[13:49:28] k
[13:49:39] I'll wait for milimetric :)
[14:16:30] (PS1) Mforns: Add agrgegation across projects [analytics/aggregator] - https://gerrit.wikimedia.org/r/223031 (https://phabricator.wikimedia.org/T95339)
[14:33:58] Good morning halfak :)
[14:34:05] o/ joal
[14:34:13] Has your diff job finally finished?
[14:36:57] * halfak checks
[14:37:29] Looks like we are still running.
[14:37:54] 805/2623 maps completed.
[14:38:02] Hmm... I thought we went faster last time.
[14:38:38] :(
[14:38:58] compression differences?
[14:39:55] Could be. You'd think snappy would beat BZ2
[14:41:08] decompression-time-wise, correct, but split-wise (and therefore number-of-maps-wise), not at all
[14:41:11] Quarry: Unicode in query results in strange behavior - https://phabricator.wikimedia.org/T71224#1429869 (Halfak) Open>Resolved
[14:41:19] Quarry: Unicode in query results in strange behavior - https://phabricator.wikimedia.org/T71224#742694 (Halfak) Yup. Works for me now.
[14:42:56] joal, interesting. I thought that snappy was splittable.
[14:43:40] sorry, bad explanation: very splittable for sure, but also very much bigger (3 to 5 times) than bz2
[14:44:40] Gotcha. That's a shame. I suppose we're also keeping whole pages.
[14:44:56] Before, we split more evenly.
[14:45:19] I could try recompressing at this rate though.
[14:55:09] halfak: I could also provide a parameterized compression scheme for the json conversion
[14:55:32] But actually, the number of maps would not change
[14:55:53] This number comes from the number of splits processed in the json conversion
[14:56:22] joal, I suppose that a major performance benefit from our last run could be the within-page splitting.
[14:58:06] Not really applicable here I think, since rev extraction needs page info, therefore full-page treatment
[14:58:35] But what I wonder is why only 2500 maps, when the files should be split
[15:02:09] joal, well, it seems that each mapper gets a whole file which corresponds to whole pages.
[15:02:13] This was not true before.
[15:03:23] hm, that was not expected
[15:09:25] halfak: how did you split your result files when generating the json using python?
[15:34:33] joal, I kept the input filenames intact and replaced "xml" with "json". So I had 172 input XML files and got 172 output json bz2 files
[15:37:09] halfak: and you got more than 172 mappers?
[15:47:39] Yes joal. Let's see if I have it in my notes.
[15:47:50] hmmmm
[15:48:03] Nope.
[15:48:07] Didn't record that stat
[15:48:23] Analytics-Kanban, MediaWiki-extensions-ExtensionDistributor, Patch-For-Review: Set up graphs and dumps for ExtensionDistributor download statistics {frog} [3 pts] - https://phabricator.wikimedia.org/T101194#1430178 (Milimetric) If you look at the first column on our main board: https://phabricator.wiki...
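For reference, the output naming halfak describes above (172 XML inputs, 172 JSON bz2 outputs with matching names) is just a one-to-one rename; a minimal sketch of that convention, with hypothetical dump paths:

```python
# Minimal sketch of the naming convention described above: keep each input
# filename and swap "xml" for "json", so 172 XML inputs yield 172 JSON
# (bz2-compressed) outputs. The dump paths here are hypothetical.
import os

def json_output_name(xml_path):
    """Map e.g. 'part-001.xml.bz2' -> 'part-001.json.bz2'."""
    dirname, basename = os.path.split(xml_path)
    return os.path.join(dirname, basename.replace(".xml", ".json"))

inputs = ["dumps/enwiki-part-{:03d}.xml.bz2".format(i) for i in range(1, 173)]
outputs = [json_output_name(p) for p in inputs]
assert len(outputs) == len(set(outputs)) == 172  # one distinct output per input
```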
[16:00:39] halfak: I don't know streaming well enough, and didn't find in the docs or via google whether streaming processes entire files in mappers or splits them
[16:01:32] halfak: Since it uses a FileInputFormat specifying whether files can be split or not, I suspect it can split them, but then, why not in that specific snappy case :(
[16:01:52] joal, it splits them.
[16:02:15] I think the snappy files outnumber the potential mappers.
[16:02:25] ?
[16:03:05] I am not sure I follow you on this one halfak
[16:04:01] I suspect that when the files outnumber the potential mappers, files don't get split.
[16:05:35] halfak: I'll double-check on that, but I don't think it works this way
[16:05:45] halfak: in a meeting right now, but after
[16:11:12] joal, same!
[16:13:59] Analytics, Analytics-Backlog: Reportupdater: put history and pid files inside the project folder [5 pts] {lamb} - https://phabricator.wikimedia.org/T103385#1430311 (Milimetric)
[16:14:01] Analytics-Kanban, Reading-Web: Cron on stat1003 for mobile data is causing an avalanche of queries on dbstore1002 - https://phabricator.wikimedia.org/T103798#1430310 (Milimetric)
[16:14:03] Analytics-Backlog: Clean up mobile-reportcard dashboards {frog} - https://phabricator.wikimedia.org/T104379#1430309 (Milimetric)
[16:14:14] Analytics-Kanban: Spike: gather requirements to implement unique tokens {bull} - https://phabricator.wikimedia.org/T101784#1430323 (kevinator) meeting notes and very very draft doc are here: https://office.wikimedia.org/wiki/Analytics/Unique_Tokens doc is not ready for public scrutiny yet... we'll create mor...
[16:14:21] Analytics-Kanban: Spike: gather requirements to implement unique tokens {bull} - https://phabricator.wikimedia.org/T101784#1430325 (kevinator) Open>Resolved
[16:17:43] Analytics-Backlog: Clean up mobile-reportcard dashboards {frog} - https://phabricator.wikimedia.org/T104379#1430361 (Milimetric)
[16:19:58] Analytics-Backlog: Clean up mobile-reportcard dashboards {frog} [13 pts] - https://phabricator.wikimedia.org/T104379#1430386 (Milimetric)
[16:22:38] Analytics-Backlog, Analytics-EventLogging: Load test parallel eventlogging-processor {stag} - https://phabricator.wikimedia.org/T104229#1430415 (Milimetric)
[16:26:37] Analytics-Backlog, Analytics-EventLogging: Load test parallel eventlogging-processor {stag} [3 pts] - https://phabricator.wikimedia.org/T104229#1430458 (Milimetric)
[16:28:24] Analytics-Backlog: Sanitize aggregated data presented in VitalSign using K-Anonymity - https://phabricator.wikimedia.org/T104485#1430467 (Milimetric)
[16:42:35] Analytics-Backlog: Sanitize aggregated data presented in VitalSign using K-Anonymity {musk} [8 pts] - https://phabricator.wikimedia.org/T104485#1430567 (Milimetric)
[17:15:15] Analytics-Backlog: Enforce policy for each schema: Sanitize {tick} - https://phabricator.wikimedia.org/T104877#1430750 (Milimetric) NEW
[17:21:35] Analytics-Backlog: Enforce policy for each schema: Sanitize {tick} [8 pts] - https://phabricator.wikimedia.org/T104877#1430795 (Milimetric)
[17:22:52] Analytics-Backlog, Analytics-Dashiki: Improve the edit analysis dashboard {lion} - https://phabricator.wikimedia.org/T104261#1430801 (Milimetric)
[17:41:08] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 20.00% of data above the critical threshold [30.0]
[17:43:00] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK Less than 15.00% above the threshold [20.0]
[17:48:23] milimetric: quick brain bounce?
[17:49:09] mforns: you might be able to help me too :)
[17:49:10] ottomata: finishing meeting
[17:49:16] ah k
[17:49:19] k lemme know
[17:49:46] ottomata, milimetric, I need to go in 11 mins, gym. is this urgent?
[17:49:51] nope
[17:50:05] go!
[17:50:06] :)
[18:00:35] Analytics-Backlog, Wikimania-Hackathon-2015: Dockerize Hadoop Cluster, Druid, and Samza + Load Test - https://phabricator.wikimedia.org/T102980#1431042 (Milimetric)
[18:02:29] hey ottomata, cave?
[18:02:42] k
[18:02:45] coming
[18:03:05] halfak: I don't get why the files are not split by hadoop :(
[18:03:07] hmm, on phone internet
[18:03:17] milimetric: i am super close to home, walking, be there in 5
[18:03:23] I'll have a more detailed look tomorrow, but it doesn't make sense to me
[18:03:28] np
[18:03:30] i'm around
[18:03:34] I'm off for today lads
[18:03:38] kk thanks joal
[18:03:40] See you tomorrow!
[18:03:43] o/
[18:03:45] halfak: np ;
[18:36:55] since http://pentaho.wmflabs.org/ is down, what's currently the best place to get new def pageview data?
[19:58:29] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 20.00% of data above the critical threshold [30.0]
[20:00:39] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK Less than 15.00% above the threshold [20.0]
[20:10:48] Analytics-Backlog, Deployment-Systems, Performance-Team, operations, Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1431847 (Krinkle)
[20:14:05] Analytics-Backlog, Deployment-Systems, Performance-Team, operations, Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1431863 (Krinkle) Presumably by adding a tail subscriber to the varnish stream. Basically we'd collect...
[20:17:12] mforns: you still around?
[20:17:17] milimetric, yea
[20:17:20] something's really weird with these 2-minute EL events
[20:17:37] milimetric, ok, gonna look into it
[20:17:38] ottomata: any clue about that?
[20:17:48] I don't know what we'd look into...
[20:17:55] mmm
[20:17:56] the raw rate drops
[20:18:10] so the alerts get their averages messed up
[20:18:20] milimetric, are you talking about these exact last 2 minutes?
[20:18:26] the validation keeps up fine, so it looks like something on the varnish side
[20:18:43] there have been a dozen or so events like this since Friday, when I first saw it
[20:20:08] Analytics, Analytics-Kanban: Reportupdater: put history and pid files inside the project folder [5 pts] {lamb} - https://phabricator.wikimedia.org/T103385#1431870 (mforns) a:mforns
[20:20:32] i'm looking at graphs
[20:20:42] i don't see a drop in raw rate (maybe i'm looking wrong)
[20:21:22] am looking here
[20:21:23] http://grafana.wikimedia.org/#/dashboard/db/eventlogging
[20:21:24] :)
[20:23:21] also, the alert is specifically about the raw/valid proportion
[20:23:25] you don't get spikes of invalid events?
[20:23:30] ottomata, I don't see any drop, I'm looking in graphite>eventlogging>overall>raw>sum
[20:24:41] I see some differences between raw.sum and valid.sum, concentrated in short spikes
[20:27:27] Analytics-Cluster, operations, ops-eqiad: analytics1020 down - https://phabricator.wikimedia.org/T104856#1431916 (Cmjohnson) Open>Resolved Fixed. Idrac license was missing
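The raw and valid series being compared above can be pulled directly from Graphite's render API for a closer look at the drops; a minimal sketch follows, assuming the eventlogging.overall.raw.rate and eventlogging.overall.valid.rate metric paths, which are inferred from the graphite paths mentioned in the chat rather than confirmed:

```python
# Minimal sketch of fetching the raw and valid rate series from Graphite's
# render API to inspect the drops discussed above. The metric paths are
# assumptions based on the graphite paths mentioned in the chat.
import requests

GRAPHITE_RENDER = "http://graphite.wikimedia.org/render"

def fetch_series(target, window="-6hours"):
    """Return the (timestamp, value) datapoints for one Graphite target."""
    resp = requests.get(
        GRAPHITE_RENDER,
        params={"target": target, "from": window, "format": "json"},
    )
    resp.raise_for_status()
    datapoints = resp.json()[0]["datapoints"]  # list of [value, timestamp] pairs
    return [(ts, val) for val, ts in datapoints if val is not None]

raw = fetch_series("eventlogging.overall.raw.rate")
valid = fetch_series("eventlogging.overall.valid.rate")
```

Lining the two series up by timestamp makes it easy to see whether raw and valid drop together (pointing at varnishncsa or the forwarder) or only valid dips (pointing at validation).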
[20:36:58] mforns / ottomata: sorry it took me so long, graphite's UI is a little confusing
[20:37:11] You can see these violent downspikes here: http://graphite.wikimedia.org/dashboard/#eventlogging.raw-vs-valid
[20:37:32] if you zoom out and look at the data from last Thursday to today you can see there's a pattern
[20:37:39] they seem to happen between 18:00 and 19:00 UTC
[20:37:44] at least in a couple of instances
[20:37:49] and then there are a bunch of random ones
[20:38:19] milimetric, aha
[20:38:33] milimetric, what exactly are raw.rate and valid.rate?
[20:39:01] if raw drops, that means that events inserted into the raw tcp queue drop
[20:39:02] that's what we've always looked at, I think it's events per second
[20:39:09] valid would drop if raw drops too
[20:39:17] since raw is the source of valid events
[20:39:22] ottomata: that graph shows both dropping
[20:39:24] yes
[20:39:32] raw comes originally from varnishncsa via udp
[20:39:44] so a drop could be caused by varnishncsa restarting everywhere
[20:39:49] right, so either there's a problem with that or with the reporter...
[20:39:50] milimetric, aha, makes sense, events per second
[20:39:53] or, it also comes from the forwarder
[20:40:00] so it could be caused by the forwarder restarting
[20:40:13] puppet runs... maybe?
[20:40:14] or ja, the reporter too i guess, could be restarted and cause that
[20:40:16] maybe?
[20:40:34] i think puppet tries to restart eventlogging on config file changes... not sure though
[20:43:09] ottomata, milimetric, I don't think it's related to puppet, otherwise it would also happen during low traffic, but it happens only during high traffic
[20:44:29] maybe, unless there is some weird puppet thing that is making it happen periodically
[20:44:34] but i also doubt puppet
[20:44:59] I recall having spotted some particular invalid events during those validation spikes in the past: very long events containing special characters that, once URI-encoded, became a very long log line that got truncated by varnishncsa and then failed validation because of the truncation
[20:45:13] will try to look for those again
[20:46:14] hm... these are massive drops that make me suspect a more systemic problem like varnish restarts
[20:46:45] also, the raw rate wouldn't drop if it was a truncated event problem
[21:58:48] Analytics, Pywikibot-compat-to-core: Measure current usage of pywikibot-compat vs pywikibot-core - https://phabricator.wikimedia.org/T99373#1432298 (Aklapper) > With the July 1 API breakage about to happen, This has been [[ https://www.mediawiki.org/wiki/Phabricator/Project_management#Setting_task_prior...
[22:03:04] Analytics, Analytics-Backlog, Performance-Team, Patch-For-Review: Collect HTTP statistics about load.php requests - https://phabricator.wikimedia.org/T104277#1432304 (Krinkle) Resolved>Open We ran into some issues with the metrics. Re-opening as reminder to investigate and address.
[22:17:41] bye, team, see ya tomorrow!
[22:57:32] byaaa
[22:57:35] heya milimetric,
[22:57:46] this is ready for preliminary review.
[22:57:46] https://gerrit.wikimedia.org/r/#/c/222064/
[22:57:59] something is broken with it though. somehow, between that patch and a previous one, some of the kafka writer stuff isn't working
[22:58:14] and i didn't catch that while developing it because i wasn't testing the multiprocessing stuff with kafka until I got this far
[22:58:19] soooo, hm. will have to figure that out
[22:58:28] but, the overall structure is ready for review
[23:38:07] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 20.00% of data above the critical threshold [30.0]
[23:38:58] i still don't know how to respond to those alerts :\
[23:39:07] if someone wants to clue me in that would be great
[23:39:34] especially since they happen nearly every day
[23:39:58] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK Less than 15.00% above the threshold [20.0]
[23:40:16] alternatively, if you think the thresholds should be adjusted to avoid false positives, please tell me that
[23:40:33] (i get paged for these)
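For whoever gets paged: reading the bot messages above, the check appears to go CRITICAL when at least 20% of recent datapoints of the raw-vs-valid difference metric exceed 30.0, and to recover once fewer than 15% exceed 20.0. A minimal sketch of how those thresholds combine follows, assuming that reading is right; it is an illustration, not the actual Icinga/Graphite check:

```python
# Minimal sketch of the alert logic as described by the bot messages above:
# CRITICAL when >= 20% of recent datapoints exceed 30.0, OK when < 15% exceed
# 20.0. Illustration only; not the actual Icinga/Graphite check.
def classify(datapoints,
             warn_threshold=20.0, warn_fraction=0.15,
             crit_threshold=30.0, crit_fraction=0.20):
    """Return 'CRITICAL', 'WARNING' or 'OK' for a list of metric values."""
    if not datapoints:
        return "OK"  # assumption: the real check likely treats missing data differently
    above_crit = sum(1 for v in datapoints if v > crit_threshold) / len(datapoints)
    above_warn = sum(1 for v in datapoints if v > warn_threshold) / len(datapoints)
    if above_crit >= crit_fraction:
        return "CRITICAL"
    if above_warn >= warn_fraction:
        return "WARNING"
    return "OK"

# Example: one spike out of ten datapoints is not enough to page.
print(classify([5, 6, 4, 35, 5, 6, 5, 4, 6, 5]))  # -> 'OK' (only 10% above 30.0)
```

If that reading is right, raising the required fraction or requiring the condition to hold for more than one check interval would cut down the two-minute PROBLEM/RECOVERY flaps without hiding sustained validation problems.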