[02:04:05] Analytics, EventBus, Patch-For-Review, Services (done), WMF-deploy-2017-01-24_(1.29.0-wmf.9): EventBus produces non-canonical page urls - https://phabricator.wikimedia.org/T155066#2963812 (Krinkle) >>! In T155066#2950587, @Pchelolo wrote: > I've created a set of interdependent patches to solv...
[09:31:39] Analytics-Kanban, User-Elukey: Ongoing: Give me permissions in LDAP - https://phabricator.wikimedia.org/T150790#2964466 (elukey) @Pcoombe, @jrobell, @MeganHernandez_WMF and @spatton you should all now be able to log in https://pivot.wikimedia.org Please read the following wiki page: https://wikitech.wi...
[10:57:45] Analytics-Kanban, User-Elukey: Ongoing: Give me permissions in LDAP - https://phabricator.wikimedia.org/T150790#2964633 (Pcoombe) Thanks @elukey, it's working for me. This is great!
[11:40:53] hi team :]
[11:42:23] o/
[12:04:52] * elukey lunch!
[13:35:30] (PS1) Elukey: Update AQS scap configuration [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/333905
[13:37:53] (PS2) Elukey: Update AQS scap configuration [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/333905
[13:38:55] joal: --^
[13:40:16] (CR) Joal: [C: 1] "LGTM, but no expert in SCAP, so let's have another opinion !" [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/333905 (owner: Elukey)
[13:40:19] elukey: --^
[13:40:23] :)
[13:40:25] thanks!
[13:40:27] np !
[13:40:35] Thank you for handling the mess yesterday evening
[13:40:58] ottomata did the magic with its hammer
[13:41:00] :D :D :D
[13:41:03] elukey: I'd like a confirmation: SCAP does rolling deploy, no? (meaning, it deploys one server at a time)
[13:41:05] *his
[13:41:07] :D
[13:41:27] afaik yes when using explicit depool/re-pool, but I'll ask Marko
[13:41:41] elukey: not talking about pooling here
[13:41:50] ah in general
[13:41:57] elukey: yes, in general
[13:42:27] elukey: I remembered scap logging to say: doing that one server, then when done, doing that other one etc
[13:42:28] I think that it tries to maximize concurrency
[13:42:40] hm ... concurrency
[13:42:48] and how is it defined in scap elukey ?
[13:42:56] I have no idea
[13:42:57] :D
[13:42:59] :D
[13:45:00] reading https://doc.wikimedia.org/mw-tools-scap/
[13:48:15] maybe I was confusing scap deploy with what is used to deploy mediawiki
[13:48:26] so I am 90% sure that scap does one host at a time
[13:49:06] elukey: this is my understanding as well, but I'd love to have that confirmed to, let's say, 99.9% ? ;)
[13:49:31] elukey: We shall ask the release-eng team
[13:50:47] https://doc.wikimedia.org/mw-tools-scap/scap3/architecture.html#process-model
[13:50:54] Concurrency for each stage can be either completely serial or highly parallel, again depending on configuration. For fine tuning of the groups and stage concurrency, see server_groups and batch_size under Available configuration variables.
[13:51:12] Ahhh !
[13:51:17] there we go elukey :)
[13:53:23] so from the config, it seems aqs is actually deployed in a fully parallelized fashion, correct?
[13:53:29] elukey: --^
[13:55:17] joal: yep, and we have no canary afaics
[13:55:23] hm
[13:56:59] anyway, thanks a lot for finding that info elukey!
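For reference, the server_groups and batch_size knobs quoted above ([13:50:54]) live in the repository's scap configuration. A minimal sketch of a canary-first, one-host-at-a-time setup (values are illustrative, not the actual analytics/aqs/deploy config; the target file names are assumptions):

    [scap]
    git_repo: analytics/aqs/deploy
    # deploy to the canary group before the rest
    server_groups: canary, default
    # hypothetical file listing just the canary host (e.g. aqs1004)
    canary_dsh_targets: targets-canary
    # file listing the remaining hosts
    dsh_targets: targets
    # hosts deployed concurrently within each group; 1 = serial
    batch_size: 1

With batch_size: 1 each group deploys serially; raising it deploys that many hosts per batch, which is how a repo ends up "fully parallelized".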
[13:57:20] elukey: I don't know why I recall scap doing one server at a time :(
[13:58:21] I was kinda convinced of that too, but then I thought that it wouldn't make sense for mediawiki or bigger services
[13:58:29] citoid afaics uses group_size 2
[13:58:36] so batches of two hosts at a time
[13:59:04] (doing all the stages)
[13:59:17] elukey: I was pretty sure scap could deploy in a parallel way, but was convinced it was serial by default
[14:00:12] I think that we'd need the canary too
[14:00:15] we usually do that
[14:00:21] test on aqs1004, then proceed
[14:05:53] (PS3) Elukey: Update AQS scap configuration [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/333905
[14:06:22] joal: --^
[14:06:44] maybe we can debate about group_size: 1 vs group_size: 2
[14:06:52] with the canary I don't see a major problem
[14:07:10] it is great to think about the past deployments though :D
[14:07:27] luckily we used scap-deploy --limit a lot of times
[14:07:33] elukey: correct
[14:07:42] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:07:46] elukey: I think it's because of the --limit that I recall serial deploys :)
[14:08:46] elukey: I think I like the conf existing now (canary + serial :)
[14:09:10] elukey: I wonder about having aqs1004 in both prod and canary - should it be in a single place ?
[14:09:30] (CR) Joal: [C: 1] "Even better ! Still need to be double checked though :)" [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/333905 (owner: Elukey)
[14:09:42] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[14:14:59] joal: good point!
[14:15:36] about the alarm above: I am pretty sure that the script checking the node manager state counts a temporary state returned by yarn as faulty and alarms
[14:15:41] elukey: refinery is deployed with scap and uses stat1002 as canary, if you wish an example
[14:16:10] yes yes you are right
[14:16:13] fixing it
[14:16:55] elukey: about yarn, possible - the cluster seems busy as well, so maybe one node just forgot to answer for a few seconds
[14:16:59] (PS4) Elukey: Update AQS scap configuration [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/333905
[14:21:30] hmm, joal: https://github.com/HotelsDotCom/jasvorno
[14:21:33] iiinteresting
[14:21:58] rooooooh ottomata!!!
[14:22:31] ottomata: webrequests?
[14:23:22] ottomata: let's imagine varnish-kafka uses that lib, how much space would we save on kafka/hdfs?
[14:23:47] i don't think we could do it from varnishkafka really, but perhaps camus could use it
[14:23:50] or whatever we choose
[14:24:47] that looks really really simple, i betcha it would be v easy to create a RecordWriterProvider for it
[14:24:56] OR, i wonder if it would make integration into kafka connect easier
[14:25:12] could be used for eventlogging stuff!
[14:25:13] :o
[14:25:57] especially if something like https://github.com/fge/json-schema-avro actually works
[14:26:53] ottomata: integration into kafka connect, meaning schema server etc built for us! I think I like this idea :)
[14:27:17] also ottomata, thanks for the tranquility thing - This data updated live is so awesome :)
[14:27:49] I think FRTech is going to ask for this in prod ASAP ;)
[14:28:49] we should watch this spark streaming thing for a while before we commit to that :)
[14:28:58] but ya, it looks like it isn't really taking that many resources
[14:29:00] ottomata: for sure we should
[14:29:18] ottomata: so far, 4 tasks of 1Go + 1CPU each
[14:29:26] ottomata: and no delay
[14:29:50] ya
[14:49:34] Analytics, CirrusSearch, Discovery, Discovery-Search: Load cirrussearch data into druid - https://phabricator.wikimedia.org/T156037#2962176 (JAllemandou) Quick data volume checks: - How much data would this dataset represent (# lines, # Go + file format + compression, # fields) - How much variabil...
[14:55:35] joal: Go?
[14:55:42] is that a diff notation for GB?
[14:56:02] ottomata: it's the french version, sorry :)
[14:56:10] aye :)
[14:56:23] ottomata: in french, we say octet for byte
[14:56:55] ottomata: corrected! Thanks
[14:58:57] Analytics, Analytics-Cluster: WSC data in a cube - https://phabricator.wikimedia.org/T76093#2965294 (Ottomata) Open>declined
[15:00:03] elukey: I'm going to try and upgrade deployment-aqs01 to node 6. The way I did it on my machine was to uninstall, purge, and reinstall from instructions on their site: https://nodejs.org/en/download/package-manager/
[15:00:12] as in curl -sL https://deb.nodesource.com/setup_6.x | sudo -E bash -
[15:00:46] is there some other procedure on labs, like a package repository that I don't know about?
[15:01:15] Analytics, Analytics-Cluster, DBA: Purge MobileWebWikiGrok_* and MobileWebWikiGrokError_* rows older than 90 days - https://phabricator.wikimedia.org/T77918#2965310 (Ottomata) Open>declined
[15:01:54] Analytics, Analytics-Cluster: Analyst has a table of Last-Access counts {bear} - https://phabricator.wikimedia.org/T101004#2965314 (Ottomata) Open>declined
[15:02:23] milimetric: nononono :)
[15:02:29] heh, ok
[15:02:34] no worries, it's why I pinged you
[15:03:13] Analytics, Wikimedia-General-or-Unknown: Browser and platform stats for logged-in vs. anon users for security and product support decisions - https://phabricator.wikimedia.org/T58575#2965317 (Ottomata)
[15:03:33] Analytics, Analytics-Cluster: Write wikitech spark tutorial - https://phabricator.wikimedia.org/T93111#2965318 (Ottomata) Open>declined Wikitech has an eventlogging spark tutorial
[15:04:47] milimetric: done :)
[15:05:02] Analytics, Labs, Pageviews-API, wikitech.wikimedia.org: wikitech.wikimedia.org missing from pageviews API - https://phabricator.wikimedia.org/T153821#2965325 (Milimetric) No, we should leave it open and blocked on wikitech being set up properly. We could of course collect pageviews via some othe...
[15:05:05] apt-get update && apt-get install nodejs nodejs-dev nodejs-legacy
[15:05:18] ii nodejs 6.9.1~dfsg-1+wmf1 amd64 evented I/O for V8 javascript
[15:05:21] ii nodejs-dev 6.9.1~dfsg-1+wmf1 amd64 evented I/O for V8 javascript (development files)
[15:05:24] ii nodejs-legacy 6.9.1~dfsg-1+wmf1 all evented I/O for V8 javascript (legacy symlink)
[15:05:41] elukey: oh ok, so these machines are set up already to go against a different package manager
[15:05:55] is that public? Can I set up my box to use it?
[15:05:58] elukey: can you look at this? i can't remember if this is something that affects us. if you fixed it, we can close it
[15:05:59] https://phabricator.wikimedia.org/T71615
[15:06:23] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful - https://phabricator.wikimedia.org/T72085#2965334 (Ottomata)
[15:06:26] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful due to only esams caches causing unknown problems - https://phabricator.wikimedia.org/T74809#2965331 (Ottomata) Open>Resolved a: Ottomata Resolving these, as we don't use them.
[15:06:31] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful due to depooled servers interfering with monitoring - https://phabricator.wikimedia.org/T74649#2965338 (Ottomata) Open>Resolved a: Ottomata Resolving these, as we don't use them.
[15:06:33] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful due to configuration updates - https://phabricator.wikimedia.org/T74300#2965341 (Ottomata)
[15:06:40] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful due to configuration updates - https://phabricator.wikimedia.org/T74300#2965348 (Ottomata)
[15:06:43] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful due to deployments gone wrong - https://phabricator.wikimedia.org/T74299#2965345 (Ottomata) Open>Resolved a: Ottomata Resolving these, as we don't use them.
[15:06:47] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful - https://phabricator.wikimedia.org/T72085#2965354 (Ottomata)
[15:06:51] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful due to configuration updates - https://phabricator.wikimedia.org/T74300#762724 (Ottomata) Open>Resolved a: Ottomata Resolving these, as we don't use them.
[15:06:54] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful - https://phabricator.wikimedia.org/T72085#722441 (Ottomata)
[15:06:54] milimetric: we package node in debs, you can use the wikimedia repositories like https://wikitech.wikimedia.org/w/index.php?title=APT_repository
[15:06:57] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful due to network issues - https://phabricator.wikimedia.org/T74298#2965356 (Ottomata) Open>Resolved a: Ottomata Resolving these, as we don't use them.
[15:07:05] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful - https://phabricator.wikimedia.org/T72085#722441 (Ottomata)
[15:07:07] Analytics, Analytics-Cluster: Kafka partition leader elections causing a drop of a few log lines - https://phabricator.wikimedia.org/T72087#2965362 (Ottomata) Open>Resolved a: Ottomata Resolving these, as we don't use them.
[15:07:11] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful - https://phabricator.wikimedia.org/T72085#722441 (Ottomata) Open>Resolved a: Ottomata Resolving these, as we don't use them.
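As an aside on the apt question above: on a Debian box the Wikimedia repository elukey links can be added roughly like this. A sketch only; the distribution name must match the release ("jessie-wikimedia" is an assumption here), and whether the nodejs packages sit in the "main" component is also an assumption, see the APT repository wikitech page linked above:

    echo 'deb http://apt.wikimedia.org/wikimedia jessie-wikimedia main' | sudo tee /etc/apt/sources.list.d/wikimedia.list
    sudo apt-get update
    sudo apt-get install nodejs nodejs-dev nodejs-legacy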
[15:07:19] ooh, awesome, thanks elukey
[15:07:28] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful - https://phabricator.wikimedia.org/T72085#2965375 (Ottomata)
[15:07:30] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful due to oozie being overwhelmed - https://phabricator.wikimedia.org/T85704#2965372 (Ottomata) Open>Resolved a: Ottomata Resolving these, as we don't use them.
[15:07:43] ottomata: never seen it before :(
[15:08:10] Analytics, Analytics-Cluster, Epic: Epic: Webstats Collector is replaced in Refinery - https://phabricator.wikimedia.org/T70963#2965378 (Ottomata) Open>Resolved a: Ottomata
[15:08:12] hah, elukey yeah, it's something old
[15:08:25] i'm cleaning out some analytics cluster backlog
[15:08:36] but, the ticket is related to sequence numbers vs time bucketing in hive
[15:08:57] ah weird
[15:08:57] i know we worked on that in the spring/summer around the varnish upgrade, just can't remember off the top of my head what we did
[15:09:00] and if it is still a problem
[15:09:26] Analytics, Analytics-Cluster, Epic: Epic: AnalyticsEng has fully dimentionalized Page View counts - https://phabricator.wikimedia.org/T70966#2965396 (Ottomata) Open>Resolved a: Ottomata Good ol' druid.
[15:09:33] so we definitely switched timestamp, from Start to End
[15:09:39] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful - https://phabricator.wikimedia.org/T72085#2965404 (Ottomata)
[15:09:42] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful due to single message being missing for unknown reason - https://phabricator.wikimedia.org/T76977#2965401 (Ottomata) Open>Resolved a: Ottomata Resolving these, as we don't use them.
[15:10:09] and we have also switched monitoring
[15:10:13] thanks to mforns
[15:10:30] elukey, what?
[15:10:34] there were tons of other issues with vk but nothing specific to timestamps other than these IIRC
[15:10:55] mforns: sorry for the extra ping, I was mentioning your work for the sequence number alarms
[15:10:58] :)
[15:11:24] oh! np for the ping, hehehe
[15:11:29] Analytics, Analytics-Cluster: productionizing xmldump -> avro jobs - https://phabricator.wikimedia.org/T78404#2965409 (Ottomata) Open>Invalid Not enough detail, and no real plan to do this as is. There may be other ways of getting revision content into HDFS that we will explore in the future.
[15:11:45] hm, ok, i'll leave it open then and note that
[15:11:46] thanks elukey
[15:12:09] oh actually
[15:12:14] this ticket is exactly what mforns did i think
[15:12:18] fixing the validation/alarms to deal with this
[15:12:35] \o/
[15:12:44] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful - https://phabricator.wikimedia.org/T72085#2965417 (Ottomata)
[15:12:47] Analytics, Analytics-Cluster, Patch-For-Review: Make webrequest partition validation handle races between time and sequence numbers - https://phabricator.wikimedia.org/T71615#2965414 (Ottomata) Open>Resolved a: Ottomata @mforns did some work to make our validation and alarms account for th...
[15:19:22] elukey: ok, aqs works perfectly fine under node 6, I'll assign the upgrade to you
[15:19:53] Analytics, CirrusSearch, Discovery, Discovery-Search: Load cirrussearch data into druid - https://phabricator.wikimedia.org/T156037#2962176 (dcausse) I'll start, for the dimensions I'd like to have: - query_type: single valued string (# of distinct values is around 10) - syntax_used: multivalued...
[15:20:00] Analytics-Kanban, Services (blocked), User-mobrovac: Upgrade AQS to node 6 - https://phabricator.wikimedia.org/T155642#2965427 (Milimetric) a: Milimetric>elukey
[15:20:35] milimetric: sure!
[15:20:36] Analytics-Kanban, Services (blocked), User-mobrovac: Upgrade AQS to node 6 - https://phabricator.wikimedia.org/T155642#2949682 (Milimetric) Tested in beta cluster after @elukey upgraded, AQS works perfectly well on node 6.
[15:21:18] milimetric: I'd like to make sure that basic host metrics like CPU etc.. do not go crazy
[15:21:44] but my plan is to upgrade aqs1004, wait a day, review and complete the work
[15:22:52] sounds reasonable, yeah, I could build some fake load on beta cluster if you want to be more careful before trying it in prod
[15:22:57] let me know
[15:24:14] milimetric: if you have time yes please :)
[15:25:54] ok, sure, hm... I have no comparison with old node 4 though. So I'll just make sure it doesn't blow up under load.
[15:30:57] yeah
[15:41:10] (CR) Mforns: [C: 1] Update oozie job loading pageview in druid [analytics/refinery] - https://gerrit.wikimedia.org/r/333668 (owner: Joal)
[15:44:34] (PS1) Fdans: [wip] Adds map visualizer to Dashiki [analytics/dashiki] - https://gerrit.wikimedia.org/r/333922 (https://phabricator.wikimedia.org/T153921)
[15:47:08] elukey: basic load test looks ok
[15:47:19] used apache bench to throw 2000 requests at it, with different concurrency
[15:47:48] memory and cpu go up normally and come back down, requests get processed ok
[15:48:08] nice :)
[15:48:13] with concurrency 10 longest wait is 1 second, with concurrency 100 longest wait is 13 seconds
[15:48:29] is it known whether pivot as running on thorium is compatible with nodejs 6?
[15:48:39] it's among the systems still on nodejs 4
[15:48:50] moritzm: should be, pivot does very little in node world
[15:49:03] moritzm: I will test locally though and let you know
[15:50:01] ok, thanks :-)
[15:51:06] moritzm: yep, works fine on node 6
[15:52:00] milimetric: shall I upgrade it on thorium right away?
[15:52:16] moritzm: yep, I can test once you do, let you know if anything went wrong
[15:52:22] k, doing so
[15:52:27] thx
[15:52:54] done and restarted, let me know if anything breaks
[15:52:58] checking
[15:53:05] (PS5) Mforns: [WIP] Add banner impressions jobs [analytics/refinery] - https://gerrit.wikimedia.org/r/331794 (https://phabricator.wikimedia.org/T155141)
[15:53:07] Analytics-EventLogging, Analytics-Kanban, Performance-Team, Performance, Regression: EventLogging schema modules take >1s to build (max: 22s) - https://phabricator.wikimedia.org/T150269#2965545 (Ottomata) I just read over the code too, and I agree that locking code looks suspicious. If a bun...
[15:54:05] looks good moritzm, thank you
[15:54:22] great, thanks :-)
[15:55:35] thanks!
[15:56:57] Analytics-Kanban: Move reportcard to dashiki and new datasources - https://phabricator.wikimedia.org/T130117#2965551 (mforns) a: mforns
[16:01:06] fdans: stadduppp
[16:04:34] (CR) Ottomata: [C: 1] "I think group_size: 2 would be ok, no? But no biggy, group_size: 1 works too." [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/333905 (owner: Elukey)
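The quick load test milimetric describes above ([15:47]) can be reproduced with apache bench; a sketch only, where the host, port and pageviews path are placeholders rather than the real beta cluster endpoint:

    # 2000 requests total, 10 concurrent, against an AQS pageviews endpoint
    ab -n 2000 -c 10 'http://<aqs-host>:<port>/analytics.wikimedia.org/v1/pageviews/...'
    # same total at concurrency 100, to see how latency degrades under load
    ab -n 2000 -c 100 'http://<aqs-host>:<port>/analytics.wikimedia.org/v1/pageviews/...'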
[16:18:28] Analytics-Kanban: Improve AQS deployment - https://phabricator.wikimedia.org/T156049#2965615 (Nuria) a: elukey
[16:30:50] fdans: review?
[16:31:01] yeah!
[16:31:25] milimetric: batcave2?
[16:31:37] no, batcave's free
[16:32:16] fdans: ^
[16:32:17] (CR) Mobrovac: [C: -1] "One thing missing, otherwise LGTM" (1 comment) [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/333905 (owner: Elukey)
[16:36:04] (PS5) Elukey: Update AQS scap configuration [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/333905 (https://phabricator.wikimedia.org/T156049)
[16:37:17] (CR) Mobrovac: [C: 1] Update AQS scap configuration [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/333905 (https://phabricator.wikimedia.org/T156049) (owner: Elukey)
[16:41:54] a-team: aqs1004 is serving traffic with node6, will upgrade the rest of the cluster tomorrow if metrics are good
[16:50:00] all good from https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=aqs1004
[16:50:06] and host is serving traffic fine
[16:56:59] (PS13) Joal: Add mediawiki history spark jobs to refinery-job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/325312 (https://phabricator.wikimedia.org/T141548)
[16:58:09] (CR) Joal: "Comments inside. I implemented most changes from comments, please have a look to the ones you're interested in !" (46 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/325312 (https://phabricator.wikimedia.org/T141548) (owner: Joal)
[17:09:33] Anyone seen ottomata this morning?
[17:09:42] "morning" == morning in UTC-6
[17:15:14] halfak: He's been around already a bit, probably at lunch
[17:15:24] Gotcha. Cool.
[17:15:31] I'll hang out :)
[18:00:02] a-team: stafffffff
[18:04:57] FYI, I'll be installing mysql security updates on bohrium/piwik in a few minutes, should have no impact, just FYI
[18:05:22] thanks moritz
[18:06:49] and completed
[18:09:29] Analytics-Kanban, User-Elukey: Ongoing: Give me permissions in LDAP - https://phabricator.wikimedia.org/T150790#2796693 (MelodyKramer) Hello! I am a WMF employee and would like to request access to pivot. My Wikitech username is melodykramer Please let me know if you have any questions! mkramer@wikimed...
[18:18:45] o/ ottomata
[18:18:54] I have a little slide deck for our meeting today. OK?
[18:19:04] I think I could go through it in 5 mins without interruption
[18:19:09] re. ReviewStream ^
[18:19:12] & state tables.
[18:19:33] ya sounds great halfak, we'll do you early/first
[18:19:38] great :)
[19:02:05] (CR) Nuria: [V: 1 C: 1] "Thank you. Will wait for marco's last look to merge" [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/333905 (https://phabricator.wikimedia.org/T156049) (owner: Elukey)
[19:03:13] nuria: --^ Marco +1ed already, I fixed the issue right after he reviewed it.. ready to go :)
[19:03:19] elukey: k
[19:03:26] (CR) Nuria: [V: 2 C: 2] Update AQS scap configuration [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/333905 (https://phabricator.wikimedia.org/T156049) (owner: Elukey)
[19:04:09] milimetric: slides look real good
[19:04:13] milimetric: point well made
[19:04:26] good, thank you for checking them out
[19:04:32] milimetric: only one change, pivot slide is too low res i think. I can redo if you tell me teh url
[19:04:34] *the
[19:04:39] you wnat to display
[19:04:41] *want
[19:04:55] hm, bummer, looked good on mine
[19:05:03] (getting URL)
[19:06:00] going afk people, byeee o/
[19:06:06] bye elukey
[19:07:04] btw nuria the issue with karma is sorted out, no need to take a look at it :)
[19:07:47] fdans: great, thank you.
[19:41:34] wikimedia/mediawiki-extensions-EventLogging#631 (wmf/1.29.0-wmf.9 - 04e3fe4 : Translation updater bot): The build has errored.
[19:41:34] Change view : https://github.com/wikimedia/mediawiki-extensions-EventLogging/compare/wmf/1.29.0-wmf.9
[19:41:34] Build details : https://travis-ci.org/wikimedia/mediawiki-extensions-EventLogging/builds/194938409
[19:57:04] Hmm, I was trying to run a hive query, but decided to kill it. Got the following while attempting to: https://phabricator.wikimedia.org/P4799
[19:57:25] (CR) Nuria: [V: 2 C: 2] Update oozie job loading pageview in druid [analytics/refinery] - https://gerrit.wikimedia.org/r/333668 (owner: Joal)
[19:58:12] (beehive, I guess, but shouldn't matter for this bit)
[19:58:21] Analytics-Kanban: Follow naming convention on druid jobs: ts for long unix timestamps, dt for ISO. - https://phabricator.wikimedia.org/T156170#2966359 (Nuria)
[19:59:40] (CR) Nuria: "I think this needs a ticket so we be on the lookout about these changes on our next cluster deployment. I have created one. Please be so k" [analytics/refinery] - https://gerrit.wikimedia.org/r/333668 (owner: Joal)
[20:13:28] (PS2) Joal: Update oozie job loading pageview in druid [analytics/refinery] - https://gerrit.wikimedia.org/r/333668 (https://phabricator.wikimedia.org/T156170)
[20:14:00] (CR) Joal: "Done @nuria :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/333668 (https://phabricator.wikimedia.org/T156170) (owner: Joal)
[20:14:48] ostriches:
[20:14:49] https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Queries#Killing_a_running_query
[20:14:56] you want to kill the yarn application, not the mapred job
[20:15:32] A-ha!
[20:15:34] Thanks :)
[20:15:41] (was just following what beehive told me :))
[20:15:50] beehive tells you to do that?
[20:15:57] (i haven't used beehive that much...)
[20:16:09] INFO : Starting Job = job_1480065021448_204598, Tracking URL = http://analytics1001.eqiad.wmnet:8088/proxy/application_1480065021448_204598/
[20:16:09] INFO : Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1480065021448_204598
[20:16:14] When starting a job
[20:16:39] Then when trying to run that hadoop bit, you get the piece I pastebin'd
[20:17:00] Gone for now a-team, see you tomorrow :)
[21:25:52] PROBLEM - Hadoop HistoryServer on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer
[21:27:05] hm!
[21:27:50] java.lang.OutOfMemoryError: Java heap space
[21:27:55] never had that with history server before!
[21:28:07] Analytics, Analytics-General-or-Unknown: Number of Wikipedia Zero increasing drastically in mid March 2014 - https://phabricator.wikimedia.org/T64848#2966767 (scfc) Is someone still looking into this?
[21:28:52] RECOVERY - Hadoop HistoryServer on analytics1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer
[21:30:07] !log restarted hadoop-mapreduce-historyserver on analytics1001. it died due to OOM
[21:30:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
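Tying together the exchange at [20:14] above: the job id that hive prints maps directly to a YARN application id, and the kill goes through the yarn CLI rather than the hadoop job command beehive suggests. A sketch using the id from ostriches' paste:

    # list running applications to find yours
    yarn application -list
    # job_1480065021448_204598 corresponds to this application id
    yarn application -kill application_1480065021448_204598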
[21:32:34] Analytics, Analytics-General-or-Unknown: Number of Wikipedia Zero increasing drastically in mid March 2014 - https://phabricator.wikimedia.org/T64848#2966797 (DFoy) Can we identify which partner (X-CS) is responsible for the increase at that time? I can look into more details once I have that information.
[21:45:10] Analytics-Kanban, Fundraising-Backlog, Patch-For-Review: Productionize banner impressions druid/pivot dataset - https://phabricator.wikimedia.org/T155141#2966828 (mforns) @AndyRussG What are your thoughts about @JAllemandou 's comments on the patch? Would a minutely resolution be interesting for you?...
[21:56:55] Analytics-Kanban, Patch-For-Review: Follow naming convention on druid jobs: ts for long unix timestamps, dt for ISO. - https://phabricator.wikimedia.org/T156170#2966359 (Nuria)
[21:57:16] (CR) Nuria: [V: 2 C: 2] Update oozie job loading pageview in druid [analytics/refinery] - https://gerrit.wikimedia.org/r/333668 (https://phabricator.wikimedia.org/T156170) (owner: Joal)
[21:58:15] Analytics-Kanban, Patch-For-Review: Follow naming convention on druid jobs: ts for long unix timestamps, dt for ISO. - https://phabricator.wikimedia.org/T156170#2966359 (Nuria) a: JAllemandou
[21:58:37] Analytics-Kanban, EventBus, Wikimedia-Stream, Patch-For-Review: Set charset=utf-8 in Content-Type response header from sse.js client - https://phabricator.wikimedia.org/T154328#2966900 (Nuria) Open>Resolved
[21:58:40] Analytics-Kanban, EventBus, Wikimedia-Stream, Services (watching), User-mobrovac: Public Event Streams - https://phabricator.wikimedia.org/T130651#2966901 (Nuria)
[21:59:03] Analytics-Kanban: Replacing standard edit metrics in dashiki with data from new edit data depot - https://phabricator.wikimedia.org/T143924#2966904 (Nuria)
[21:59:06] Analytics-Kanban: Run Standard metrics on denormalized history and compare with wikistats - https://phabricator.wikimedia.org/T150023#2966903 (Nuria) Open>Resolved
[21:59:20] Analytics-Kanban, Continuous-Integration-Config, EventBus, Release-Engineering-Team, Wikimedia-Stream: Improve tests for KafkaSSE - https://phabricator.wikimedia.org/T150436#2966905 (Nuria) Open>Resolved
[21:59:24] Analytics-Kanban, EventBus, Wikimedia-Stream, Services (watching), User-mobrovac: Public Event Streams - https://phabricator.wikimedia.org/T130651#2537741 (Nuria)
[22:00:25] hey! is anyone from analytics able to help me troubleshoot an error from running a hive query? http://pastebin.com/SQE1skgY I was able to successfully run this query over the weekend, but now after running for ~4 hours, i get an error. my guess is that it's either a temporary out of memory error on one of the reducers or some inconsistencies in the webrequest table
[22:01:04] zareen, try:
[22:01:20] yarn application -logs application_1480065021448_203470
[22:02:36] Analytics, ChangeProp, Edit-Review-Improvements-ReviewStream, EventBus, and 4 others: Set up the foundation for the ReviewStream feed - https://phabricator.wikimedia.org/T143743#2966929 (Ottomata) Today we had a ReviewStream meeting. We had originally planned to talk about how the 'review-stream...
[22:03:33] ottomata: thanks for helping me out. here's what i get when i try that: http://pastebin.com/YvU5VDvg
[22:04:58] durrrr
[22:05:14] OH
[22:05:17] backwards sorry
[22:05:30] yarn logs -applicationId application_1480065021448_203470
[22:05:33] that zareen^
[22:06:53] ottomata, trying that now.
[22:07:14] you'll probably get a lot of output
[22:07:30] yup
[22:14:15] ottomata, it's still going
[22:15:02] whoa, yikes
[22:20:23] OH
[22:20:25] zareen!
[22:20:25] i think i know
[22:20:28] https://yarn.wikimedia.org/jobhistory/attempts/job_1480065021448_203470/m/FAILED
[22:20:38] org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:251) ... 11 more Caused by: java.io.FileNotFoundException: File does not exist: hdfs://analytics-hadoop/wmf/data/wmf/webrequest/webrequest_source=text/year=2016/month=11/day=23/hour=12/000058_0
[22:20:47] year=2016/month=11/day=23/hour=12
[22:20:49] is old!
[22:20:51] probably
[22:20:59] the script that deletes old data deleted the file out from under you
[22:20:59] ?
[22:21:21] i'm not sure if that is what caused your entire job to die, but it at least failed some mappers
[22:21:25] oh, it's purged since it's past 60 days?
[22:21:38] ya
[22:23:41] in my query, i don't specify the days/hours to filter by so i assumed it would just give me results for what is in the table and not cause the job to fail
[22:24:02] ottomata, zareen : Heya
[22:24:10] indeed ottomata your analysis is correct
[22:24:34] ja zareen, i think we don't have many people querying the full dataset
[22:24:58] it happened before: during a long running job, hive lists the files it works on at the beginning of the job, and if they get deleted in the middle of it, the entire job fails
[22:25:05] zareen: --
[22:25:16] ah, i see
[22:25:21] joal any idea how to get around this?
[22:25:29] avoid querying the whole dataset?
[22:25:37] where month != 11
[22:25:38] :p
[22:25:38] ?
[22:26:07] i guess i can specify days and only include full days where there would be data
[22:26:13] zareen: we have 62 days of data before deletion
[22:26:26] i'll try that :) weird though, i've run that query before with no issues
[22:26:57] zareen: from your query that has failed: ((year = 2016 AND (month = 12 OR month = 11)) OR (year = 2017))
[22:27:06] the delete job only runs occasionally, so your job and it would have to coincide
[22:27:30] just update it to: ((year = 2016 AND month = 12) OR (year = 2017))
[22:27:42] np zareen :)
[22:27:46] back to sleep mode :)
[22:27:55] ah, okay!
[22:31:41] last thing zareen: when you do requests like that over the full webrequest text partition, you read and scan 50Tb of data - please be so kind as to do them wisely, this is a lot of resources :)
[22:33:07] joal, will keep that in mind!
[23:28:27] zareen: that query seems like a good candidate to sample data instead of computing everything.
[23:28:40] zareen: here's an example of using TABLESAMPLE: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Counting_uniques#Hive
[23:40:08] Analytics, Analytics-EventLogging, ArchCom-RfC, Discovery, and 10 others: RFC: Use YAML instead of JSON for structured on-wiki content - https://phabricator.wikimedia.org/T147158#2967302 (Mholloway)
[23:43:54] Analytics, Analytics-EventLogging, Wikipedia-iOS-App-Backlog, iOS-app-Bugs: EL unable to decode mobile events due to appinstallid - https://phabricator.wikimedia.org/T96940#2967325 (Mholloway) This is pretty old. Is it still happening or can we close it? Tagging iOS since the events in the desc...
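Putting nuria's sampling suggestion next to joal's partition-pruning advice, a query over webrequest might combine both. A sketch only: the columns and the 1-in-256 sampling ratio are illustrative, and the date predicate must stay inside the ~62-day retention window:

    SELECT uri_host, COUNT(*) AS requests
    FROM wmf.webrequest TABLESAMPLE (BUCKET 1 OUT OF 256 ON rand()) w
    -- prune partitions so the purge job cannot delete files out from under the query
    WHERE webrequest_source = 'text'
      AND ((year = 2016 AND month = 12) OR year = 2017)
    GROUP BY uri_host;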