[02:04:05] Analytics, EventBus, Patch-For-Review, Services (done), WMF-deploy-2017-01-24_(1.29.0-wmf.9): EventBus produces non-canonical page urls - https://phabricator.wikimedia.org/T155066#2963812 (Krinkle) >>! In T155066#2950587, @Pchelolo wrote: > I've created a set of interdependent patches to solv...
[09:31:39] Analytics-Kanban, User-Elukey: Ongoing: Give me permissions in LDAP - https://phabricator.wikimedia.org/T150790#2964466 (elukey) @Pcoombe, @jrobell, @MeganHernandez_WMF and @spatton you should all now be able to log in https://pivot.wikimedia.org Please read the following wiki page: https://wikitech.wi...
[10:57:45] Analytics-Kanban, User-Elukey: Ongoing: Give me permissions in LDAP - https://phabricator.wikimedia.org/T150790#2964633 (Pcoombe) Thanks @elukey, it's working for me. This is great!
[11:40:53] hi team :]
[11:42:23] o/
[12:04:52] * elukey lunch!
[13:35:30] (PS1) Elukey: Update AQS scap configuration [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/333905
[13:37:53] (PS2) Elukey: Update AQS scap configuration [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/333905
[13:38:55] joal: --^
[13:40:16] (CR) Joal: [C: 1] "LGTM, but no expert in SCAP, so let's have another opinion !" [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/333905 (owner: Elukey)
[13:40:19] elukey: --^
[13:40:23] :)
[13:40:25] thanks!
[13:40:27] np !
[13:40:35] Thank you for handling the mess yesterday evening
[13:40:58] ottomata did the magic with its hammer
[13:41:00] :D :D :D
[13:41:03] elukey: I'd like a confirmation: SCAP does rolling deploy, no? (meaning, it deploys one server at a time)
[13:41:05] *his
[13:41:07] :D
[13:41:27] afaik yes when using explicit depool/re-pool, but I'll ask Marko
[13:41:41] elukey: not talking about pooling here
[13:41:50] ah in general
[13:41:57] elukey: yes, in general
[13:42:27] elukey: I remembered scap logging to say: doing that one server, then when done, doing that other one etc
[13:42:28] I think that it tries to maximize concurrency
[13:42:40] hm ... concurrency
[13:42:48] and how is it defined in scap elukey ?
[13:42:56] I have no idea
[13:42:57] :D
[13:42:59] :D
[13:45:00] reading https://doc.wikimedia.org/mw-tools-scap/
[13:48:15] maybe I was confusing scap deploy with what is used to deploy mediawiki
[13:48:26] so I am 90% sure that scap does one host at a time
[13:49:06] elukey: this is my understanding as well, but I'd love to have that confirmed to, let's say, 99.9% ? ;)
[13:49:31] elukey: We shall ask the release-eng team
[13:50:47] https://doc.wikimedia.org/mw-tools-scap/scap3/architecture.html#process-model
[13:50:54] Concurrency for each stage can be either completely serial or highly parallel, again depending on configuration. For fine tuning of the groups and stage concurrency, see server_groups and batch_size under Available configuration variables.
[13:51:12] Ahhh !
[13:51:17] there we go elukey :)
[13:53:23] so from the config, it seems aqs is actually deployed in a fully parallelized fashion, correct?
[13:53:29] elukey: --^
[13:55:17] joal: yep, and we have no canary afaics
[13:55:23] hm
[13:56:59] anyway, thanks a lot for finding that info elukey!
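For reference, the server_groups and batch_size knobs quoted above ([13:50:54]) live in the repository's scap configuration. A minimal sketch of a canary-first, one-host-at-a-time setup (values are illustrative, not the actual analytics/aqs/deploy config; the target file names are assumptions):

    [scap]
    git_repo: analytics/aqs/deploy
    # deploy to the canary group before the rest
    server_groups: canary, default
    # hypothetical file listing just the canary host (e.g. aqs1004)
    canary_dsh_targets: targets-canary
    # file listing the remaining hosts
    dsh_targets: targets
    # hosts deployed concurrently within each group; 1 = serial
    batch_size: 1

With batch_size: 1 each group deploys serially; raising it deploys that many hosts per batch, which is how a repo ends up "fully parallelized".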
[13:57:20] elukey: I don't know why I recall scap doing one server at a time :(
[13:58:21] I was kinda convinced of that too, but then I thought that it wouldn't make sense for mediawiki or bigger services
[13:58:29] citoid afaics uses group_size 2
[13:58:36] so batches of two hosts at a time
[13:59:04] (doing all the stages)
[13:59:17] elukey: I was pretty sure scap could deploy in a parallel way, but was convinced it was serial by default
[14:00:12] I think that we'd need the canary too
[14:00:15] we usually do that
[14:00:21] test on aqs1004, then proceed
[14:05:53] (PS3) Elukey: Update AQS scap configuration [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/333905
[14:06:22] joal: --^
[14:06:44] maybe we can debate about group_size: 1 vs group_size: 2
[14:06:52] with the canary I don't see a major problem
[14:07:10] it is great to think about the past deployments though :D
[14:07:27] luckily we used scap-deploy --limit a lot of times
[14:07:33] elukey: correct
[14:07:42] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:07:46] elukey: I think it's because of the --limit that I recall serial deploys :)
[14:08:46] elukey: I think I like the conf existing now (canary + serial :)
[14:09:10] elukey: I wonder about having aqs1004 in both prod and canary - should it be in a single place ?
[14:09:30] (CR) Joal: [C: 1] "Even better ! Still need to be double checked though :)" [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/333905 (owner: Elukey)
[14:09:42] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[14:14:59] joal: good point!
[14:15:36] about the alarm above: I am pretty sure that the script checking the node manager state counts a temporary state returned by yarn as faulty and alarms
[14:15:41] elukey: refinery is deployed with scap and uses stat1002 as canary, if you wish an example
[14:16:10] yes yes you are right
[14:16:13] fixing it
[14:16:55] elukey: about yarn, possible - the cluster seems busy as well, so maybe one node just forgot to answer for a few seconds
[14:16:59] (PS4) Elukey: Update AQS scap configuration [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/333905
[14:21:30] hmm, joal: https://github.com/HotelsDotCom/jasvorno
[14:21:33] iiinteresting
[14:21:58] rooooooh ottomata!!!
[14:22:31] ottomata: webrequests?
[14:23:22] ottomata: let's imagine varnish-kafka uses that lib, how much space would we save on kafka/hdfs?
[14:23:47] i don't think we could do it from varnishkafka really, but perhaps camus could use it
[14:23:50] or whatever we choose
[14:24:47] that looks really really simple, i betcha it would be v easy to create a RecordWriterProvider for it
[14:24:56] OR, i wonder if it would make integration into kafka connect easier
[14:25:12] could be used for eventlogging stuff!
[14:25:13] :o
[14:25:57] especially if something like https://github.com/fge/json-schema-avro actually works
[14:26:53] ottomata: integration into kafka connect, meaning schema server etc built for us! I think I like this idea :)
[14:27:17] also ottomata, thanks for the tranquility thing - This data updated live is so awesome :)
[14:27:49] I think FRTech is going to ask for this in prod ASAP ;)
[14:28:49] we should watch this spark streaming thing for a while before we commit to that :)
[14:28:58] but ya, it looks like it isn't really taking that many resources
[14:29:00] ottomata: for sure we should
[14:29:18] ottomata: so far, 4 tasks of 1Go + 1CPU each
[14:29:26] ottomata: and no delay
[14:29:50] ya
[14:49:34] Analytics, CirrusSearch, Discovery, Discovery-Search: Load cirrussearch data into druid - https://phabricator.wikimedia.org/T156037#2962176 (JAllemandou) Quick data volume checks: - How much data would this dataset represent (# lines, # Go + file format + compression, # fields) - How much variabil...
[14:55:35] joal: Go?
[14:55:42] is that a diff notation for GB?
[14:56:02] ottomata: it's the french version, sorry :)
[14:56:10] aye :)
[14:56:23] ottomata: in french, we say octet for byte
[14:56:55] ottomata: corrected! Thanks
[14:58:57] Analytics, Analytics-Cluster: WSC data in a cube - https://phabricator.wikimedia.org/T76093#2965294 (Ottomata) Open>declined
[15:00:03] elukey: I'm going to try and upgrade deployment-aqs01 to node 6. The way I did it on my machine was to uninstall, purge, and reinstall from instructions on their site: https://nodejs.org/en/download/package-manager/
[15:00:12] as in curl -sL https://deb.nodesource.com/setup_6.x | sudo -E bash -
[15:00:46] is there some other procedure on labs, like a package repository that I don't know about?
[15:01:15] Analytics, Analytics-Cluster, DBA: Purge MobileWebWikiGrok_* and MobileWebWikiGrokError_* rows older than 90 days - https://phabricator.wikimedia.org/T77918#2965310 (Ottomata) Open>declined
[15:01:54] Analytics, Analytics-Cluster: Analyst has a table of Last-Access counts {bear} - https://phabricator.wikimedia.org/T101004#2965314 (Ottomata) Open>declined
[15:02:23] milimetric: nononono :)
[15:02:29] heh, ok
[15:02:34] no worries, it's why I pinged you
[15:03:13] Analytics, Wikimedia-General-or-Unknown: Browser and platform stats for logged-in vs. anon users for security and product support decisions - https://phabricator.wikimedia.org/T58575#2965317 (Ottomata)
[15:03:33] Analytics, Analytics-Cluster: Write wikitech spark tutorial - https://phabricator.wikimedia.org/T93111#2965318 (Ottomata) Open>declined Wikitech has an eventlogging spark tutorial
[15:04:47] milimetric: done :)
[15:05:02] Analytics, Labs, Pageviews-API, wikitech.wikimedia.org: wikitech.wikimedia.org missing from pageviews API - https://phabricator.wikimedia.org/T153821#2965325 (Milimetric) No, we should leave it open and blocked on wikitech being set up properly. We could of course collect pageviews via some othe...
[15:05:05] apt-get update && apt-get install nodejs nodejs-dev nodejs-legacy
[15:05:18] ii nodejs 6.9.1~dfsg-1+wmf1 amd64 evented I/O for V8 javascript
[15:05:21] ii nodejs-dev 6.9.1~dfsg-1+wmf1 amd64 evented I/O for V8 javascript (development files)
[15:05:24] ii nodejs-legacy 6.9.1~dfsg-1+wmf1 all evented I/O for V8 javascript (legacy symlink)
[15:05:41] elukey: oh ok, so these machines are set up already to go against a different package manager
[15:05:55] is that public? Can I set up my box to use it?
[15:05:58] elukey: can you look at this? i can't remember if this is something that affects us. if you fixed it, we can close it
[15:05:59] https://phabricator.wikimedia.org/T71615
[15:06:23] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful - https://phabricator.wikimedia.org/T72085#2965334 (Ottomata)
[15:06:26] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful due to only esams caches causing unknown problems - https://phabricator.wikimedia.org/T74809#2965331 (Ottomata) Open>Resolved a: Ottomata Resolving these, as we don't use them.
[15:06:31] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful due to depooled servers interfering with monitoring - https://phabricator.wikimedia.org/T74649#2965338 (Ottomata) Open>Resolved a: Ottomata Resolving these, as we don't use them.
[15:06:33] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful due to configuration updates - https://phabricator.wikimedia.org/T74300#2965341 (Ottomata)
[15:06:40] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful due to configuration updates - https://phabricator.wikimedia.org/T74300#2965348 (Ottomata)
[15:06:43] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful due to deployments gone wrong - https://phabricator.wikimedia.org/T74299#2965345 (Ottomata) Open>Resolved a: Ottomata Resolving these, as we don't use them.
[15:06:47] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful - https://phabricator.wikimedia.org/T72085#2965354 (Ottomata)
[15:06:51] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful due to configuration updates - https://phabricator.wikimedia.org/T74300#762724 (Ottomata) Open>Resolved a: Ottomata Resolving these, as we don't use them.
[15:06:54] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful - https://phabricator.wikimedia.org/T72085#722441 (Ottomata)
[15:06:54] milimetric: we package node in debs, you can use the wikimedia repositories like https://wikitech.wikimedia.org/w/index.php?title=APT_repository
[15:06:57] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful due to network issues - https://phabricator.wikimedia.org/T74298#2965356 (Ottomata) Open>Resolved a: Ottomata Resolving these, as we don't use them.
[15:07:05] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful - https://phabricator.wikimedia.org/T72085#722441 (Ottomata)
[15:07:07] Analytics, Analytics-Cluster: Kafka partition leader elections causing a drop of a few log lines - https://phabricator.wikimedia.org/T72087#2965362 (Ottomata) Open>Resolved a: Ottomata Resolving these, as we don't use them.
[15:07:11] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful - https://phabricator.wikimedia.org/T72085#722441 (Ottomata) Open>Resolved a: Ottomata Resolving these, as we don't use them.
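As an aside on the apt question above: on a Debian box the Wikimedia repository elukey links can be added roughly like this. A sketch only; the distribution name must match the release ("jessie-wikimedia" is an assumption here), and whether the nodejs packages sit in the "main" component is also an assumption, see the APT repository wikitech page linked above:

    echo 'deb http://apt.wikimedia.org/wikimedia jessie-wikimedia main' | sudo tee /etc/apt/sources.list.d/wikimedia.list
    sudo apt-get update
    sudo apt-get install nodejs nodejs-dev nodejs-legacy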
[15:07:19] ooh, awesome, thanks elukey
[15:07:28] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful - https://phabricator.wikimedia.org/T72085#2965375 (Ottomata)
[15:07:30] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful due to oozie being overwhelmed - https://phabricator.wikimedia.org/T85704#2965372 (Ottomata) Open>Resolved a: Ottomata Resolving these, as we don't use them.
[15:07:43] ottomata: never seen it before :(
[15:08:10] Analytics, Analytics-Cluster, Epic: Epic: Webstats Collector is replaced in Refinery - https://phabricator.wikimedia.org/T70963#2965378 (Ottomata) Open>Resolved a: Ottomata
[15:08:12] hah, elukey yeah, it's something old
[15:08:25] i'm cleaning out some analytics cluster backlog
[15:08:36] but, the ticket is related to sequence numbers vs time bucketing in hive
[15:08:57] ah weird
[15:08:57] i know we worked on that in the spring/summer around the varnish upgrade, just can't remember off the top of my head what we did
[15:09:00] and if it is still a problem
[15:09:26] Analytics, Analytics-Cluster, Epic: Epic: AnalyticsEng has fully dimentionalized Page View counts - https://phabricator.wikimedia.org/T70966#2965396 (Ottomata) Open>Resolved a: Ottomata Good ol' druid.
[15:09:33] so we definitely switched timestamp, from Start to End
[15:09:39] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful - https://phabricator.wikimedia.org/T72085#2965404 (Ottomata)
[15:09:42] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful due to single message being missing for unknown reason - https://phabricator.wikimedia.org/T76977#2965401 (Ottomata) Open>Resolved a: Ottomata Resolving these, as we don't use them.
[15:10:09] and we have also switched monitoring
[15:10:13] thanks to mforns
[15:10:30] elukey, what?
[15:10:34] there were tons of other issues with vk but nothing specific to timestamps other than these IIRC
[15:10:55] mforns: sorry for the extra ping, I was mentioning your work for the sequence number alarms
[15:10:58] :)
[15:11:24] oh! np for the ping, hehehe
[15:11:29] Analytics, Analytics-Cluster: productionizing xmldump -> avro jobs - https://phabricator.wikimedia.org/T78404#2965409 (Ottomata) Open>Invalid Not enough detail, and no real plan to do this as is. There may be other ways of getting revision content into HDFS that we will explore in the future.
[15:11:45] hm, ok, i'll leave it open then and note that
[15:11:46] thanks elukey
[15:12:09] oh actually
[15:12:14] this ticket is exactly what mforns did i think
[15:12:18] fixing the validation/alarms to deal with this
[15:12:35] \o/
[15:12:44] Analytics, Analytics-Cluster: Raw webrequest partitions that were not marked successful - https://phabricator.wikimedia.org/T72085#2965417 (Ottomata)
[15:12:47] Analytics, Analytics-Cluster, Patch-For-Review: Make webrequest partition validation handle races between time and sequence numbers - https://phabricator.wikimedia.org/T71615#2965414 (Ottomata) Open>Resolved a: Ottomata @mforns did some work to make our validation and alarms account for th...
[15:19:22] elukey: ok, aqs works perfectly fine under node 6, I'll assign the upgrade to you
[15:19:53] Analytics, CirrusSearch, Discovery, Discovery-Search: Load cirrussearch data into druid - https://phabricator.wikimedia.org/T156037#2962176 (dcausse) I'll start, for the dimensions I'd like to have: - query_type: single valued string (# of distinct values is around 10) - syntax_used: multivalued...
[15:20:00] Analytics-Kanban, Services (blocked), User-mobrovac: Upgrade AQS to node 6 - https://phabricator.wikimedia.org/T155642#2965427 (Milimetric) a: Milimetric>elukey
[15:20:35] milimetric: sure!
[15:20:36] Analytics-Kanban, Services (blocked), User-mobrovac: Upgrade AQS to node 6 - https://phabricator.wikimedia.org/T155642#2949682 (Milimetric) Tested in beta cluster after @elukey upgraded, AQS works perfectly well on node 6.
[15:21:18] milimetric: I'd like to make sure that basic host metrics like CPU etc.. do not go crazy
[15:21:44] but my plan is to upgrade aqs1004, wait a day, review and complete the work
[15:22:52] sounds reasonable, yeah, I could build some fake load on beta cluster if you want to be more careful before trying it in prod
[15:22:57] let me know
[15:24:14] milimetric: if you have time yes please :)
[15:25:54] ok, sure, hm... I have no comparison with old node 4 though. So I'll just make sure it doesn't blow up under load.
[15:30:57] yeah
[15:41:10] (CR) Mforns: [C: 1] Update oozie job loading pageview in druid [analytics/refinery] - https://gerrit.wikimedia.org/r/333668 (owner: Joal)
[15:44:34] (PS1) Fdans: [wip] Adds map visualizer to Dashiki [analytics/dashiki] - https://gerrit.wikimedia.org/r/333922 (https://phabricator.wikimedia.org/T153921)
[15:47:08] elukey: basic load test looks ok
[15:47:19] used apache bench to throw 2000 requests at it, with different concurrency
[15:47:48] memory and cpu go up normally and come back down, requests get processed ok
[15:48:08] nice :)
[15:48:13] with concurrency 10 longest wait is 1 second, with concurrency 100 longest wait is 13 seconds
[15:48:29] is it known whether pivot as running on thorium is compatible with nodejs 6?
[15:48:39] it's among the systems still on nodejs 4
[15:48:50] moritzm: should be, pivot does very little in node world
[15:49:03] moritzm: I will test locally though and let you know
[15:50:01] ok, thanks :-)
[15:51:06] moritzm: yep, works fine on node 6
[15:52:00] milimetric: shall I upgrade it on thorium right away?
[15:52:16] moritzm: yep, I can test once you do, let you know if anything went wrong
[15:52:22] k, doing so
[15:52:27] thx
[15:52:54] done and restarted, let me know if anything breaks
[15:52:58] checking
[15:53:05] (PS5) Mforns: [WIP] Add banner impressions jobs [analytics/refinery] - https://gerrit.wikimedia.org/r/331794 (https://phabricator.wikimedia.org/T155141)
[15:53:07] Analytics-EventLogging, Analytics-Kanban, Performance-Team, Performance, Regression: EventLogging schema modules take >1s to build (max: 22s) - https://phabricator.wikimedia.org/T150269#2965545 (Ottomata) I just read over the code too, and I agree that locking code looks suspicious. If a bun...
[15:54:05] looks good moritzm, thank you
[15:54:22] great, thanks :-)
[15:55:35] thanks!
[15:56:57] Analytics-Kanban: Move reportcard to dashiki and new datasources - https://phabricator.wikimedia.org/T130117#2965551 (mforns) a: mforns
[16:01:06] fdans: stadduppp
[16:04:34] (CR) Ottomata: [C: 1] "I think group_size: 2 would be ok, no? But no biggy, group_size: 1 works too." [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/333905 (owner: Elukey)
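The quick load test milimetric describes above ([15:47]) can be reproduced with apache bench; a sketch only, where the host, port and pageviews path are placeholders rather than the real beta cluster endpoint:

    # 2000 requests total, 10 concurrent, against an AQS pageviews endpoint
    ab -n 2000 -c 10 'http://<aqs-host>:<port>/analytics.wikimedia.org/v1/pageviews/...'
    # same total at concurrency 100, to see how latency degrades under load
    ab -n 2000 -c 100 'http://<aqs-host>:<port>/analytics.wikimedia.org/v1/pageviews/...'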
[16:18:28] Analytics-Kanban: Improve AQS deployment - https://phabricator.wikimedia.org/T156049#2965615 (Nuria) a: elukey
[16:30:50] fdans: review?
[16:31:01] yeah!
[16:31:25] milimetric: batcave2?
[16:31:37] no, batcave's free
[16:32:16] fdans: ^
[16:32:17] (CR) Mobrovac: [C: -1] "One thing missing, otherwise LGTM" (1 comment) [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/333905 (owner: Elukey)
[16:36:04] (PS5) Elukey: Update AQS scap configuration [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/333905 (https://phabricator.wikimedia.org/T156049)
[16:37:17] (CR) Mobrovac: [C: 1] Update AQS scap configuration [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/333905 (https://phabricator.wikimedia.org/T156049) (owner: Elukey)
[16:41:54] a-team: aqs1004 is serving traffic with node6, will upgrade the rest of the cluster tomorrow if metrics are good
[16:50:00] all good from https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=aqs1004
[16:50:06] and host is serving traffic fine
[16:56:59] (PS13) Joal: Add mediawiki history spark jobs to refinery-job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/325312 (https://phabricator.wikimedia.org/T141548)
[16:58:09] (CR) Joal: "Comments inside. I implemented most changes from comments, please have a look to the ones you're interested in !" (46 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/325312 (https://phabricator.wikimedia.org/T141548) (owner: Joal)
[17:09:33] Anyone seen ottomata this morning?
[17:09:42] "morning" == morning in UTC-6
[17:15:14] halfak: He's been around already a bit, probably at lunch
[17:15:24] Gotcha. Cool.
[17:15:31] I'll hang out :)
[18:00:02] a-team: stafffffff
[18:04:57] FYI, I'll be installing mysql security updates on bohrium/piwik in a few minutes, should have no impact, just FYI
[18:05:22] thanks moritz
[18:06:49] and completed
[18:09:29] Analytics-Kanban, User-Elukey: Ongoing: Give me permissions in LDAP - https://phabricator.wikimedia.org/T150790#2796693 (MelodyKramer) Hello! I am a WMF employee and would like to request access to pivot. My Wikitech username is melodykramer Please let me know if you have any questions! mkramer@wikimed...
[18:18:45] o/ ottomata
[18:18:54] I have a little slide deck for our meeting today. OK?
[18:19:04] I think I could go through it in 5 mins without interruption
[18:19:09] re. ReviewStream ^
[18:19:12] & state tables.
[18:19:33] ya sounds great halfak, we'll do you early/first
[18:19:38] great :)
[19:02:05] (CR) Nuria: [V: 1 C: 1] "Thank you. Will wait for marco's last look to merge" [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/333905 (https://phabricator.wikimedia.org/T156049) (owner: Elukey)
[19:03:13] nuria: --^ Marco +1ed already, I fixed the issue right after he reviewed it.. ready to go :)
[19:03:19] elukey: k
[19:03:26] (CR) Nuria: [V: 2 C: 2] Update AQS scap configuration [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/333905 (https://phabricator.wikimedia.org/T156049) (owner: Elukey)
[19:04:09] milimetric: slides look real good
[19:04:13] milimetric: point well made
[19:04:26] good, thank you for checking them out
[19:04:32] milimetric: only one change, pivot slide is too low res i think. I can redo if you tell me teh url
[19:04:34] *the
[19:04:39] you wnat to display
[19:04:41] *want
[19:04:55] hm, bummer, looked good on mine
[19:05:03] (getting URL)
[19:06:00] going afk people, byeee o/
[19:06:06] bye elukey
[19:07:04] btw nuria the issue with karma is sorted out, no need to take a look at it :)
[19:07:47] fdans: great, thank you.
[19:41:34] wikimedia/mediawiki-extensions-EventLogging#631 (wmf/1.29.0-wmf.9 - 04e3fe4 : Translation updater bot): The build has errored.
[19:41:34] Change view : https://github.com/wikimedia/mediawiki-extensions-EventLogging/compare/wmf/1.29.0-wmf.9
[19:41:34] Build details : https://travis-ci.org/wikimedia/mediawiki-extensions-EventLogging/builds/194938409
[19:57:04] Hmm, I was trying to run a hive query, but decided to kill it. Got the following while attempting to: https://phabricator.wikimedia.org/P4799
[19:57:25] (CR) Nuria: [V: 2 C: 2] Update oozie job loading pageview in druid [analytics/refinery] - https://gerrit.wikimedia.org/r/333668 (owner: Joal)
[19:58:12] (beehive, I guess, but shouldn't matter for this bit)
[19:58:21] Analytics-Kanban: Follow naming convention on druid jobs: ts for long unix timestamps, dt for ISO. - https://phabricator.wikimedia.org/T156170#2966359 (Nuria)
[19:59:40] (CR) Nuria: "I think this needs a ticket so we be on the lookout about these changes on our next cluster deployment. I have created one. Please be so k" [analytics/refinery] - https://gerrit.wikimedia.org/r/333668 (owner: Joal)
[20:13:28] (PS2) Joal: Update oozie job loading pageview in druid [analytics/refinery] - https://gerrit.wikimedia.org/r/333668 (https://phabricator.wikimedia.org/T156170)
[20:14:00] (CR) Joal: "Done @nuria :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/333668 (https://phabricator.wikimedia.org/T156170) (owner: Joal)
[20:14:48] ostriches:
[20:14:49] https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Queries#Killing_a_running_query
[20:14:56] you want to kill the yarn application, not the mapred job
[20:15:32] A-ha!
[20:15:34] Thanks :)
[20:15:41] (was just following what beehive told me :))
[20:15:50] beehive tells you to do that?
[20:15:57] (i haven't used beehive that much...)
[20:16:09] INFO : Starting Job = job_1480065021448_204598, Tracking URL = http://analytics1001.eqiad.wmnet:8088/proxy/application_1480065021448_204598/
[20:16:09] INFO : Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1480065021448_204598
[20:16:14] When starting a job
[20:16:39] Then when trying to run that hadoop bit, you get the piece I pastebin'd
[20:17:00] Gone for now a-team, see you tomorrow :)
[21:25:52] PROBLEM - Hadoop HistoryServer on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer
[21:27:05] hm!
[21:27:50] java.lang.OutOfMemoryError: Java heap space
[21:27:55] never had that with history server before!
[21:28:07] Analytics, Analytics-General-or-Unknown: Number of Wikipedia Zero increasing drastically in mid March 2014 - https://phabricator.wikimedia.org/T64848#2966767 (scfc) Is someone still looking into this?
[21:28:52] RECOVERY - Hadoop HistoryServer on analytics1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer
[21:30:07] !log restarted hadoop-mapreduce-historyserver on analytics1001. it died due to OOM
[21:30:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
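Tying together the exchange at [20:14] above: the job id that hive prints maps directly to a YARN application id, and the kill goes through the yarn CLI rather than the hadoop job command beehive suggests. A sketch using the id from ostriches' paste:

    # list running applications to find yours
    yarn application -list
    # job_1480065021448_204598 corresponds to this application id
    yarn application -kill application_1480065021448_204598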
[21:32:34] Analytics, Analytics-General-or-Unknown: Number of Wikipedia Zero increasing drastically in mid March 2014 - https://phabricator.wikimedia.org/T64848#2966797 (DFoy) Can we identify which partner (X-CS) is responsible for the increase at that time? I can look into more details once I have that information.
[21:45:10] Analytics-Kanban, Fundraising-Backlog, Patch-For-Review: Productionize banner impressions druid/pivot dataset - https://phabricator.wikimedia.org/T155141#2966828 (mforns) @AndyRussG What are your thoughts about @JAllemandou 's comments on the patch? Would a minutely resolution be interesting for you?...
[21:56:55] Analytics-Kanban, Patch-For-Review: Follow naming convention on druid jobs: ts for long unix timestamps, dt for ISO. - https://phabricator.wikimedia.org/T156170#2966359 (Nuria)
[21:57:16] (CR) Nuria: [V: 2 C: 2] Update oozie job loading pageview in druid [analytics/refinery] - https://gerrit.wikimedia.org/r/333668 (https://phabricator.wikimedia.org/T156170) (owner: Joal)
[21:58:15] Analytics-Kanban, Patch-For-Review: Follow naming convention on druid jobs: ts for long unix timestamps, dt for ISO. - https://phabricator.wikimedia.org/T156170#2966359 (Nuria) a: JAllemandou
[21:58:37] Analytics-Kanban, EventBus, Wikimedia-Stream, Patch-For-Review: Set charset=utf-8 in Content-Type response header from sse.js client - https://phabricator.wikimedia.org/T154328#2966900 (Nuria) Open>Resolved
[21:58:40] Analytics-Kanban, EventBus, Wikimedia-Stream, Services (watching), User-mobrovac: Public Event Streams - https://phabricator.wikimedia.org/T130651#2966901 (Nuria)
[21:59:03] Analytics-Kanban: Replacing standard edit metrics in dashiki with data from new edit data depot - https://phabricator.wikimedia.org/T143924#2966904 (Nuria)
[21:59:06] Analytics-Kanban: Run Standard metrics on denormalized history and compare with wikistats - https://phabricator.wikimedia.org/T150023#2966903 (Nuria) Open>Resolved
[21:59:20] Analytics-Kanban, Continuous-Integration-Config, EventBus, Release-Engineering-Team, Wikimedia-Stream: Improve tests for KafkaSSE - https://phabricator.wikimedia.org/T150436#2966905 (Nuria) Open>Resolved
[21:59:24] Analytics-Kanban, EventBus, Wikimedia-Stream, Services (watching), User-mobrovac: Public Event Streams - https://phabricator.wikimedia.org/T130651#2537741 (Nuria)
[22:00:25] hey! is anyone from analytics able to help me troubleshoot an error from running a hive query? http://pastebin.com/SQE1skgY I was able to successfully run this query over the weekend, but now after running for ~4 hours, i get an error. my guess is that it's either a temporary out of memory error on one of the reducers or some inconsistencies in the webrequest table
[22:01:04] zareen, try:
[22:01:20] yarn application -logs application_1480065021448_203470
[22:02:36] Analytics, ChangeProp, Edit-Review-Improvements-ReviewStream, EventBus, and 4 others: Set up the foundation for the ReviewStream feed - https://phabricator.wikimedia.org/T143743#2966929 (Ottomata) Today we had a ReviewStream meeting. We had originally planned to talk about how the 'review-stream...
[22:03:33] ottomata: thanks for helping me out. here's what i get when i try that: http://pastebin.com/YvU5VDvg
[22:04:58] durrrr
[22:05:14] OH
[22:05:17] backwards sorry
[22:05:30] yarn logs -applicationId application_1480065021448_203470
[22:05:33] that zareen^
[22:06:53] ottomata, trying that now.
[22:07:14] you'll probably get a lot of output
[22:07:30] yup
[22:14:15] ottomata, it's still going
[22:15:02] whoa, yikes
[22:20:23] OH
[22:20:25] zareen!
[22:20:25] i think i know
[22:20:28] https://yarn.wikimedia.org/jobhistory/attempts/job_1480065021448_203470/m/FAILED
[22:20:38] org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:251) ... 11 more Caused by: java.io.FileNotFoundException: File does not exist: hdfs://analytics-hadoop/wmf/data/wmf/webrequest/webrequest_source=text/year=2016/month=11/day=23/hour=12/000058_0
[22:20:47] year=2016/month=11/day=23/hour=12
[22:20:49] is old!
[22:20:51] probably
[22:20:59] the script that deletes old data deleted the file out from under you
[22:20:59] ?
[22:21:21] i'm not sure if that is what caused your entire job to die, but it at least failed some mappers
[22:21:25] oh, it's purged since it's past 60 days?
[22:21:38] ya
[22:23:41] in my query, i don't specify the days/hours to filter by so i assumed it would just give me results for what is in the table and not cause the job to fail
[22:24:02] ottomata, zareen : Heya
[22:24:10] indeed ottomata your analysis is correct
[22:24:34] ja zareen, i think we don't have many people querying the full dataset
[22:24:58] it happened before: during a long running job, hive lists the files it works on at the beginning of the job, and if they get deleted in the middle of it, the entire job fails
[22:25:05] zareen: --
[22:25:16] ah, i see
[22:25:21] joal any idea how to get around this?
[22:25:29] avoid querying the whole dataset?
[22:25:37] where month != 11
[22:25:38] :p
[22:25:38] ?
[22:26:07] i guess i can specify days and only include full days where there would be data
[22:26:13] zareen: we have 62 days of data before deletion
[22:26:26] i'll try that :) weird though, i've run that query before with no issues
[22:26:57] zareen: from your query that has failed: ((year = 2016 AND (month = 12 OR month = 11)) OR (year = 2017))
[22:27:06] the delete job only runs occasionally, so your job and it would have to coincide
[22:27:30] just update it to: ((year = 2016 AND month = 12) OR (year = 2017))
[22:27:42] np zareen :)
[22:27:46] back to sleep mode :)
[22:27:55] ah, okay!
[22:31:41] last thing zareen: when you do requests like that over the full webrequest text partition, you read and scan 50Tb of data - please be so kind as to do them wisely, this is a lot of resources :)
[22:33:07] joal, will keep that in mind!
[23:28:27] zareen: that query seems like a good candidate to sample data instead of computing everything.
[23:28:40] zareen: here's an example of using TABLESAMPLE: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Counting_uniques#Hive
[23:40:08] Analytics, Analytics-EventLogging, ArchCom-RfC, Discovery, and 10 others: RFC: Use YAML instead of JSON for structured on-wiki content - https://phabricator.wikimedia.org/T147158#2967302 (Mholloway)
[23:43:54] Analytics, Analytics-EventLogging, Wikipedia-iOS-App-Backlog, iOS-app-Bugs: EL unable to decode mobile events due to appinstallid - https://phabricator.wikimedia.org/T96940#2967325 (Mholloway) This is pretty old. Is it still happening or can we close it? Tagging iOS since the events in the desc...
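Putting nuria's sampling suggestion next to joal's partition-pruning advice, a query over webrequest might combine both. A sketch only: the columns and the 1-in-256 sampling ratio are illustrative, and the date predicate must stay inside the ~62-day retention window:

    SELECT uri_host, COUNT(*) AS requests
    FROM wmf.webrequest TABLESAMPLE (BUCKET 1 OUT OF 256 ON rand()) w
    -- prune partitions so the purge job cannot delete files out from under the query
    WHERE webrequest_source = 'text'
      AND ((year = 2016 AND month = 12) OR year = 2017)
    GROUP BY uri_host;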