[08:24:10] o/
[08:24:19] I am working on memcached atm but if you need me ping me :)
[08:24:55] joal: aqs1006 is now running debian buuut only one raid 0 is (sort of) good (5.4T), the other is 180GB :P
[09:37:57] (PS1) Joal: Update oozie load job and error emails utility [analytics/refinery] - https://gerrit.wikimedia.org/r/288895 (https://phabricator.wikimedia.org/T134876)
[09:51:51] Analytics-Kanban, Operations, ops-eqiad, Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2296612 (elukey) Tried to re-install Debian on aqs1006 and I was able to boot correctly, but indeed the receipe is not doing what I need: ``` root@aqs1006:~# cat /...
[10:09:50] Analytics-Kanban: Remove firewsll blocking Spark on stat1004 - https://phabricator.wikimedia.org/T135369#2296653 (JAllemandou)
[10:12:21] joal: quick question about https://gerrit.wikimedia.org/r/#/c/288895/1/oozie/util/send_error_email/workflow.xml - can we have the concatenation of URL + parent-id?
[10:13:16] I was wondering if it would be useful to have something like - parent URL: https://hue.wikimedia.org/oozie/list_oozie_workflows/parent-id
[10:36:12] Analytics-Kanban, Operations, ops-eqiad, Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2296700 (elukey) Had a chat with @Volans and after seeing what fdisk shows it the partman recipe looks wrong. Each disk has the following layout: ``` Device Bo...
[10:48:34] * elukey lunch!
[11:31:40] ah just realized that Jo is on public holiday :)
[11:55:27] * elukey finally got an important partman lesson
[11:56:06] # Specify the disks to be partitioned. They will all get the same layout,
[11:56:09] # so this will only work if the disks are the same size.
[13:36:08] Analytics-Kanban, Operations, ops-eqiad, Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2297061 (elukey) New recipe: - RAID10 between 8 disks, 10GB partitions (~40GB in total) - RAID0 between 4 disks, 5.7TB total - RAID0 between 4 disks, 5.7 TB total...
[13:47:08] Analytics: Pageview API: Limit (and document) size of data you can request - https://phabricator.wikimedia.org/T134524#2297114 (JAllemandou) @GWicke : Since I was quite convinced by your idea, I did some more detailed analysis with the schemed you suggested. **Setup:** For the month of march, I took every r...
[13:48:15] elukey: Thanks for having sorted that partman stuff !
[13:48:31] I'll try to remember your lesson learnt if I ever have to use it !
[13:48:46] elukey: any idea why the system has issues?
[13:51:47] joal! o/
[13:52:06] what issues?
[13:52:17] I am missing some details probably
[14:27:27] good morningggggg
[14:29:18] ottomata: o/
[14:32:35] elukey: that 8 disk raid 10 thing looks fine
[14:32:35] :)
[14:32:37] does that work?
[14:34:09] yeppa!
[14:34:43] it is the only way that partman can work, I have finally made peace with myself
[14:34:55] each disk needs to be symmetric to the others
[14:35:29] at least now I can read a partman recipe and hope to get something out of it
[14:35:35] :D
[14:37:26] ottomata: I know you were right about partman but I needed to figure out why, it was bugging me too much :D
[14:37:28] an noble goal accomplished
[14:37:34] no its good!
[14:37:34] haha
[14:37:39] yeah, good to understand
[14:40:05] Analytics-Kanban, Operations, ops-eqiad, Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2297242 (elukey) Installed successfully aqs 1004/5, but 1005 fails with: ``` Loading Linux 4.4.0-1-amd64 ... Loading initial ramdisk ... [ 0.113680] [Firmware B...
[14:41:23] joal: hm, i don't think base firewall should be on stat1004
[14:41:31] but, i'm trying stat1002 again now, and it seems to just hang too
[14:41:32] hmmm
[14:42:44] hm, elukey can we remove statistics::migration?
[14:43:29] ottomata: sure, but I wanted to put a regular backup later on during the quarter.. I guess that we can revise it
[14:43:37] I mean, we can drop it for the moment
[14:45:48] Analytics-Kanban: Remove firewsll blocking Spark on stat1004 - https://phabricator.wikimedia.org/T135369#2297281 (Ottomata) Weird, base::firewall shouldn't be applied here. I just removed `/etc/ferm/conf.d/00_main.conf` and reloaded ferm. I think it works now.
[14:46:47] this one is weird
[14:46:55] leftover?
[14:50:37] * elukey afk for 30 mins
[15:02:34] ottomata: tried a few times from stat1004 and spark never started :(
[15:02:41] joal: now?
[15:02:42] ottomata: however works fine on stat1002
[15:02:42] or before?
[15:02:56] today, last week, maybe the week before (can't recall)?
[15:03:06] today, last week for sure at least
[15:03:14] [ 0.113667] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)
[15:03:17] Loading, please wait...
[15:03:20] mdadm: No devices listed in conf file were found.
[15:03:20] * elukey cries
[15:03:39] * joal hugs elukey
[15:04:53] ok, i just changed something it on stat1004
[15:04:55] joal: can you try again
[15:04:58] elukey: daewww
[15:05:51] that was aqs1005, 1004/6 installed correctly.. it seems a hp bug
[15:06:00] need to talk with rob :D
[15:06:29] (CR) Ottomata: [C: 2 V: 2] "NICE!!!!" (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/288895 (https://phabricator.wikimedia.org/T134876) (owner: Joal)
[15:08:48] ottomata: works !
[15:08:51] ottomata: Thanks :)
[15:11:20] ottomata: Not sure it was on purpose, but your comment made me catch a bug !!!!
[15:13:01] joal: didn't see https://gerrit.wikimedia.org/r/#/c/288895/1/oozie/webrequest/load/workflow.xml before making my comments today, sorry :)
[15:13:19] (PS1) Joal: Correct bug in load job error email sending [analytics/refinery] - https://gerrit.wikimedia.org/r/288959
[15:13:32] np :)
[15:13:57] (CR) Joal: [C: 2 V: 2] "self merging bug." [analytics/refinery] - https://gerrit.wikimedia.org/r/288959 (owner: Joal)
[15:14:50] ottomata: Now that you see that version of the code, I'd have liked to be able to use a function an reuse the parent_id parameter instead of having to define a new hue_url parameter
[15:15:16] I tried a lot of different things, but the functions don't get interpreted :(
[15:15:25] So here we are :)
[15:16:12] a function?
[15:18:00] (Abandoned) Joal: [WIP] Add spark utility to parse wikidumps [analytics/refinery/source] - https://gerrit.wikimedia.org/r/205277 (owner: Joal)
[15:19:48] (CR) Joal: "Merging to prepare tomorrow's deploy." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/288458 (https://phabricator.wikimedia.org/T135168) (owner: Nuria)
[15:20:01] (CR) Joal: [C: 2 V: 2] "Merging to prepare tomorrow's deploy." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/288458 (https://phabricator.wikimedia.org/T135168) (owner: Nuria)
[15:25:53] (PS1) Joal: Update changelog.md to deploy v0.0.31 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/288966
[15:28:02] ottomata: there are very small differences in URLs between the one to list all workflows and the one to access one by ID, so I wanted not to use a parameter but use oozie EL possibilities : ${replaceAll(blah, blah)} (withtout success though)
[15:28:33] ahhh
[15:28:34] i see
[15:28:48] ah, hm, ah well it is probably fine
[15:29:01] I mean, I spent too much on that already, so KISS
[15:29:18] ottomata: If you don't mind, CR for changelog.md ?
[15:31:18] (CR) Ottomata: [C: 2 V: 2] Update changelog.md to deploy v0.0.31 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/288966 (owner: Joal)
[15:31:20] done
[15:40:35] thanks ottomata
[15:49:39] Analytics-Cluster, Analytics-Kanban, Operations, Patch-For-Review: Spark yarn in client mode is never moved from ACCEPTED to RUNNING - https://phabricator.wikimedia.org/T134422#2297601 (Nuria) Open>Resolved
[15:50:25] Analytics-Kanban: Submit an issue and a PR to fix pageview.js bugs - https://phabricator.wikimedia.org/T133400#2297602 (Nuria) Open>Resolved
[15:51:03] Analytics-Kanban: Test cassandra compactions on new AQS nodes - https://phabricator.wikimedia.org/T135145#2297605 (elukey) p:Triage>High a:elukey
[15:51:23] Analytics-Kanban: Test cassandra compactions on new AQS nodes - https://phabricator.wikimedia.org/T135145#2289668 (elukey) https://gerrit.wikimedia.org/r/#/c/288373/
[15:52:42] Analytics: Pageview API: Limit (and document) size of data you can request - https://phabricator.wikimedia.org/T134524#2297612 (GWicke) > The increased number of requests and the very small gain in hit ratio makes me think it's actually not worth (or at least now, maybe later if usage evolves). Fair enough....
[16:00:21] a-team: standduppp
[16:00:41] cc madhuvishy joal
[16:01:09] nuria_: trying to join
[16:01:37] AHHH
[16:06:09] OH i have to keep going with the kafka broker stuff, elukey ! duh!
[16:06:12] will do some of that now i thikn..
[16:06:12] fyi
[16:06:19] gonna restart the next one
[16:06:36] ottomata: sure! otherwise we can do it after the ops meeting?
[16:06:41] i'm in ops meeting now
[16:06:47] yep me too :)
[16:06:48] its mostly waiting so, i'll just do it one by one
[16:06:49] will log
[16:06:57] all right :)
[16:14:36] fyi a-team i will be doing the rest of the kafka broker bounces today to finish up the upgrade
[16:14:47] i'll try to keep the eventlogging alerts to a minimum
[16:14:51] Great ottomata !
[16:14:56] but i will have to restart eventlogging as part of this each time
[16:14:57] :/
[16:30:10] ottomata: statsv seems to have issues, already checked? Otherwise I'll do it
[16:30:27] (PS1) Joal: Add jars v0.0.31 to artefacts [analytics/refinery] - https://gerrit.wikimedia.org/r/288982
[16:30:31] logging in to batcave
[16:30:46] trying that is
[16:31:46] elukey: haven't checked hm
[16:31:57] ottomata: recovered now
[16:32:06] did you just bounce it?
[16:32:20] yep but it recovered before my action (I think)
[16:32:23] ok
[16:32:24] hm
[16:33:52] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [30.0]
[16:34:37] hello event logging!
[16:34:42] we missed you
[16:35:31] Analytics: Investigate requests flagged as pageview in analytics header coming from bots - https://phabricator.wikimedia.org/T135251#2297813 (JAllemandou) Weird thing: Some requests having the `pageview=1 ` x_analytics header flag are coming from bots (googlebot, from what we saw).
[16:35:37] Analytics, Pageviews-API, Wikidata: "egranary digital library system" UA should be listed as a spider - https://phabricator.wikimedia.org/T135164#2290291 (Nuria) .Please try to notify owner of UA policy. If they add the word "bot" to UA this would automatically be marked as spider.
[16:35:59] hehe
[16:40:10] Analytics, Editing-Analysis: Move contents of ee-dashboards to edit-analysis.wmflabs.org - https://phabricator.wikimedia.org/T135174#2290688 (Nuria) Let's try to work with neil on this, changes are easy enough that can be done by anyone.
[16:40:21] (CR) Ottomata: [C: 2] Add jars v0.0.31 to artefacts [analytics/refinery] - https://gerrit.wikimedia.org/r/288982 (owner: Joal)
[16:40:33] joal: +2 merge away
[16:43:28] thanks ottomata !
[16:43:33] Analytics, Pageviews-API: Invalid API input returns 404 instead of 500 or 400 - https://phabricator.wikimedia.org/T134964#2283895 (Nuria) The problem here is how to decide whether this is a project for which we count pageviews (we do not count pageviews yet for ALL projects) Now, one thing to do here w...
[16:44:37] Analytics, Pageviews-API: Invalid API input returns 404 instead of 500 or 400 - https://phabricator.wikimedia.org/T134964#2297874 (Nuria)
[16:45:14] Analytics, DBA, Editing-Analysis, Patch-For-Review: Reportupdater does not commit changes after each query - https://phabricator.wikimedia.org/T134950#2297878 (Nuria) Open>Resolved
[16:45:42] Analytics, Pageviews-API: API incorrectly complains about missing data instead of wrong wiki name - https://phabricator.wikimedia.org/T134926#2297879 (Nuria)
[16:45:44] Analytics, Pageviews-API: Improve pageviews error messages on invalid project - https://phabricator.wikimedia.org/T129899#2119381 (Nuria)
[16:48:55] Analytics, Community-Tech, Pageviews-API, Tool-Labs-tools-Other, and 2 others: Pageview Stats tool - https://phabricator.wikimedia.org/T120497#2297886 (Nuria) Seems that this is already done: https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&start=2016-...
[16:49:14] Analytics, Community-Tech, Pageviews-API, Tool-Labs-tools-Other, and 2 others: Pageview Stats tool - https://phabricator.wikimedia.org/T120497#2297887 (Nuria) Open>Resolved
[16:52:21] Analytics, Editing-Analysis, Notifications, Collab-Team-2016-Apr-Jun-Q4: Numerous Notification Tracking Graphs Stopped Working at End of 2015 - https://phabricator.wikimedia.org/T132116#2297911 (Nuria) @jmatazzoni : if you are interested on this data you can help us migrate the scripts that harv...
[17:04:02] milimetric: joal edit data meeting?
[17:04:19] we're doing it in batcve
[17:04:24] ohh
[17:04:27] doh
[17:04:30] duh
[17:09:03] Analytics-Kanban, Operations, ops-eqiad, Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2297943 (elukey) Really weird, after rebooting a couple of times: ``` Loading Linux 4.4.0-1-amd64 ... Loading initial ramdisk ... [ 0.113896] [Firmware Bug]: th...
[17:17:46] ottomata: I am about to logoff, but if you need me I can stay more.. I checked grafana and all looks good!
[17:18:52] cool! tahnks elukey! all is well
[17:19:07] i'm going to merge the patch to remove the kafka 08 conditional in a bit, but that should be a no-op
[17:25:02] (PS1) Joal: Bump webrequest load job jar and record versions [analytics/refinery] - https://gerrit.wikimedia.org/r/288988
[17:25:53] (CR) Joal: [C: 2 V: 2] "Self merging for deploy." [analytics/refinery] - https://gerrit.wikimedia.org/r/288982 (owner: Joal)
[17:29:15] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 20.00% above the threshold [20.0]
[17:31:16] ottomata: last CR for today (like that I dpeloy :) please :)
[17:33:05] (CR) Ottomata: [C: 2 V: 2] Bump webrequest load job jar and record versions [analytics/refinery] - https://gerrit.wikimedia.org/r/288988 (owner: Joal)
[17:33:06] merged joal :
[17:33:08] )
[17:33:54] Thanks mate !
[17:34:20] !log deploying refinery from tin
[17:39:46] joal: you should be on vacation!
[17:39:49] :)
[17:39:55] Analytics-Cluster, Analytics-Kanban, Operations, Traffic, Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#1881753 (Ottomata) AAAnnnd we are done!
[17:40:02] anyhow, aqs1005 now boots \o/
[17:40:20] elukey: I'm never on vacation when other French people are ;)
[17:40:28] elukey: you ROCK !
[17:40:44] ahhh okok didn't know that :)
[17:40:49] !log Deploying refinery on hdfs
[17:41:08] tomorrow I'll deploy aqs, finger crossed :)
[17:41:10] byyyeeeee
[17:42:34] Analytics-Cluster, Analytics-Kanban, Operations, Traffic, Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2298054 (Ottomata) HMm, @elukey let's keep an eye on Broker Log Size: https://grafana.wikimedia.org/dashboard/db/kafka?pa...
[17:42:36] elukey : bye !
[17:56:36] mforns: let me know if you need help testing dashiki
[17:56:49] nuria_, yes will do
[18:27:08] Analytics, Pageviews-API: Add support for outreachwiki to pageviews API - https://phabricator.wikimedia.org/T132313#2298350 (MusikAnimal)
[18:41:03] mforns, milimetric want to give me feedback about couple mocks?
[18:41:15] on batcave?
[18:41:21] nuria_, sure omw
[18:41:48] brt
[19:12:41] Analytics-Kanban, Patch-For-Review: Pageview definition bug for apps pageviews on rest endpoint - https://phabricator.wikimedia.org/T135168#2298511 (JAllemandou) a:Nuria
[19:20:37] Analytics: Pageview API: Limit (and document) size of data you can request - https://phabricator.wikimedia.org/T134524#2298573 (GWicke) It's also worth keeping in mind that traffic levels are not static, and current levels are fairly low. At double the traffic, these numbers will quickly look very different....
[19:31:36] Analytics: Pageview API: Limit (and document) size of data you can request - https://phabricator.wikimedia.org/T134524#2298654 (JAllemandou) @GWicke : Thanks for you comments, they definitely helped in shaping the future :)
[19:37:46] (PS1) Milimetric: Moves other beta feature reports where they belong [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/289004 (https://phabricator.wikimedia.org/T126549)
[19:39:01] (PS1) Milimetric: Create flow-beta-features directory [analytics/limn-flow-data] - https://gerrit.wikimedia.org/r/289005 (https://phabricator.wikimedia.org/T126549)
[19:39:28] (PS1) Milimetric: Create ee-beta-features directory [analytics/limn-ee-data] - https://gerrit.wikimedia.org/r/289006 (https://phabricator.wikimedia.org/T126549)
[19:42:29] (PS2) Milimetric: Create ee-beta-features directory [analytics/limn-ee-data] - https://gerrit.wikimedia.org/r/289006 (https://phabricator.wikimedia.org/T126549)
[19:42:48] (PS2) Milimetric: Moves other beta feature reports where they belong [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/289004 (https://phabricator.wikimedia.org/T126549)
[19:43:02] (PS2) Milimetric: Create flow-beta-features directory [analytics/limn-flow-data] - https://gerrit.wikimedia.org/r/289005 (https://phabricator.wikimedia.org/T126549)
[19:44:44] ok mforns, I think that should do it for the limn-data cleanup stuff
[19:44:52] I added you so you can review but there's no rush
[19:45:01] we just have to coordinate merging puppet with config whenever
[19:45:17] a-team: did anyone start on the rate limiting feature or shall I?
[19:45:41] milimetric: Heya, didn't start but read a bit
[19:45:45] milimetric, thanks, I will look into it
[19:46:16] milimetric: let me know if you want to exchange before starting
[19:46:17] joal: you want me to wait until tomorrow and you can debrief me?
I can read up more on druid and think about SCD
[19:46:27] if you're still working tonight, sure, but didn't wanna keep you
[19:46:48] milimetric: as you prefer, the thing sounded pretty straightforward
[19:47:04] to me too. you wanna chat in the batcave?
[19:47:06] milimetric: let's batcave for a few minutes :)
[19:49:59] Analytics-Kanban: Enable rate limiting on pageview api - https://phabricator.wikimedia.org/T135240#2298831 (Milimetric) a:Milimetric
[20:54:57] Analytics-Kanban: Enable rate limiting on pageview api - https://phabricator.wikimedia.org/T135240#2299038 (Milimetric) Some more useful links from talking to Gabriel: passing headers on from the front-end restbase: https://github.com/wikimedia/restbase/blob/master/v1/content.yaml#L252 a rate-limit filter:...
[20:58:24] Analytics-Kanban: Enable rate limiting on pageview api - https://phabricator.wikimedia.org/T135240#2299055 (Milimetric) The PR for this change: https://github.com/wikimedia/restbase/pull/614
[20:58:44] Analytics-Kanban: Enable rate limiting on pageview api - https://phabricator.wikimedia.org/T135240#2292830 (Milimetric) p:Triage>Normal
[20:59:00] Analytics-Kanban: Spike - Slowly Changing Dimensions on Druid - https://phabricator.wikimedia.org/T134792#2299058 (Milimetric) a:Milimetric
[21:03:58] (PS1) Nuria: Testing push to analytics.wikimedia.org through gerrit [analytics/analytics.wikimedia.org] - https://gerrit.wikimedia.org/r/289062
[21:04:21] Analytics-Tech-community-metrics, Developer-Relations, Community-Tech-Sprint: Investigation: Can we find a new search API for CorenSearchBot and Copyvio Detector tool? - https://phabricator.wikimedia.org/T125459#2299064 (kaldari) I sent an email to our contacts at Microsoft asking about whether or no...
[21:08:05] (PS2) Nuria: Testing push to analytics.wikimedia.org through gerrit [analytics/analytics.wikimedia.org] - https://gerrit.wikimedia.org/r/289062 (https://phabricator.wikimedia.org/T134506)
[21:10:23] milimetric: getting there little by little: https://www.dropbox.com/s/d82s750l46lmd1w/Screen%20Shot%202016-05-16%20at%202.09.16%20PM.png?dl=0
[21:10:53] milimetric: i think ii should be done tomorrow cc mforns
[21:23:49] Analytics, Pageviews-API: Invalid API input returns 404 instead of 500 or 400 - https://phabricator.wikimedia.org/T134964#2299106 (Nemo_bis) I'm confused by the merge. I don't care about the error message, only about the status code.
[21:35:18] nuria_, cooool!
[22:15:28] (PS10) Mforns: [WIP] Fix unique devices bugs [analytics/dashiki] - https://gerrit.wikimedia.org/r/288104 (https://phabricator.wikimedia.org/T122533)
[22:17:26] Analytics-Tech-community-metrics, Developer-Relations, Community-Tech-Sprint: Investigation: Can we find a new search API for CorenSearchBot and Copyvio Detector tool? - https://phabricator.wikimedia.org/T125459#2299256 (kaldari) Reply from Microsoft: "We don’t allow sub-syndication of our API, so un...
[22:23:28] (PS11) Mforns: [WIP] Fix unique devices bugs [analytics/dashiki] - https://gerrit.wikimedia.org/r/288104 (https://phabricator.wikimedia.org/T122533)
[22:34:00] (CR) Mforns: [WIP] Fix unique devices bugs (10 comments) [analytics/dashiki] - https://gerrit.wikimedia.org/r/288104 (https://phabricator.wikimedia.org/T122533) (owner: Mforns)
[22:35:16] nuria_, milimetric ^ I think this can be reviewed
[22:35:29] good night team! see you tomorrow :]
[22:51:38] Analytics, Editing-Analysis, Notifications, Collab-Team-2016-Apr-Jun-Q4: Numerous Notification Tracking Graphs Stopped Working at End of 2015 - https://phabricator.wikimedia.org/T132116#2299351 (jmatazzoni) @Nuria writes: > if you are interested on this data you can help us migrate the scripts...