[00:36:38] Analytics-Kanban: Missing pageviews dumps files for October 1 - 26 - https://phabricator.wikimedia.org/T146029#2654588 (Milimetric) This is now done. The rsync still has to happen so the files won't show up for another hour or so. I'll check on it in the morning. [00:40:44] Analytics-Kanban: Make top pages for WP:MED articles - https://phabricator.wikimedia.org/T139324#2654592 (Milimetric) Well, so this is a one-off query. With it, I ran numbers for July 2016 and August 2016, for all 33K pages in WPMED. That's what the milimetric.wikiproject_medicine_page_counts table contain... [01:58:28] Analytics-Kanban: Make top pages for WP:MED articles - https://phabricator.wikimedia.org/T139324#2654742 (Doc_James) Per "That's what the milimetric.wikiproject_medicine_page_counts table contains right now." were do I find this table? [06:25:22] !log removed aqs100[123] from live traffic [06:25:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [06:25:29] elukey: YAY ! [06:25:59] * joal is going to have a closer look to charts this morning [06:27:21] elukey: cassandra restart yesterday had the usual effect on latency :) [06:28:00] joal: I restarted only aqs100[56] and they were not taking traffic, so I think the latency improvements are the new hosts this time [06:28:13] elukey: Ahhhhh [06:28:20] elukey: That would make sense :) [06:28:27] :) [06:28:32] the only weird thing is https://grafana.wikimedia.org/dashboard/db/aqs-elukey?panelId=14&fullscreen [06:28:45] but probably it is a temporary thing [06:28:47] elukey: The drop in latency is impressive [06:29:02] elukey: Must be related to LVS change [06:29:07] Let's monitor [06:29:08] yeah [06:29:25] but the switch has happened! 
\o/ [06:29:28] elukey: Already back to normal [06:29:58] elukey: we're not far from having the right to have hangout beer session ;) [06:30:44] joal: yessssssss [06:31:45] I also seen the oozie emails :/ [06:31:53] will try to double check later on with traffic [06:32:02] elukey: if latency stays the same, looks like a crzy win :) [06:32:12] https://gerrit.wikimedia.org/r/#/c/311415/6 is +1ed so I'll try to schedule a rollout [06:32:22] yeah it seems crazy :D [06:35:05] P99 is a bit bumpy [06:35:09] for some reason [06:35:36] hm [06:36:55] elukey: let's give it a few minutes to stabilize :) [06:37:08] yeah, mean and p75 are really crazy :D [06:37:23] brb in a bit! [07:11:39] I used tcpdump to check traffic on aqs and we are indeed returning HTTP 200s :D [07:11:46] it seemed too good to be true [07:12:42] elukey: 2 levers used at same time (SSDs + compaction) --> strong improvement ! [07:13:07] elukey: CPU usage on new-aqs is close to 0 :) [08:50:13] schana: Hey [08:50:24] hi joal [08:50:40] schana: Can we move the meeting, like 15 minutes later? [08:50:44] sure [08:50:53] schana: my son is not yet awake to go to the creche :S [08:51:02] joal: take a look (when you have time) to poor varnish kafka in upload :( https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=10&fullscreen [08:51:07] no worries - gives me time for more coffee [08:51:12] schana: thanks :) [08:51:22] schana: will ping you when back from the creche [08:51:26] * elukey takes inspiration from schana [08:51:34] * elukey coffee [08:51:36] :) [08:51:39] :) [08:51:42] elukey: poor vk :( [08:51:48] coffee :) [08:51:57] Going to the creche ! [09:30:06] gwicke: latency is now even better without the old hosts serving traffic :) [09:35:27] schana: ready ! [09:35:54] schana: https://hangouts.google.com/hangouts/_/wikimedia.org/a-batcave [10:37:34] elukey: Hi ! [10:37:36] elukey: around? [10:40:09] sure [10:40:32] elukey: can you add schana in the analytics group in gerrit? 
[10:40:47] username: nschaaf [10:40:52] elukey: he's learning oozie (poor guy), so let's be nice and allow him to review :) [10:41:47] mmmm I think that we should create a separate group limited to certain things (for example, not allowed to merge) [10:41:59] (nothing against you schana sorry :) [10:42:00] elukey: isn't that the case? [10:42:14] well you are allowed to merge no? [10:42:38] I am, but I'm in analytics-dev [10:42:54] I don't know anything about gerrit elukey, so I let you manage [10:43:12] ahhh okok, so there are multiple ones [10:43:18] I'll check in a bit [10:43:20] thanks :) [10:43:25] elukey: But I don't know the various rights :( [10:44:19] elukey: I think Analytics group has every right in analytics project [10:44:47] elukey: But if you look at the people in analytics group, there already are a bunch [10:47:29] Analytics, Research-and-Data, Research-collaborations, Research-management: Oozie job to extract data for WDQS research - https://phabricator.wikimedia.org/T146064#2655584 (JAllemandou) a:schana [10:56:53] (PS1) Mforns: [WIP] Add 2 wikistats reports [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/311961 (https://phabricator.wikimedia.org/T141479) [11:09:21] so joal https://gerrit.wikimedia.org/r/#/admin/projects/analytics,access seems to indicate that 'Analytics' is owner [11:09:29] can submit, etc... [11:09:37] right elukey, saw that [11:09:53] but as you have pointed out a lot of people are in there [11:10:10] I think maybe we should have analytics-dev as owners, and analytics as potential review-submitters? [11:10:31] that would be create reference right? [11:11:01] elukey: I assume so, but I'm not sure [11:12:14] ok so I added schana to Analytics, it will not be a huge problem for the moment [11:12:21] I am going to send an email to the team [11:12:22] elukey: thanks ! 
[11:13:36] schana: You should be able to git review now :) [11:13:53] okay joal [11:14:11] schana: Please add me, ottomata and nuria as reviewers :) [11:14:21] (PS1) Nschaaf: Add Oozie job to extract data for WDQS research [analytics/refinery] - https://gerrit.wikimedia.org/r/311964 (https://phabricator.wikimedia.org/T146064) [11:14:52] schana: Do we spend some time testing? [11:15:18] sure [11:15:27] schana: to the batcave :) [11:29:31] * elukey lunch! [11:29:58] (PS2) Nschaaf: Add Oozie job to extract data for WDQS research [analytics/refinery] - https://gerrit.wikimedia.org/r/311964 (https://phabricator.wikimedia.org/T146064) [11:31:30] (PS2) Mforns: [WIP] Add 2 wikistats reports [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/311961 (https://phabricator.wikimedia.org/T141479) [11:43:10] Analytics-Cluster, Analytics-Kanban: Deploy hive-site.xml to HDFS separately from refinery - https://phabricator.wikimedia.org/T133208#2655683 (JAllemandou) hive-site.xml is available in any oozie worker but oozie can't read it as local file, it needs to be on HDFS. Therefore a puppet trick is needed for... [11:54:39] taking a break a-team [12:17:58] Analytics-Tech-community-metrics: Git repo blacklist config not applied on wikimedia.biterg.io? - https://phabricator.wikimedia.org/T146135#2651767 (Aklapper) a:Dicortazar [12:42:02] Analytics-Tech-community-metrics, Regression: Git repo blacklist config not applied on wikimedia.biterg.io? - https://phabricator.wikimedia.org/T146135#2655773 (Aklapper) [13:14:48] (PS3) Mforns: [WIP] Add 2 wikistats reports [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/311961 (https://phabricator.wikimedia.org/T141479) [13:44:28] milimetric, yt? [13:52:13] Hey mforns, just so that you know: I have a corrected version of our data in /user/joal/mwhist_2/denorm [13:52:28] joal, did it work? 
:] [13:52:48] Schema changes a bit (addition of revision_is_deleted and revision_is_reverted) [13:52:53] it didn't mforns :( [13:53:17] mforns: Seems like a confirmation that we need to optimize for performance :( [13:54:47] joal, but did the processing of the "reactivate" events work? [13:55:09] mforns: I don't have this bit done no [13:55:19] ok ok [13:55:44] mforns: update is not needed, was just letting you know :) [13:55:47] joal, on my side I also found a performance issue with wikistats metrics in hive with RU [13:55:53] joal, ok [13:56:15] mforns: Oh? [13:56:38] joal, that's what I wanted to explain milimetric, but if you're interested, we can talk [13:56:51] to the cave [13:56:53] ok [14:45:05] elukey: good for throttling changes? [14:45:42] nuria_: hola :) [14:45:48] elukey: ah hola!!! [14:45:49] did you see the latency changes? [14:46:02] elukey: got carried over.... [14:46:10] elukey: no, i just looked at sstables [14:46:17] elukey: let me look [14:47:41] elukey: wait ... [14:48:00] ahahha [14:48:13] me and joseph had the same reaction [14:48:21] I started to look what was wrong [14:48:33] but everything is working [14:48:34] elukey: righttt, [14:48:47] P99 is now ~100ms [14:48:49] elukey: but request ratios are even higher [14:48:53] elukey: correct? [14:49:38] elukey: Ahhh it totally makes sense, it is a sustain 20 which threshold value [14:49:57] it is only flying now [14:49:58] :D [14:49:59] elukey: wow, *PAT In the back to team* [14:50:42] elukey: ok, will bump threshold [14:51:56] super [14:52:04] maybe let's do it in small increase [14:52:07] just to be sure [14:52:18] *small increase steps [14:57:50] elukey: let's talk about it in standup. 
[15:01:16] a-team: standduppp [15:04:53] (correcting myself - P99 is max 1s now) [15:07:02] Analytics-Kanban, Easy, Patch-For-Review: Implement re-run script for reportupdater - https://phabricator.wikimedia.org/T117538#2656089 (Nuria) Open>Resolved [15:08:16] Analytics-Kanban, Patch-For-Review: Make reportupdater support passing the values of an explode_by using a file path - https://phabricator.wikimedia.org/T132481#2656090 (Nuria) Open>Resolved [15:09:19] Analytics-Backlog, Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: More solid Eventlogging alarms for raw/validated {oryx} - https://phabricator.wikimedia.org/T116035#2656093 (Nuria) Open>Resolved [15:09:39] Analytics-Kanban: Compare early results of Wikistats 2.0 with Wikistats 1.0 - https://phabricator.wikimedia.org/T141536#2656095 (Nuria) [15:09:41] Analytics-Kanban: Create clean simplewiki output from edit history reconstruction - https://phabricator.wikimedia.org/T143321#2656094 (Nuria) Open>Resolved [15:12:52] nuria_: it says we're not allowed to join the call anymore [15:12:59] andrew's trying [15:13:06] btw the table I was talking about is wikiproject_medicine_page_counts [15:13:08] milimetric: session expired [15:13:16] milimetric: try incognito [15:13:44] ottomata: just said I'd need some help on puppetizing copying hive-site.xml to hdfs [15:14:37] k [15:15:44] milimetric: can you guys hear? [15:16:56] Analytics-Kanban: Make top pages for WP:MED articles - https://phabricator.wikimedia.org/T139324#2656105 (Milimetric) The totals for all articles by month: July 2016: 179253171 August 2016: 190445556 The table is not accessible without an NDA, so I think only Amir has access to it. It's a Hive table, acce... [15:17:45] Analytics-Kanban: Missing pageviews dumps files for October 1 - 26 - https://phabricator.wikimedia.org/T146029#2656122 (Milimetric) (the files are there, I just checked, this can be closed) [15:18:29] milimetric & ottomata : can you hear us? 
[15:18:45] we can year you fine [15:18:49] hear* [15:18:50] nuria_: [15:18:57] we will type [15:20:48] nuria_: elukey: so if we go too low we'll get 429s? [15:21:24] or are we getting no 429s right now and we just want to increase until just before we start getting 500s? [15:23:17] thx [15:27:45] reportupdater could use spark [15:27:57] nuria_ / joal ^, exactly [15:28:43] oh we talking here [15:28:45] ja sounds fine [15:29:23] we should also think about a server that filters results from a larger file on demand [15:29:37] like, one file, all data in it, and a very simple filtering endpoint [15:30:49] +1 design meeting [15:31:04] I have more thoughts that intersect other problems I want to bring into this [15:33:15] or maybe not reportupdater, maybe an intermediate table would work actually :) [15:34:13] my point is: maybe we don't want to make all output files, we just make a single output file that can be queried by dashiki with a simple filter [15:34:58] I think what you're doing makes sense, but yeah, we might wanna think about the whole pipeline [15:35:54] trending edits [15:35:59] ottomata: wanted to talk about it [15:36:29] (and me, I always wanna talk about everything :)) [15:36:48] thx! [15:37:22] I'm interested in trending edits guys :) [15:45:58] Analytics-Kanban, EventBus, Wikimedia-Stream: Public Event Streams - https://phabricator.wikimedia.org/T130651#2656262 (GWicke) From a purely technical perspective, to me SSE looks like a better fit for a consumption use case. It avoids leaving the REST / HTTP2 world, which makes it easier to set u... [15:56:10] joal: check out this ticket and subtickets if you are interested [15:57:28] https://phabricator.wikimedia.org/T140102 [16:11:13] Analytics-Kanban, EventBus, Wikimedia-Stream: Public Event Streams - https://phabricator.wikimedia.org/T130651#2656342 (Nuria) >we could consider offering both SSE (in REST API) & socket.io (separate domain, as non-REST). 
Given that the prototype is written with socket.io and working well and that SS... [16:13:59] thx ottomata, will check [16:26:09] mobrovac: restbase is still only on github so to change throttling for pageview api we need a pull request correct? [16:26:28] yup nuria_ [16:30:53] Analytics-Kanban, EventBus, Wikimedia-Stream: Public Event Streams - https://phabricator.wikimedia.org/T130651#2656395 (GWicke) Background on different streaming options: - https://banksco.de/p/state-of-realtime-web-2016.html - https://samsaffron.com/archive/2015/12/29/websockets-caution-required... [16:42:27] mobrovac: thank you, see if this makes sense: https://github.com/wikimedia/restbase/pull/679/commits/ffa89575d67b419d08acbca4213b56d7afd2eb7f [16:45:21] nuria_: commented on the PR [16:45:46] mobrovac: k, changing [16:56:58] mobrovac: sorry, forgot about the multiple places, changed now [17:00:50] k [17:16:40] hello analytics team! [17:16:47] o/ [17:16:57] as of this week, precise self hosted puppetmasters are even less supported than they were [17:17:02] and there's pretty much only one left, which is... limn1 [17:17:10] nuria_: the pr still says the limit is 100, but in the comments you say that you guys settled on 50 [17:17:31] but the labs team has already disavowed any support for limn1 (https://phabricator.wikimedia.org/T101763#1864407), so we're not going to do anything about it (puppet also has been disabled there for 10 months now!) [17:17:33] just a heads-up :) [17:19:48] elukey: o/ [17:21:58] milimetric: nuria_ ^ just pinging to make sure you saw my announcement re: limn1 :) [17:27:33] got it yuvipanda, we have some plans to migrate things away from that instance [17:28:18] elukey: ok! 
do remember that it can die at any point and might not come back up, and we won't be able to support you :) [17:30:04] sure :) [17:34:15] Analytics-Kanban, Patch-For-Review: Productionize Pivot UI - https://phabricator.wikimedia.org/T138262#2395083 (Ottomata) OOOk, Dan and I spent some time understanding how pivot initializes itself, and what would be required to integrate it with service-runner well. Its a little nasty. pivot is written... [17:40:59] mobrovac: wait, i changed the commit message & limits, sorry, maybe I did not sent that CR? ah, I see, git misshap [17:41:36] mobrovac: cause this is NOT gerrit, man... my fault, sorry again [17:43:05] !log installed varnishkafka 1.0.12-1 on cp3034.esams [17:43:11] cc ottomata --^ [17:43:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [17:43:21] :) [17:43:25] kafkacat from stat1002 (filtered for cp3034) looks good [17:43:28] great [17:44:13] I'll let it boil for this night and then upload it to reprepro if nothing explodes [17:44:14] but just wanted to let you know in case of fire :) [17:47:36] team logging off! [17:47:40] talk with you tomorrow :) [17:47:43] byyeeeeee o/ [17:56:19] bravo on the new AQS cluster! Just did some rigorous testing and everything ran very smoothly :) you all are awesome!!! :D [17:57:07] thanks musikanimal, that's great to hear, elukey, joal, nuria, much props for getting that done [18:00:30] Analytics-Kanban: Make top pages for WP:MED articles - https://phabricator.wikimedia.org/T139324#2656805 (Doc_James) Perfect thanks. Confirms that the overall decrease in readership is smallish. 
https://en.wikipedia.org/wiki/Template:WikiProject_Medicine/Popular_pages/Total [18:13:47] (CR) Nuria: [C: -1] Add Oozie job to extract data for WDQS research (5 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/311964 (https://phabricator.wikimedia.org/T146064) (owner: Nschaaf) [18:15:03] joal: I noted on WDQS CR that we cannot have IPs in the table, everything else I think looks good.. cc schana [18:27:10] so question... before I go crazy and update all the Pageviews tools, what do you think a reasonable cap would be for number of subsequent requests (at 10 req/s)? I'll explain... [18:27:51] right now Massviews and two other tools limit processing to 500 pages. There's also some logic to "schedule a retry" of any requests that failed due to the Cassandra backend error [18:28:00] those errors don't appear to be happening at all anymore [18:28:43] all perfect 200s, no lag... which is *awesome* [18:29:10] so would it be unreasonable to increase the cap to say, 5000 pages? so 5000 requests in perfect succession, 10 per second [18:30:37] Analytics: Kill limn1 - https://phabricator.wikimedia.org/T146308#2656945 (Nuria) [18:30:55] I shall note that there is only a small population of users that would need data on that many pages, namely GLAM [18:31:01] Analytics: Migrate the simplest limn dashboards to dashiki tabular {frog} - https://phabricator.wikimedia.org/T126358#2656963 (Nuria) [18:31:12] Analytics-Kanban: Kill limn1 - https://phabricator.wikimedia.org/T146308#2656945 (Nuria) [18:37:00] musikanimal: well, if the rate is 10 reqs/sec you will not run into problems, provided that requests can return in a reasonable amount of time. Seems like your code assumes that requests might return in a sec, which might not be the case. Doing client side retry code is not easy, and if you do not have a decay for your retries it's kind of more like a DoS [18:37:01] than a retry. 
Now, pageview api should be able to sustain 10 reqs/sec fresh (no cache) without problem but that doesn't mean that you will not run into trouble as there are other users of the API. Makes sense? [18:37:56] musikanimal: let me know if i need to provide more detail. [18:37:58] it does, thank you [18:38:13] the code currently puts every request in a queue, so you never get more than 10 req/s [18:38:21] including the retries [18:38:54] musikanimal: and it waits for prior requests to come back? [18:39:22] that it does not... heh [18:39:30] musikanimal: but I do not want to overcomplicate the situation, it is fine as it is. [18:40:05] if it allows me to increase the cap dramatically I'm happy to do the work [18:40:05] musikanimal: what i am saying is that "issuing requests at 10 req/sec" doesn't translate to the receiving end seeing that rate at all times. [18:40:13] sure [18:40:35] musikanimal: but, as i said, no need to overcomplicate things. [18:40:56] but you're saying I should wait for those 10 to complete before doing another 10? or at least when doing 5,000+ requests? [18:41:58] musikanimal: what i am saying is that to truly sustain 10 reqs/sec your code needs to keep track of requests in flight, that makes matters more complicated and probably in your case is unneeded. [18:42:30] gotcha, well I'll hold off for now... I should find out pretty quickly if this is a problem [18:42:46] and like I said most of the time we're working with maybe 100 or so requests [18:43:22] 5,000+ is an edge case. I guess there will be the curious users who put in Category:Living_people or something ridiculous like that [18:44:19] I'll cap at 5,000 for now, which will suffice for most people, I think [18:44:38] thank you for your help! :) [18:47:35] Analytics-Kanban, EventBus, Wikimedia-Stream: Public Event Streams - https://phabricator.wikimedia.org/T130651#2657025 (Milimetric) Awesome articles, @GWicke. 
I generally agree that it's ok to switch transports later, if the world changes around us. But if we can make a good guess now, it'd save us... [18:52:22] Analytics-Dashiki, Analytics-Kanban, Need-volunteer: Vital-signs layout is broken - https://phabricator.wikimedia.org/T118846#2657031 (Milimetric) [19:31:36] schana: Please let me know if you need help learning how to test oozie, much documentation here: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Oozie [19:31:37] Analytics-Kanban, EventBus, Wikimedia-Stream: Public Event Streams - https://phabricator.wikimedia.org/T130651#2657136 (Ottomata) Yeah, those are great articles, thanks Gabriel. I've got quite a bit of decisions paralysis now! I like that auto-resume via id feature of SSE/EventSource, but aside f... [19:39:06] Analytics-Kanban, EventBus, Wikimedia-Stream: Public Event Streams - https://phabricator.wikimedia.org/T130651#2657166 (GWicke) > So, question: Is it possible for us to easily stream perhaps \r\n deliminted JSON objects via HTTP in a sane way now? And, what would that look like? Is using the Node htt... [20:28:52] Analytics-Kanban, EventBus, Wikimedia-Stream: Public Event Streams - https://phabricator.wikimedia.org/T130651#2657362 (Ottomata) The upside is that the stream itself doesn't need to be a special format. It is just JSON objects. SSE would be useful if we needed to handle multiple different types of... [20:29:41] Analytics-Kanban, EventBus, Wikimedia-Stream: Public Event Streams - https://phabricator.wikimedia.org/T130651#2657364 (Nuria) >I could be really wrong about this, but HTTP/2 Push seems like it is made for a different use case (app notifications) than a streaming firehose of data. I think we might be... 
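[editor's note] The T130651 comments above weigh plain delimited JSON against SSE, and mention the "auto-resume via id" feature of SSE/EventSource. As a rough illustration of what that framing looks like on the wire, here is a minimal Python sketch. It is not the team's prototype (which the log says was written with socket.io); the function names, the `event_id` value and the payload are made up for the example. It relies only on the standard SSE text format: `id:`, `retry:` and `data:` fields, with a blank line ending each event. On reconnect an EventSource client sends the last id it saw in a `Last-Event-ID` header, which is what makes resuming possible.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def format_sse(data: dict, event_id: Optional[str] = None,
               retry_ms: Optional[int] = None) -> str:
    """Serialize one JSON object as a single Server-Sent Events frame."""
    lines = []
    if retry_ms is not None:
        lines.append(f"retry: {retry_ms}")   # reconnect delay hint for the client
    if event_id is not None:
        lines.append(f"id: {event_id}")      # client echoes this as Last-Event-ID
    lines.append(f"data: {json.dumps(data)}")  # json.dumps emits no raw newlines
    return "\n".join(lines) + "\n\n"         # blank line terminates the event


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[Optional[str], dict]]:
    """Yield (last_event_id, payload) for each frame in a stream of text lines."""
    last_id = None
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":                       # blank line: dispatch buffered event
            if data_parts:
                yield last_id, json.loads("\n".join(data_parts))
                data_parts = []
            continue
        if line.startswith(":"):             # comment line, often used as keep-alive
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]                # one leading space is not part of the value
        if field == "data":
            data_parts.append(value)
        elif field == "id":
            last_id = value                  # persists until a later frame changes it
        # "event" and "retry" fields are ignored in this sketch
```

The point of the sketch is that the id travels next to the JSON payload rather than inside it, so the payload itself stays "just JSON objects".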
[20:31:19] Analytics-Cluster, Operations, ops-eqiad: decom titanium - https://phabricator.wikimedia.org/T145666#2657378 (Dzahn) [23:37:48] Analytics-Kanban, EventBus, Wikimedia-Stream: Public Event Streams - https://phabricator.wikimedia.org/T130651#2657910 (GWicke) @ottomata, we don't need multiple kinds of messages, but we do need retry functionality, and don't necessarily want to stuff implementation dependent / transient ids into th...
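[editor's note] Back on the 18:27 to 18:44 exchange between musikanimal and nuria_: the two things nuria_ asks for are a "decay" on retries and keeping track of requests in flight, because issuing 10 req/s does not mean the server sees 10 req/s when responses are slow. The following is one possible Python sketch of that idea, not the actual Massviews code (which is not shown in the log). `run_throttled`, its parameters and the `fetch` callable are all hypothetical names chosen for illustration.

```python
import asyncio
import random


async def run_throttled(items, fetch, rate=10, max_in_flight=10,
                        max_retries=3, base_delay=0.5):
    """Call `await fetch(item)` for every item.

    New items are started at most `rate` per second, and no more than
    `max_in_flight` items are unanswered at any moment. Failed calls are
    retried with exponential backoff plus jitter. An item keeps its
    in-flight slot while it waits to retry, so retries cannot pile up
    on top of new requests without bound.
    """
    in_flight = asyncio.Semaphore(max_in_flight)
    results = {}

    async def worker(item):
        try:
            for attempt in range(max_retries + 1):
                try:
                    results[item] = await fetch(item)
                    return
                except Exception as err:
                    if attempt == max_retries:
                        results[item] = err   # give up, record the last error
                        return
                    # delay doubles on each attempt; jitter spreads retries out
                    delay = base_delay * (2 ** attempt) * (1 + random.random())
                    await asyncio.sleep(delay)
        finally:
            in_flight.release()

    tasks = []
    for item in items:
        await in_flight.acquire()   # waits here while too many are unanswered
        tasks.append(asyncio.create_task(worker(item)))
        await asyncio.sleep(1 / rate)   # pace the start of new requests
    await asyncio.gather(*tasks)
    return results
```

With a slow backend the semaphore, not the timer, becomes the limiting factor, so the client slows down instead of stacking up unanswered requests. Whether this extra machinery is worth it for the typical ~100 request case is exactly the trade-off nuria_ raises.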