[00:09:03] bearloga: hope i got this right https://www.mediawiki.org/w/index.php?title=Wikimedia_Product&diff=2196771&oldid=2196739 https://www.mediawiki.org/w/index.php?title=Wikimedia_Product&diff=2196854&oldid=2196771
[00:14:39] HaeB: ?
[00:17:38] bearloga: it seems you only intended to change one metric (in the discovery section), but accidentally reverted most of the rest of the page to an earlier version in the process
[00:18:10] i tried to fix it, but wanted to make sure i didnt erase any intentional changes in the process
[00:18:11] HaeB: OOF! Sorry about that!!!
[00:18:32] no worries!
[00:18:51] it's a wiki ;)
[00:19:14] HaeB: :) yeah, the intentional changes look alright! thank you!
[00:33:11] Analytics, Editing-Analysis, Performance-Team, VisualEditor, Graphite: Statsv down, affects metrics from beacon/statsv (e.g. VisualEditor, mw-js-deprecate) - https://phabricator.wikimedia.org/T141054#2485441 (ori) Odd: ``` $ service statsv status ● statsv.service - statsv Loaded: loaded (...
[07:15:02] gooood morning!
[07:15:06] we start the day with
[07:15:07] Notice: /Stage[main]/Role::Analytics_cluster::Refinery::Source/Git::Clone[refinery_source]/Exec[git_pull_refinery_source]/returns: fatal: empty ident name (for ) not allowed
[07:16:41] now after " Tried to deploy refinery-source using jenkins but there is a config issue" I suspect that something has been done to the refinery repo right? joal?
[07:17:54] even if it seems a git config
[07:17:56] mmmmm
[07:21:18] ah but this is refinery_source
[07:21:24] so definitely something weird is happening
[08:00:48] now I am not sure why user.name or user.email is not there anymore
[08:08:34] but these are in /var/lib/stats/.gitconfig
[08:14:23] https://phabricator.wikimedia.org/T141062
[08:14:36] joal: let me know when you have a minute to chat about stat1002
[08:17:25] completely different subject: from https://grafana.wikimedia.org/dashboard/db/aqs-cassandra-system I can see that before the last load job the cassandra sstables size went down to ~140GB per instance, that is more or less what we were expecting..
[08:17:53] also last compaction last 3 days more or less, and this seems to be a huge result
[08:27:10] Analytics-Cluster, Analytics-Kanban, Deployment-Systems, scap, and 2 others: Deploy analytics-refinery with scap3 - https://phabricator.wikimedia.org/T129151#2486313 (elukey) I can see other keys expired with gpg --list-keys. @mark, @yuvipanda, @chasemp: would you mind to double check your gpg k...
[08:27:14] Analytics, Editing-Analysis, Performance-Team, VisualEditor, Graphite: Statsv down, affects metrics from beacon/statsv (e.g. VisualEditor, mw-js-deprecate) - https://phabricator.wikimedia.org/T141054#2485441 (fgiunchedi) the dns failures were likely due to network maintenance, at some point w...
[08:31:10] Hi elukey
[08:31:16] Here I am !
[08:33:44] addshore: afaiu https://phabricator.wikimedia.org/T140342 is done right?
[08:33:47] joal: o/
[08:33:57] elukey: \o
[08:34:00] I left some ramblings as always for you
[08:34:09] I have read yes
[08:34:27] elukey: I don't see any reason for the git issue being related to what I experienced yesterday
[08:34:51] gooood
[08:35:13] what a joy
[08:35:15] Yesterday I didn't even deploy: I looked at the jenkins proposed config, and there was something wrong
[08:35:24] So I left it there :)
[08:36:00] madhuvishy said it was fix, so I'll check now, but won't deploy (Friiiiiiidays ...)
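The "empty ident name" failure above typically surfaces when git has to create a commit (for example a merge produced by `git pull`) and no user.name/user.email is in scope; elukey notes the values live in /var/lib/stats/.gitconfig, so another possibility is the puppet Exec running without the stats user's environment, or the clone having drifted (see later in the day). A minimal troubleshooting sketch, assuming the paths from the chat — the identity values at the end are illustrative, not the ones actually used in production:

```
# As the stats user on stat1002, check what identity git resolves for the clone;
# empty output here reproduces "fatal: empty ident name" as soon as git needs
# to record a committer (e.g. a merge commit during the puppet-driven pull).
cd /a/refinery-source
git config user.name
git config user.email

# The expected values live in the stats user's global config:
cat /var/lib/stats/.gitconfig

# Illustrative fix only — the real name/email are an assumption:
git config --global user.name  "stats"
git config --global user.email "stats@stat1002.eqiad.wmnet"
```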
[08:36:15] so very stupid question
[08:36:24] indeed now the jenkins config looks good
[08:36:39] addshore: I'll deploy on Monday - first thing
[08:36:48] the refinery needs refinery_source in which we have all the jars, built via Jenkins by Madhu's bot right?
[08:37:09] elukey: I'll write my understanding of it:
[08:37:17] * elukey grabs a coffee
[08:38:49] refinery-source is java/scala code - It is deployed as jars in archiva (various maven submodule) - The deploy is now done by jenkins (see https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Refinery-source#How_to_deploy_with_Jenkins)
[08:40:15] Analytics-Cluster, Operations: stat1002 - puppet fails to git pull refinery_source - https://phabricator.wikimedia.org/T141062#2486325 (Gehel) p:Triage>High It seems that there are local commits on stat1002: ``` stats@stat1002:/a/refinery-source$ git status On branch master Your branch and 'ori...
[08:40:30] refinery is oozie-config/python scripts - It will soon be deployed using scap3 ;) - It references jars from refinery-source using git-fat
[08:40:37] Joel, awesome!
[08:41:11] addshore: :)
[08:41:30] addshore: I'd have done yesterday but we are still discovering the new deploy process :)
[08:41:32] I also may have another couple of jobs in the works if I can find the time!
[08:41:36] I saw ;)
[08:42:22] elukey: One thing I don't get is: WHY do stat1002 pull refinery-source???
[08:42:43] elukey: references to functions in the repo are always done through jars
[08:44:05] I have no idea
[08:44:16] it ensures that refinery source is up to date
[08:45:08] elukey: seems tech-debt ... We should check that no cron jobs use that folder (/a/refinery-source) and probably kill the puppet sync
[08:45:20] elukey: BEtter to wait for ottomata though :)
[08:45:54] yeah
[08:48:05] elukey: also, you're right about cassandra data-size: we went down to about 130Gb per instance, which is what was expected
[08:48:22] elukey: I commented here: https://phabricator.wikimedia.org/T140866
[08:49:28] elukey: completely different topic: thanks for the support yesterday
[08:49:35] elukey: I was worried
[08:49:42] elukey: Today, everything back to normal !
[08:49:45] :D
[08:54:27] I need to investigate what was going on, but maybe if everything is ok I can postpone.. my guess is that some timeouts are happening for some requests and this affects writes too
[08:54:31] maybe new traffic etc..
[08:54:40] I want also to check the pageview dashboar
[08:54:44] dashboard
[08:54:46] elukey: right
[08:55:07] joal: also the compaction seems to have completed in 3 days right?
[08:55:13] moderately good news?
[08:55:22] that was one month of data right?
[08:56:12] elukey: first month of loading: 2.3days global (load + compact)
[08:56:19] * elukey dances
[08:56:23] now, waiting for second month
[08:56:39] let's stop after the second so I'll be able to restart the nodes ok?
[08:57:03] also: when changing compactors and compaction throuput, the compaction goes faster (not dramatically, but faster)
[08:57:35] elukey: If it ends during weekend, I'll launch 3rd month, if it's after, we'll wait :)
[08:57:50] sure :)
[08:58:56] Interestingly, the change in compRESSION makes the number of compACTION tasks divided by two ! (seems correlated with data size being divided by two as well :)
[08:59:00] elukey: --^
[08:59:24] (PS1) Addshore: Catwatch script ignore labtestwiki db [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/300509
[08:59:26] that is super good
[08:59:43] something is finally going in the right direction
[09:00:07] (PS1) Addshore: Catwatch script ignore labtestwiki db [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/300510
[09:00:14] (CR) Addshore: [C: 2] Catwatch script ignore labtestwiki db [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/300510 (owner: Addshore)
[09:00:23] (CR) Addshore: [C: 2] Catwatch script ignore labtestwiki db [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/300509 (owner: Addshore)
[09:00:36] elukey: now we need to see if load grows linearly with already-loaded-data-size, or if it flattens
[09:00:49] (Merged) jenkins-bot: Catwatch script ignore labtestwiki db [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/300510 (owner: Addshore)
[09:01:00] (Merged) jenkins-bot: Catwatch script ignore labtestwiki db [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/300509 (owner: Addshore)
[09:02:27] elukey: crazy cool : http://www.theverge.com/2016/7/21/12246258/google-deepmind-ai-data-center-cooling
[09:03:25] wow
[09:06:27] Analytics-Cluster, Operations: stat1002 - puppet fails to git pull refinery_source - https://phabricator.wikimedia.org/T141062#2486357 (elukey) We have a jenkins job (https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Refinery-source) that releases new refinery source jars to Archiva so I am not sure...
[09:06:46] Analytics-Cluster, Analytics-Kanban, Operations: stat1002 - puppet fails to git pull refinery_source - https://phabricator.wikimedia.org/T141062#2486359 (elukey)
[09:07:32] addshore: https://phabricator.wikimedia.org/T140342 - done?
[09:08:01] not done yet!
[09:08:25] elukey: https://gerrit.wikimedia.org/r/#/c/299522/
[09:12:39] addshore: ah sure it makes sense :)
[09:12:49] I don't recall if it was discussed in a ops meeting though
[09:13:01] because each sudo needs to be discussed in there first
[09:13:50] will chat with Andrew but I don't see any issue
[09:13:55] worst case will be a merge on Monday
[09:14:02] per dzahn in https://gerrit.wikimedia.org/r/#/c/298928/ 18 July "looks alright, in ops meeting it was said that analytics should ack it" but not sure what was said of course :)
[09:14:09] Monday would be awesome! :)
[09:14:45] yeah probably even today, but Monday is the worst case scenario since we'll have the ops meeting
[09:58:12] joal: does https://graphite.wikimedia.org/S/Bg make any sense to you?
[09:59:12] elukey: hmmm . not really
[09:59:24] and also https://graphite.wikimedia.org/S/Bh
[10:00:25] If rate is what I expect, I am seeing 5K r/s ?
[10:00:40] for 2xx GETs
[10:01:08] sample_rate seems completely misleading
[10:01:13] mobrovac: aloha
[10:01:24] elukey: can't be per sec --> we have thotlling way lower than that
[10:01:29] do you have time to enlight a restbase user about your metrics?
[10:02:19] joal: maybe those are req served from frontend restbase? And throttling is only for misses?
[10:02:27] elukey: what's up?
[10:02:54] nope elukey, throtlling os for reqs per sec per ip I think
[10:03:40] mobrovac: is there any doc related to the metrics generated by restbase for a service like AQS?
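The short graphite.wikimedia.org/S/... links above hide the underlying targets, so one way to make the `.rate` vs `.sample_rate` comparison concrete is to pull both series for the same window through graphite's render API and compare the scales side by side. A sketch, assuming curl access to graphite.wikimedia.org; the metric path is a placeholder guess at the restbase.external hierarchy being discussed, not a verified name — only the `target`, `from` and `format` parameters are standard graphite:

```
# Placeholder metric path — adjust to whatever the saved graph actually plots.
METRIC='restbase.external.sys_pageviews.GET.2xx'
for series in rate sample_rate; do
  echo "== ${series} =="
  curl -s "https://graphite.wikimedia.org/render?target=${METRIC}.${series}&from=-24h&format=json" \
    | python -m json.tool | head -n 20
done
```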
[10:03:55] for example https://graphite.wikimedia.org/S/Bg confuses me a lot
[10:04:05] the diff between sample_rate and rate is a bit weird
[10:04:11] but I am sure that I am missing something
[10:04:28] so all routes are automatically exposed through metrics
[10:05:00] the first part is the prefix - restbase.external - that means that the requests have been made by external clients (not in our prod)
[10:05:08] and then you have the endpoint
[10:05:21] (also I just discovered https://grafana.wikimedia.org/dashboard/db/restbase)
[10:05:50] yes, that one is really useful
[10:07:04] mobrovac: what is the diff between .rate and sample_rate? I can see 2xx gets rate peaking to 5K meanwhile the sample_rate is orders or magnitude lower
[10:07:08] it is a bit weird
[10:07:39] iirc, sample_rate is averaged over a sliding window
[10:07:47] so peeks are evened out
[10:08:13] hm no
[10:08:16] that's not it
[10:08:17] hm
[10:08:28] and those metrics are related to the AQS endpoint right? Not restbase
[10:08:50] those metrics tell you what RB sees when it makes reqs to AQS
[10:09:13] here, it's the req rate, and there are others that tell you the latency as perceived by RB, not the client
[10:10:15] sure sure
[10:10:43] I would have expected sample_rate to be less precise but not orders of magnitude different
[10:11:10] also 5k r/s (even peaks) don't make a lot of sense for AQS
[10:11:19] the cluster would have been burned out way before
[10:12:03] hmmm
[10:16:10] probably I am missing something trivial
[10:16:16] this is why I was asking for docs
[10:26:39] joal: https://grafana.wikimedia.org/dashboard/db/aqs-elukey
[10:27:32] elukey: I use this one (not that far) https://grafana.wikimedia.org/dashboard/db/pageviews
[10:30:09] I know :)
[10:30:15] but it is not complete
[10:30:23] I forked a new one to show my point
[10:30:46] we don't show top/aggregate and uniques (even if much smaller of course) so we can't see spikes
[10:31:53] anyhow, I am still completely puzzled about these metrics
[10:32:20] :)
[10:32:26] mobrovac while you're not far
[10:32:31] mobrovac: hiiiiii :)
[10:33:03] mobrovac: Do you have an idea why cassandra has a "rest" period after loading before starting heavy compaction ?
[10:33:17] mobrovac: Is that because compactions are triggered daily or something?
[10:33:31] mobrovac: If you don't know I'll ask urandom later today :)
[10:39:34] afaik compactions are periodic, but also triggered by certain events, such as a surge of writes
[10:39:52] also, each node manages its compactions individually
[10:46:12] https://grafana-admin.wikimedia.org/dashboard/db/aqs-elukey - better
[10:46:24] still weird sample_rate vs rate values
[10:46:28] will try to investigate
[11:31:13] addshore: code review merge
[11:31:15] *merged :)
[11:31:49] thanks elukey ! :)
[11:34:24] super welcome
[11:57:02] * elukey lunch!
[12:47:08] mooorning!
[12:48:10] ottomata: Hellooooo
[12:48:22] (CR) Ottomata: [C: 1] Initial basic configuration for the Refinery Scap repository. [analytics/refinery/scap] - https://gerrit.wikimedia.org/r/299714 (https://phabricator.wikimedia.org/T129151) (owner: Elukey)
[12:48:26] ottomata: When you have a minute, do we go over schemas?
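Since, as mobrovac says, each Cassandra node schedules its own compactions, the dashboards can be cross-checked directly on the aqs hosts with standard nodetool commands; this is also where the compaction-throughput knob joal has been tuning lives. A sketch, assuming shell access to an aqs node — the JMX host/port are placeholders, and the multi-instance setup there may need different `-h`/`-p` values per instance:

```
# Point nodetool at one Cassandra instance (host/port are placeholders).
NODETOOL="nodetool -h localhost -p 7199"

$NODETOOL compactionstats                  # active compactions + pending tasks
$NODETOOL compactionhistory | head -n 20   # recently completed compactions
$NODETOOL getcompactionthroughput          # current throttle in MB/s

# Raising the throttle is the "compaction goes faster" experiment mentioned
# above (0 disables throttling entirely); 128 is just an example value:
$NODETOOL setcompactionthroughput 128
```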
[12:49:17] joal: might be better for me to do monday, if you don't mind, i don't think i'll get to work on them today anyway
[12:49:30] Good for me :)
[13:02:00] Analytics-Cluster, Analytics-Kanban, Operations: stat1002 - puppet fails to git pull refinery_source - https://phabricator.wikimedia.org/T141062#2486860 (Ottomata) Open>Resolved a:Ottomata refinery-source is cloned by `role::analytics_cluster::refinery::source`, and mainly exists just to...
[13:03:54] Analytics-Cluster, EventBus, Operations, Services: Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters. - https://phabricator.wikimedia.org/T123954#2486867 (Ottomata)
[13:04:50] Analytics, Patch-For-Review: Upgrade kafka main clusters - https://phabricator.wikimedia.org/T138265#2486868 (Ottomata)
[13:10:08] Analytics-Cluster, Analytics-Kanban, Operations: stat1002 - puppet fails to git pull refinery_source - https://phabricator.wikimedia.org/T141062#2486877 (Gehel) As far as I can see, someone ran `mvn release:prepare && mvn release::perform` which does a bit more than `mvn package`. The release will cr...
[13:13:13] ottomata: o/ https://phabricator.wikimedia.org/T141062 is still a bit weird to me
[13:13:42] elukey: it is to me a little bit too
[13:14:19] is there any doc about allowed procedures in there? I am super ignorant about maven so it might be straightforward
[13:14:28] Analytics-Cluster, Analytics-Kanban, Operations: stat1002 - puppet fails to git pull refinery_source - https://phabricator.wikimedia.org/T141062#2486881 (Ottomata) Ah! I bet Madhu did this when she was developing jenkins deployments. Not sure.
[13:14:55] yes I was thinking the same --^
[13:15:38] elukey: if i remember correctly, we created a refinery-source clone just so we wouldn't have to copy it or clone it ourselves all the time if we wanted to test building
[13:15:39] HMMMM
[13:15:46] or maybe we did make it to do releases...
[13:15:59] so we wouldn't have to wait so long to upload jars from our laptops to archiva
[13:16:03] (PS1) Addshore: Actually run second betafeatures query [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/300535
[13:16:15] (PS1) Addshore: Actually run second betafeatures query [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/300536
[13:16:20] but theoretically now with Madhu's machinery we shouldn't need it anymore right?
[13:16:21] (CR) Addshore: [C: 2] Actually run second betafeatures query [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/300536 (owner: Addshore)
[13:16:31] right
[13:16:35] (CR) Addshore: [C: 2] Actually run second betafeatures query [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/300535 (owner: Addshore)
[13:16:52] (Merged) jenkins-bot: Actually run second betafeatures query [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/300536 (owner: Addshore)
[13:17:11] ok so I'll ask to her later on
[13:17:11] (Merged) jenkins-bot: Actually run second betafeatures query [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/300535 (owner: Addshore)
[13:17:20] surely we can think about nuking the repo
[13:17:37] hm, perhaps! doesn't sound like we use it
[13:18:03] all right
[13:18:06] maybe a post-standup
[13:58:26] (PS7) Milimetric: [WIP] Process Mediawiki page history [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295693 (https://phabricator.wikimedia.org/T134790)
[13:59:23] (CR) jenkins-bot: [V: -1] [WIP] Process Mediawiki page history [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295693 (https://phabricator.wikimedia.org/T134790) (owner: Milimetric)
[14:29:15] Analytics-Cluster, Analytics-Kanban: Puppetize and deploy MirrorMaker using confluent packages - https://phabricator.wikimedia.org/T134184#2487144 (Ottomata)
[14:32:48] ottomata: I have scala messing with my head :)
[14:34:47] * elukey is scared
[14:34:50] :P
[14:35:07] * joal just had a computing alucination
[14:35:21] s/al/hal
[14:37:12] elukey: IT DID AGAIN !!!!!
[14:38:37] joal: hahaha
[14:38:39] yeah?
[14:38:49] ottomata: batcave for screenshare?
[14:38:53] sure
[14:40:30] trying...
[14:41:19] ottomata: no good
[14:41:23] no
[14:41:23] hm
[14:41:29] seems to be internet
[14:41:31] ok, i want to run home before standup anyway
[14:41:34] gimme 15 mins?
[14:41:41] I'll be catching Lino
[14:41:43] oh
[14:41:45] hm
[14:41:55] ottomata: I'll catch up after Lino's diner :)
[14:42:07] o
[14:42:07] k
[14:42:16] I mean, what I have seen I have been able repro, so I'm confident it'll do it again :)
[14:42:35] * joal has seen a sparklucination
[14:50:22] * elukey has been alucinating the whole afternoon doing ops tickets and listening to Autechre
[14:50:45] * elukey blames joal
[14:51:14] * joal is always a good culprit :)
[14:53:41] * joal has finally found the reason of the hallucinations
[14:54:00] * joal can now get back to normal life
[14:54:24] * elukey is happy for joal
[14:57:35] madhuvishy: hellooooooooo! We were wondering if you played with the refinery_source on stat1002 yesterday
[14:59:01] mmmm from last it sems that you connected only on Jul 19
[14:59:37] we are wondering what happened in T141062, if you have time to review it let me know :)
[14:59:37] T141062: stat1002 - puppet fails to git pull refinery_source - https://phabricator.wikimedia.org/T141062
[15:15:34] elukey: I haven't touched stat1002 or refinery-source in a while
[15:17:21] madhuvishy: o/
[15:17:32] yeah I checked last access only after writing
[15:18:05] And I didn't try to play with the maven release things on stat1002 ever too. Not sure what was up
[15:22:41] yeah, super weird.. do you think that we still need that repo on stat1002 after your jenkins awesome machinery?
[15:22:44] madhuvishy: --^
[15:22:53] I was thinking to propose a nuke
[15:29:25] elukey: you might also be interested in https://grafana.wikimedia.org/dashboard/db/pageviews
[15:30:04] re compactions, that depends heavily on which compaction strategy you are using
[15:39:36] elukey: I had no idea that the repo even existed there
[15:39:46] Yeah I'm not sure why its needed
[16:00:39] a-team: still in a meeting, will be a bit late to standup sorry!
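Gehel's finding in T141062 (someone ran `mvn release:prepare`, which commits version bumps and tags locally) fits the morning's symptoms: a clone that has diverged from origin can no longer fast-forward, and a pull that then tries to create a merge commit with no identity configured fails with exactly the "empty ident name" error puppet reported. A sketch of how one might confirm and clean that up, assuming the clone at /a/refinery-source and that nothing local is worth keeping — the reset at the end is destructive:

```
cd /a/refinery-source
git fetch origin

# Commits that exist only in the local clone (mvn release:prepare usually
# leaves "[maven-release-plugin] prepare release ..." commits behind):
git log --oneline origin/master..master

# Tags the release plugin may have created locally:
git tag --list

# If the local commits are disposable (assumption!), drop the divergence so
# the puppet-managed pull can fast-forward again:
git reset --hard origin/master
```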
[16:22:12] elukey: we had a repo on 1002 to make it easy to test changes
[16:22:19] cc ottomata
[16:22:42] Analytics, Reading-analysis, Research-and-Data, Research-consulting: Report on Wikimedia's industry ranking - https://phabricator.wikimedia.org/T141117#2487657 (leila)
[16:22:47] as sometimes testing a change involved scp-ing quite a few of jars
[16:38:55] Analytics, Reading-analysis, Research-and-Data, Research-consulting: Report on Wikimedia's industry ranking - https://phabricator.wikimedia.org/T141117#2487747 (leila)
[17:05:14] Analytics, MediaWiki-API, User-bd808: Run ETL for wmf_raw.ActionApi into wmf.action_* aggregate tables - https://phabricator.wikimedia.org/T137321#2487915 (bd808) p:Triage>Normal
[17:07:08] nuria_: hola! Do you think that the repo on stat1002 would be still needed?
[17:07:17] if not we can nuke it
[17:07:32] elukey: for development, ya, i think it is useful so you can build locally
[17:08:10] elukey: but on meeting now, i can talk on more detail later to make sure i understand teh details
[17:08:12] *the
[17:08:30] sure sure! I'll bring it up on monday's standup :)
[17:17:20] going afk team! byyyeeee
[17:17:23] have a good weekend!
[17:20:11] laters!
[17:20:39] Have a good weekend elukey :)
[17:20:55] ottomata: I found the hallucinogen, no need to bother you anymore :)
[17:21:08] madhuvishy: o/
[17:21:45] haha, ok joal!
[17:21:49] am here if you need me
[18:20:02] Analytics, Analytics-EventLogging, FileAnnotations, Multimedia, and 2 others: Move efSchemaValidate out of global scope - https://phabricator.wikimedia.org/T140908#2488235 (matmarex) Well… @marktraceur, when did you last check? ;)
[18:28:47] Analytics, Analytics-EventLogging, FileAnnotations, Multimedia, and 2 others: Move efSchemaValidate out of global scope - https://phabricator.wikimedia.org/T140908#2488362 (matmarex) ``` MatmaRex: Sorry, I'm not at my computer right now, but I last checked a week or two ago when I w...
[21:31:41] Analytics, Analytics-EventLogging, FileAnnotations, Multimedia, and 2 others: Move efSchemaValidate out of global scope - https://phabricator.wikimedia.org/T140908#2488813 (Legoktm) The library we use for extension.json validation doesn't support localized error messages. It's also only a dev-dep...
[22:08:19] hey gang! what's the preferred way for people to make inquiries about the most viewed pages (if a given article is a false positive, etc)? the mailing list? or phab report?
[22:08:32] specifically this is for the English Wikipedia Signpost
[23:09:28] musikanimal: inquires? like what are top pages?
[23:09:52] musikanimal: maybe you can give us an example of an "inquire"
[23:09:56] https://en.wikipedia.org/wiki/Wikipedia_talk:Top_25_Report#Who.27s_doing_the_analytics_now.3F
[23:10:20] they apparently used to go to Ironholds to get answers
[23:11:45] the Signpost team has it's own system of determining false positives, but sometimes it's unclear so they ask if more investigation can be done
[23:12:09] musikanimal: our team doesn't do data diving , reading and editing teams have analysts that can be contacted. see: https://wikitech.wikimedia.org/wiki/Analytics/DataRequests
[23:12:18] dah
[23:12:33] musikanimal: we provide infrastructure/access and help troubleshooting either but we do not handle data requests
[23:15:00] alrighty, so these folks? https://www.mediawiki.org/wiki/Editing/Analysis
[23:16:31] musikanimal: for editing data, but sounds like you are after pageview data which will be reading and tilman
[23:16:38] cc HaeB
[23:16:40] yeah that's why I'm confused
[23:17:45] the Signpost folks mentioned having "access" to figure this kind of stuff out, not sure what they mean or how the Ironholds was even able to draw conclusions
[23:19:26] musikanimal: ironholds worked as data analyst for discovery that probably worked with signpost folks?
[23:19:34] yeah
[23:19:56] it's only but so often the Signpost needs this kind of info
[23:21:48] musikanimal: edited our page so it is more clear where you go, added "data about readers/pageviews"
[23:22:00] but we don't store very specific reader data on a per-article basis, right? so I'm still confused how we could even draw conclusions beyond the data that is already publicly available
[23:22:11] thanks
[23:23:20] musikanimal: because for teh last 60 days we have more detailed data, it gets periodically deleted
[23:23:34] I see
[23:23:35] musikanimal: we just store it temporarily
[23:24:26] got it. So Tilman is their new man. Thank you!
[23:26:15] ah, looks like Tilman was actually the editor-in-chief for the Signpost at one point
[23:52:50] Analytics: Default date selection to currently applied date - https://phabricator.wikimedia.org/T141165#2489086 (Milimetric)
[23:54:45] Analytics: Timeseries on browser reports broken when going back 18 months - https://phabricator.wikimedia.org/T141166#2489099 (Milimetric)
[23:55:25] Analytics: Default date selection to currently applied date for browser reports - https://phabricator.wikimedia.org/T141165#2489111 (Milimetric)
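For the Top 25 Report questions above, a fair amount of false-positive triage is possible against the public Pageview API (served by the AQS cluster discussed earlier in the day) without any private data: comparing user-only with all-agents traffic for a suspicious article usually makes a spider-driven spike stand out. A sketch using the documented REST endpoints — the article title and dates are just examples:

```
ARTICLE='Example_article'        # placeholder title
START=20160701; END=20160722     # example range, YYYYMMDD

# Daily views from self-identified users vs. all agents; a large gap between
# the two suggests spider/bot traffic is inflating the raw counts.
for agent in user all-agents; do
  echo "== ${agent} =="
  curl -s "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/${agent}/${ARTICLE}/daily/${START}/${END}"
done

# The top-1000 list for a single day, which feeds reports like the Top 25:
curl -s "https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2016/07/21"
```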