[00:25:52] 10Analytics, 10MinervaNeue, 10Product-Analytics, 10Readers-Web-Backlog, and 2 others: [Spike ??hrs] Sticky header instrumentation - https://phabricator.wikimedia.org/T199157 (10Jdlrobson) A few questions (no urgency right now) > number of sections opened/closed per page @tbayer I made a few edits to htt... [06:25:47] joal: morninggg [06:26:04] I am sure that you want some excitement for today [06:26:08] so let's go with DISK WARNING - free space: /var/lib/druid 186865 MB (6% inode=99%): [06:26:16] druid100[1-3] :D [06:29:57] so the status is [06:29:58] /dev/mapper/druid1001--vg-druid 2.9T 2.5T 183G 94% /var/lib/druid [06:31:14] the segment_cache dir is 2.5T itself [06:31:45] worst offenders [06:31:46] 208G test_mediawiki_history_reduced_2018_07 [06:31:46] 897G webrequest_sampled_128 [06:31:47] 1020G mediawiki_history_beta [06:33:02] and we have [06:33:03] /etc/druid/historical/runtime.properties:13:druid.segmentCache.locations=[{"path":"/var/lib/druid/segment-cache","maxSize"\:2748779069440}] [06:34:06] so yes in theory we should be ok, druid should not cross any boundary [06:34:18] but of course alarms are firing anyway [06:35:48] ah no wait, because in /var/lib/druid there is not only the segment cache [06:35:54] I'd lower it a bit down then [06:36:14] to like 2.3/4 TB of max size [07:16:51] morning elukey :) [07:16:57] What a good start of day :) [07:17:26] elukey: test_mediawiki_history_reduced_2018_07 will be gone in minutes [07:19:02] elukey: And, obviously there is another glitch with mediawiki[-_]history[-_]beta ... [07:22:07] :) [07:22:52] And finally, I didn't notice (my BIG bad), but after we upgraded druid, all our retention rules have gone [07:23:15] ok - Let's clean that mess :) [07:32:20] ahhhh snap! [07:32:26] I didn't check either! [07:32:30] :( [07:32:36] do you need help Joseph? 
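For readers following along: the alert above fires on overall filesystem usage, while `druid.segmentCache.locations` only caps the segment cache itself, so the two can disagree when other data shares `/var/lib/druid` (as elukey notices below). A small sketch of reading the configured cap back out of the properties line quoted above (the parsing helper is illustrative, not an actual Druid tool):

```python
import json

# The druid.segmentCache.locations line quoted from
# /etc/druid/historical/runtime.properties above.
line = ('druid.segmentCache.locations='
        '[{"path":"/var/lib/druid/segment-cache","maxSize"\\:2748779069440}]')

def parse_max_size(prop_line):
    """Extract maxSize (bytes) from the segment-cache property.

    Java .properties files escape ':' as '\\:', so undo that before
    treating the value as JSON.
    """
    value = prop_line.split('=', 1)[1].replace('\\:', ':')
    return json.loads(value)[0]['maxSize']

print(parse_max_size(line) / 1024**4)  # 2.5 (TiB), matching the 2.5T used on disk
```

With roughly 200 GB of non-cache data under `/var/lib/druid`, a 2.5 TiB cache cap on a 2.9 T volume leaves essentially no headroom, which is why lowering `maxSize` (as suggested just below) makes the alert go away.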
[07:34:11] elukey: not really - We can go over it together if you want, but I'm just disabling some datasources, adding rules, and dropping data [07:34:18] ack [07:42:21] I am trying to make archiva work but I get a weird null pointer when using ldap [07:42:24] sigh [07:42:29] :S [07:57:34] elukey: I have the feeling we have an issue with overlord on druid1001 [07:58:57] Ah - maybe not ... [08:07:02] any weird log? [08:09:14] yes - error log in july, but since then looks like d1003 has the overlord master [08:30:53] elukey: I have the feeling it's better :) [08:32:46] JUST A LITTLE BIT! [08:32:47] /dev/mapper/druid1001--vg-druid 2.9T 323G 2.4T 12% /var/lib/druid [08:32:50] :D :D :D [08:32:53] ;) [08:33:02] so the retention policies right? [08:33:16] thanks a lot! [08:34:05] elukey: dropping test datasources and adding retention policies yes [08:34:47] elukey: and almost breaking everything by not remembering correct ordering in retention policy [08:37:45] ahahhaha [08:38:22] 10Quarry: Do the big Quarry migration - https://phabricator.wikimedia.org/T202588 (10zhuyifei1999) > Backup sql db and resultset folder of legacy live main instance I'm pretty sure the results live on NFS, so there isn't really a need to back this up. [08:39:06] joal: if you have time later on may I steal your java debugging brain to know what archiva is doing? [08:39:17] elukey: sure ! [08:39:23] elukey: when you wish :) [08:40:05] if you have time we could briefly bc! [08:40:17] To the caaaaave ! [10:20:21] joal: all the druid public brokers went down [10:20:31] I had to restart them, aqs was (of course) complaining [10:21:11] the only thing that I found now is that a lot of segments were dropped right before it (maybe retention policies kicking in?)
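On the retention-rule ordering near-miss mentioned above: Druid's coordinator evaluates rules top to bottom, so a catch-all `dropForever` placed before the load rules would drop every segment. A minimal sketch of the rule shape and a sanity check (the period, datasource name, and coordinator host are illustrative; this is not an actual Analytics tool):

```python
# Rules follow Druid's coordinator rule format; evaluated top to bottom.
rules = [
    {"type": "loadByPeriod", "period": "P6M",
     "tieredReplicants": {"_default_tier": 2}},
    {"type": "dropForever"},  # must come AFTER the load rules
]

def ordering_is_safe(rule_list):
    """True if at least one load rule precedes any catch-all drop."""
    for rule in rule_list:
        if rule["type"] == "dropForever":
            return False   # hit a drop before any load rule: everything goes
        if rule["type"].startswith("load"):
            return True
    return False

# Rules would then be pushed to the coordinator, roughly like
# (hypothetical host and datasource):
# requests.post("http://druid1001:8081/druid/coordinator/v1/rules/my_datasource",
#               json=rules)
```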
[10:21:21] not sure how those would have affected the brokers though [10:29:40] don't see a clear spike in requests for that time, and also nothing incredibly strange from the type of them (some are related to a very old time window, like 2001, but not sure if relevant) [10:42:23] 10Analytics, 10Analytics-Kanban, 10Operations, 10Patch-For-Review: Move internal sites hosted on thorium to ganeti instance(s) - https://phabricator.wikimedia.org/T202011 (10akosiaris) [11:09:31] archiva seems to work! [11:11:13] elukey: Thanks for having restarted the druid brokers [11:11:50] elukey: I did some cleaning on druid-public, but it shouldn't have affected the broker :( [11:12:24] elukey: maybe the cleaning was too much work for the hosts and brokers were stalled? [11:12:31] hm [11:12:38] also: \o/ for archiva !! [11:13:24] joal: I think it might have been the problem, but really weird :( [11:13:46] all right going out for lunch now, ttl! [11:13:52] elukey: this then means we should be super-gentle when cleaning druid-public ... hm [11:13:55] not nice [11:13:59] later elukey [11:16:03] a-team: family issue, team: my wife needs to care for her grandmother, so I'll care for the kids from now on - I'm off for now [11:17:44] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Upgrade Archiva (meitnerium) to Debian Stretch - https://phabricator.wikimedia.org/T192639 (10elukey) After a long battle I was able to log in on archiva1001 (via ssh tunnel) using my username, and getting the Archiva Admin right since I a...
[13:07:31] (03PS1) 10Milimetric: Fix config [analytics/limn-language-data] - 10https://gerrit.wikimedia.org/r/454809 [13:07:39] (03CR) 10Milimetric: [V: 032 C: 032] Fix config [analytics/limn-language-data] - 10https://gerrit.wikimedia.org/r/454809 (owner: 10Milimetric) [13:21:04] milimetric: o/ [13:21:37] hey elukey [13:34:58] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Upgrade Archiva (meitnerium) to Debian Stretch - https://phabricator.wikimedia.org/T192639 (10elukey) I had a very interesting chat with @Gehel and @dcausse about Archiva, and some good points were raised: * it would be great to get rid o... [13:35:29] milimetric: if you have time later on (I'll also ask ottomata), would you mind reading --^ and telling me what you think about it? [13:36:03] k, np [13:40:24] elukey: I’m not really dealing with archiva much, joal might have more opinions. I vaguely also dislike that archiva-deploy user and agree ldap would be a nice replacement [13:40:43] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10elukey) After one day: ``` elukey@stat1005:~$ grep https ipv6_after_changes.log| while read line; do endpoint=$(echo $line | cut -d" "... [13:40:49] ahhh okok! [13:45:31] addshore: o/ [13:45:49] addshore: so I am tracking down the last cron that runs on stat1005 making https calls to lists.w.o [13:45:56] \o [13:46:05] let me have a look! [13:46:30] and this guy seems to be one of the two candidates [13:46:30] 0 3 * * * time /srv/analytics-wmde/graphite/src/scripts/cron/daily.03.sh /srv/analytics-wmde/graphite/src/scripts [13:46:35] yupp [13:46:45] i thought i updated them all but let me see if i missed something! [13:46:51] thanks!!! [13:47:31] aaaah yes!!! they were indeed missed [13:47:35] ill make a patch in a sec!
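The fix being prepared here amounts to pointing the scripts' outbound curl calls at the cluster HTTP proxy via the standard `http_proxy`/`https_proxy` environment variables. A sketch of the idea (the proxy host and port are placeholders, not the actual production endpoint):

```python
import os

# Placeholder endpoint; the real host:port comes from site configuration.
WEBPROXY = "http://webproxy.example.wmnet:8080"

def proxy_env(base=None):
    """Build an environment for subprocesses (e.g. curl) so that
    HTTP(S) traffic is routed through the webproxy."""
    env = dict(os.environ if base is None else base)
    env["http_proxy"] = WEBPROXY
    env["https_proxy"] = WEBPROXY
    return env

# e.g. subprocess.run(["curl", "-s", some_url], env=proxy_env())
```

Hosts without direct internet access can then reach external endpoints like lists.wikimedia.org only through the proxy, which is what the firewall audit above was tracking down.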
[13:48:59] * elukey dances [13:49:01] yessssss [13:53:18] (03PS1) 10Addshore: Use webproxy in mailmain related scripts [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/454819 [13:53:23] elukey: ^^ thats the one :) [13:53:59] (03CR) 10Addshore: [C: 032] Use webproxy in mailmain related scripts [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/454819 (owner: 10Addshore) [13:54:06] (03PS1) 10Addshore: Use webproxy in mailmain related scripts [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/454820 [13:54:25] (03CR) 10Addshore: [C: 032] Use webproxy in mailmain related scripts [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/454820 (owner: 10Addshore) [13:54:28] (03Merged) 10jenkins-bot: Use webproxy in mailmain related scripts [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/454819 (owner: 10Addshore) [13:54:30] sorry that I missed that before, didn't look for more things setting up their own curl instances : [13:54:31] :D [13:54:33] (03Merged) 10jenkins-bot: Use webproxy in mailmain related scripts [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/454820 (owner: 10Addshore) [13:55:58] addshore: <3 [13:56:14] hope you weren't hunting around for too long :/ [13:57:06] no no there were also other crons doing https calls, these are the last ones so it is more difficult to track them down :) [13:57:10] thanks a lot! [13:57:29] no problem! [14:02:23] ottomata: o/ [14:02:28] ooo/// [14:02:33] hellooo [14:02:41] archiva + ldap works :) [14:04:12] elukey: i think we need open rsync for archiva............ yeah [14:04:13] git fat add [14:04:17] and push [14:04:19] uses rsync [14:04:24] i think... [14:04:38] that can't be right though... [14:04:47] elukey: ldap yeehawwww!
[14:05:55] yes [14:06:06] ok, you can't push via the rsync bit, that's right [14:06:10] but, when you git pull local [14:06:15] with git fat [14:06:22] it uses rsync to get the files locally [14:07:05] https://wikitech.wikimedia.org/wiki/Archiva#Deploy_artifacts_using_scap3 [14:07:08] sure sure, but does it need to be opened to the public internet? [14:07:15] or just for us? [14:07:53] what I proposed was just to restrict to domain networks [14:08:01] it does [14:08:05] for your local working copy [14:08:10] you need to be able to git pull [14:08:19] ah ok got it [14:09:22] since we don't have a VPN for our local set ups this needs to be public [14:09:39] yup [14:09:57] will comment on review ah [14:12:37] the other bit that we should discuss (whenever you have time) is how to configure archiva on archiva1001.. I asked around and https://phabricator.wikimedia.org/T192639#4526731 are some comments [14:12:41] elukey: i missed having so many phabs/gerrits to review when I wake up and start checking email! [14:12:43] glad you are back! :D [14:12:57] ahhahah sure, a good source of -1s to start the morning [14:13:56] elukey: what are repository groups? [14:14:32] are you suggesting we make new repos that mirror the remote ones exactly? [14:14:51] like have an archiva hosted 'central', 'cloudera' and 'spark' repo? [14:15:55] disclaimer: I learned today what those are, so not really an expert :) [14:16:00] but yes that is the idea [14:16:39] Gehel was proposing something similar since it might be less problematic for people that want only a subset of deps rather than the whole big thing [14:17:04] in theory if we have a repo group called "mirrored" it should do exactly what we want [14:17:34] but not sure if worth it or not, this is the first time that I check archiva [14:17:41] it is a bit obscure to me :) [14:18:54] ottomata: not necessarily a repo that mirrors the full central (since we only use a very small subset), but a proxy repo where we never upload manually.
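For context on the rsync discussion: git-fat reads its rsync endpoint from a `.gitfat` file at the repository root, which is why pulling a working copy needs rsync access to the server even though uploads go through scap3. An illustrative fragment (the host and module path are assumptions, not the actual Archiva export):

```
[rsync]
remote = archiva.example.org::archiva/git-fat
```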
[14:19:28] The current "mirror" repo is a bit of a mess, where it is impossible to know which jar comes from where [14:20:01] (03PS1) 10Milimetric: Fix config again [analytics/limn-language-data] - 10https://gerrit.wikimedia.org/r/454822 [14:20:19] (03CR) 10Milimetric: [C: 032] Fix config again [analytics/limn-language-data] - 10https://gerrit.wikimedia.org/r/454822 (owner: 10Milimetric) [14:22:46] thanks gehel :) [14:23:18] elukey: yw, always happy to throw work your way :) [14:23:46] * gehel notes again that this is a case of small cost small benefit, not a game changing thing [14:24:57] gehel: right, but a remote repo central that mirrors to a local cached repo 'central' in archiva, right? [14:25:19] anytime someone dls a dep from central via archiva it would be cached in our archiva's central 'mirror' [14:25:37] like we do now, but instead of one 'mirrored' repo, separate 'central-mirrored' repos for each configured remote repo? [14:26:27] ottomata: If I remember correctly, we don't have proxying always enabled so that a random person can't trigger a download from central (current situation). This could be kept [14:26:49] gehel: i think we do have it enabled all the time [14:27:01] we didn't used to (the docs might say that), but it was so inconvenient that we left it proxying [14:27:08] What I don't like in our current setup is the manual uploads to the "mirrored" repo. It is too error prone (we've had a few cases) [14:27:18] manual uploads to mirrored? ohhhhhh [14:27:19] right. [14:27:20] ok [14:28:14] atm, we have remote repos for central, cloudera and spark-packages [14:29:12] what is the source of the jboss-related packages? Cloudera? (curious) [14:29:33] as far as I can see, those are proxies and are specific enough. It is mostly the "mirrored" repo that I find troublesome. It is very unclear what is in there, where it comes from and why it was used. [14:30:17] elukey: honestly, the jboss was an example (because it is a well known problematic repo).
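If the per-remote proxy repos were grouped as discussed, Maven clients could keep a single URL by pointing at the repository group through a mirror entry in `settings.xml`. An illustrative fragment (the group name and URL are assumptions about the proposed layout, not a documented endpoint):

```xml
<settings>
  <mirrors>
    <mirror>
      <id>wmf-mirrored</id>
      <!-- route all external dependency lookups through the Archiva group -->
      <mirrorOf>external:*</mirrorOf>
      <url>https://archiva.wikimedia.org/repository/mirrored/</url>
    </mirror>
  </mirrors>
</settings>
```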
I think we have jboss re-packaged deps in the mirrored repo, but I have not actually checked [14:30:27] ack! [14:31:47] The "wikimedia release" repo is also unclear as to what's in there. I would expect it to contain only packages maintained and built by WMF, but it does not seem to be the case [14:32:47] basically, in all the "managed" repos that we have, only the "python" repo has a clear meaning :) [14:33:46] gehel: i also would expect releases to only contain packages by WMF [14:33:50] if it doesn't, that is weird [14:34:08] to me the 3 repos were always clear, although I get your argument about mirrored not being clear about where things come from [14:34:17] releases is for WMF versioned releases [14:34:27] snapshot is for WMF snapshot jars (which we don't really use that much) [14:34:37] and mirrored is for remote proxied dependencies [14:35:36] so we agree about the expectations ! Except for mirrored, which does not seem to be a remote proxied repo, but a repo with manual uploads [14:35:50] but the current reality seems to be far from our expectations! [14:36:24] (03PS1) 10Ottomata: Fix dependencies for turnilo 1.7.2, some weren't properly added to git last time [analytics/turnilo/deploy] - 10https://gerrit.wikimedia.org/r/454827 (https://phabricator.wikimedia.org/T202011) [14:36:38] which probably means that this isn't clear enough to all users of those repos and that they have to think too much about how to use them [14:36:38] haha [14:36:39] i guess so! [14:36:47] (03CR) 10Ottomata: [V: 032 C: 032] Fix dependencies for turnilo 1.7.2, some weren't properly added to git last time [analytics/turnilo/deploy] - 10https://gerrit.wikimedia.org/r/454827 (https://phabricator.wikimedia.org/T202011) (owner: 10Ottomata) [14:39:06] I think that we could document all this into a wikitech page? [14:39:20] but I am not clear (yet) about the next steps [14:40:03] maybe there are no next steps.
We have a mess probably because too many people had too much access to those repos [14:40:21] adding LDAP auth and more strict access would already solve some of that [14:40:35] we can think about cleanup once this is done [14:41:45] and tbh, the only way I found to keep repos clean was to have jenkins be the only one allowed to do any upload to the equivalent of snapshot and release repo. [14:42:14] 10Analytics, 10Analytics-EventLogging, 10EventBus, 10Operations, and 2 others: RFC: Modern Event Platform - Choose Schema Tech - https://phabricator.wikimedia.org/T198256 (10kchapman) Since there was broad agreement at the RFC meeting and hasn't been any objection raised since, TechCom has approved this. [14:42:32] 10Analytics, 10Analytics-EventLogging, 10EventBus, 10Operations, and 2 others: RFC: Modern Event Platform - Choose Schema Tech - https://phabricator.wikimedia.org/T198256 (10kchapman) [14:46:11] mmmm [15:00:41] elukey: hmm [15:00:45] a-team -- Still alone with kids, will miss standup - Luca knows about the druid fix, I also continued to work on XML-dumps importer [15:00:45] am i really not allowed to do this?! [15:00:46] NEW violations: [15:00:46] 14:59:59 modules/turnilo/manifests/proxy.pp:16 wmf-style: class 'turnilo::proxy' includes apache::mod::proxy_http from another module [15:00:47] i must be. [15:00:49] no? [15:03:02] is it working if you avoid the include but use class { 'etc..' : } ? 
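About the wmf-style violation above: the linter forbids a module class from including or declaring classes that live in another module; cross-module composition is expected to happen in a profile. A hedged sketch of that pattern (the profile class name is illustrative, not the actual manifest):

```puppet
# Cross-module includes belong in profiles, not in module classes.
class profile::turnilo::proxy {
    include ::apache::mod::proxy_http  # fine here; flagged inside turnilo::proxy
    include ::turnilo::proxy
}
```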
[15:03:35] ping joal [15:03:53] (he is not joining today) [15:03:55] sorry A-team running 5 min late to standup [15:04:10] authenticating on hangouts is hard [15:05:06] ho ok [15:05:07] oh ok [15:17:34] nope elukey NEW violations: [15:17:34] 15:08:57 modules/turnilo/manifests/proxy.pp:16 wmf-style: class 'turnilo::proxy' declares class apache::mod::proxy_http from another module [15:17:34] :/ [15:43:26] ping fdans joal [16:03:00] 10Analytics, 10Analytics-EventLogging, 10Research: 20K events by a single user in the span of 20 mins - https://phabricator.wikimedia.org/T202539 (10Milimetric) p:05Triage>03High [16:04:02] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Research: 20K events by a single user in the span of 20 mins - https://phabricator.wikimedia.org/T202539 (10Milimetric) p:05High>03Normal a:03Nuria [16:04:33] 10Analytics, 10Research: Automate XML-to-parquet transformation for XML dumps (oozie job) - https://phabricator.wikimedia.org/T202490 (10Milimetric) p:05Triage>03High [16:04:43] 10Analytics, 10Analytics-Kanban, 10Research: Copy monthly XML files from public-dumps to HDFS - https://phabricator.wikimedia.org/T202489 (10Milimetric) p:05Triage>03High [16:05:32] 10Analytics: Transform EventLoggingToDruid job to read schemas to ingest from a whitelist and process them all - https://phabricator.wikimedia.org/T202312 (10Milimetric) p:05Triage>03Normal [16:05:56] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review: [EL sanitization] Make cron send alert emails if job fails before calling refine - https://phabricator.wikimedia.org/T202429 (10Milimetric) p:05Normal>03High [16:08:37] 10Analytics, 10Analytics-Data-Quality, 10Contributors-Analysis, 10Product-Analytics: No recent data in the Edit event log in the Data Lake - https://phabricator.wikimedia.org/T202348 (10Milimetric) @Neil_P._Quinn_WMF so we moved this to radar because we can't ingest the events as they are right now. So th... 
[16:12:54] a-team: my internet is almost kaput [16:49:30] * elukey off! [16:49:33] (03PS4) 10Fdans: Add druid snapshot deletion script [analytics/refinery] - 10https://gerrit.wikimedia.org/r/448551 (https://phabricator.wikimedia.org/T197889) [17:09:14] (03PS5) 10Fdans: Add druid snapshot deletion script [analytics/refinery] - 10https://gerrit.wikimedia.org/r/448551 (https://phabricator.wikimedia.org/T197889) [17:32:47] elukey: can you think of why i might not be able to authenticate via ldap with the new vhost on analytics-tool1002? [17:33:19] mod auth is logging [17:33:21] AH01617: user Ottomata: authentication failure for "/": Password Mismatch [17:33:25] but it works fine on thorium [17:33:45] is there maybe some firewall rule that is keeping analytics-tool1002 from reaching ldap? [17:39:05] AHHHHHHHH MY PASSWORD CHANGED RECENTLY [17:39:07] IGNORE ME [17:39:08] wow [17:39:21] cool it works [17:56:35] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Upgrade Archiva (meitnerium) to Debian Stretch - https://phabricator.wikimedia.org/T192639 (10Smalyshev) Sounds good, though about the authentication, I have a concern. In order to deploy to archiva (at least currently), I have to store us... [18:06:43] (03CR) 10Ottomata: [V: 032 C: 032] Upgrade superset to 0.26.3 for Stretch; prep for deployment to analytics-tool1003 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/454606 (https://phabricator.wikimedia.org/T201430) (owner: 10Ottomata) [18:55:09] 10Analytics, 10Discovery-Analysis, 10Product-Analytics, 10Reading-analysis, 10Patch-For-Review: Productionize per-country daily & monthly active app user stats - https://phabricator.wikimedia.org/T186828 (10chelsyx) > It is irrelevant only if one doesn't care about the definition of "unique app user" ;)... 
[20:18:48] 10Analytics, 10MinervaNeue, 10Product-Analytics, 10Readers-Web-Backlog, and 2 others: [Spike ??hrs] Sticky header instrumentation - https://phabricator.wikimedia.org/T199157 (10nettrom_WMF) a:03Tbayer [20:40:50] 10Analytics: Calculate precisely number of unqiue users for IOS and Android in a privacy Conscious manner that does not require opt in - https://phabricator.wikimedia.org/T202664 (10Nuria) [20:41:19] 10Analytics: Calculate precisely number of unqiue users for IOS and Android in a privacy conscious manner that does not require opt in - https://phabricator.wikimedia.org/T202664 (10Nuria) [20:43:22] 10Analytics: Calculate precisely number of unqiue users for IOS and Android in a privacy conscious manner that does not require opt in to send data - https://phabricator.wikimedia.org/T202664 (10Nuria) [20:44:15] 10Analytics: Calculate precisely number of unqiue users for IOS and Android in a privacy conscious manner that does not require opt in to send data - https://phabricator.wikimedia.org/T202664 (10Nuria) [20:44:38] 10Analytics, 10Discovery-Analysis, 10Product-Analytics, 10Reading-analysis, 10Patch-For-Review: Productionize per-country daily & monthly active app user stats - https://phabricator.wikimedia.org/T186828 (10Nuria) >For the old tables wmf.mobile_apps_uniques_daily and wmf.mobile_apps_uniques_monthly, whic... [20:51:32] 10Analytics, 10Analytics-Kanban, 10Operations, 10Patch-For-Review: Move internal sites hosted on thorium to ganeti instance(s) - https://phabricator.wikimedia.org/T202011 (10Ottomata) Oo, just ran into a problem with Hue that I need a little help with from some debian packaging pros. Let's try @MoritzMueh... [20:54:05] 10Analytics, 10Analytics-Kanban, 10Operations, 10Patch-For-Review: Move internal sites hosted on thorium to ganeti instance(s) - https://phabricator.wikimedia.org/T202011 (10Ottomata) @Dzahn suggests: > i think.. the proper way would be to unpack the hue-common packags, find the "Depends" line in the cont... 
[21:01:07] 10Analytics, 10Services (blocked): Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (10Pchelolo) We're finally implementing the feature it's needed for, so could you please prioritize this? [21:49:40] 10Analytics, 10Services (blocked): Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (10Nuria) @Pchelolo Since the accept header is of little interest (i think, please correct me if I am wrong) for other than this question can't we extract this temporary data from a varnish dump... [21:52:38] 10Analytics, 10Services (blocked): Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (10Pchelolo) @Nuria I wasn't aware of that option. Basically, we will need to do this several times over the course of next months. If it's easy enough to do - I'm all for it. We would need qui... [22:18:34] 10Analytics, 10Analytics-EventLogging, 10MediaWiki-extensions-WikimediaEvents, 10Page-Issue-Warnings, and 6 others: Provide standard/reproducible way to access a PageToken - https://phabricator.wikimedia.org/T201124 (10Jdlrobson) a:05Nuria>03Tbayer Skipping sign off. QA will be handled within T191532.... [22:45:32] 10Analytics, 10Services (blocked): Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (10Nuria) pinging @Ottomata and @elukey to make sure they are ok with idea [22:54:07] 10Analytics, 10Analytics-Data-Quality, 10Contributors-Analysis, 10Product-Analytics: Resume refinement of edit events in Data Lake - https://phabricator.wikimedia.org/T202348 (10Neil_P._Quinn_WMF) [22:57:41] 10Analytics, 10Analytics-Data-Quality, 10Contributors-Analysis, 10Product-Analytics: Resume refinement of edit events in Data Lake - https://phabricator.wikimedia.org/T202348 (10Neil_P._Quinn_WMF) >>! In T202348#4527255, @Milimetric wrote: > @Neil_P._Quinn_WMF so we moved this to radar because we can't ing...