[00:01:00] HaeB: mailx is called heirloom-mailx on debian
[00:01:29] HaeB: echo "hola" | heirloom-mailx -s "hola" some@some.com
[00:01:34] tbayer@stat1006:~$ mailx --version
[00:01:34] mailx (GNU Mailutils) 3.1.1
[00:01:34] Copyright (C) 2007-2016 Free Software Foundation, Inc.
[00:01:34] License GPLv3+: GNU GPL version 3 or later
[00:01:34] This is free software: you are free to change and redistribute it.
[00:01:34] There is NO WARRANTY, to the extent permitted by law.
[00:02:18] tbayer@stat1006:~$ heirloom-mailx
[00:02:18] -bash: heirloom-mailx: command not found
[00:03:00] tbayer@stat1006:~$ echo "hola" | heirloom-mailx -s "hola" tbayer@wikimedia.org
[00:03:00] HaeB: ah sorry i was on 1005
[00:03:00] -bash: heirloom-mailx: command not found
[00:03:24] ok good to know for stat1005 too ;)
[00:06:59] HaeB: ah, i see mailx on 1006 too, you can use it like: echo "hola" | mailx -s "hola" some@some.com
[00:07:27] HaeB: but not sure about mailbox, i imagine otto will need to look into that
[00:13:07] ok thanks, that works and is actually all i need (it is in fact how i have been using it all along in scripts on stat1003, but i didn't expect it to work here after "mailx" by itself caused that error)
[00:30:30] (CR) Nuria: "I was finally able to test this with all tags available and i think it works pretty well. If tags are ["wikidata-query","sparql"] you can" [analytics/refinery] - https://gerrit.wikimedia.org/r/367940 (https://phabricator.wikimedia.org/T171760) (owner: Nuria)
[00:32:43] Analytics-Kanban, Patch-For-Review, User-Elukey: Wrong JVM heap size set for Hive* daemons - https://phabricator.wikimedia.org/T172107#3491708 (Nuria) Open>Resolved
[00:32:58] Analytics-Cluster, Analytics-Kanban: Provision new Kafka cluster(s) with security features - https://phabricator.wikimedia.org/T152015#3491713 (Nuria)
[00:33:00] Analytics-Cluster, Analytics-Kanban, User-Elukey: Perf test RAID vs JBOD with new hardware and kafka versions - https://phabricator.wikimedia.org/T168538#3491711 (Nuria) Open>Resolved
[00:33:12] Analytics-Kanban, User-Elukey: Upgrade AQS to node 6.11 - https://phabricator.wikimedia.org/T170790#3491714 (Nuria) Open>Resolved
[00:33:27] Analytics-Kanban, Patch-For-Review: Mediawiki History Druid indexing failed - https://phabricator.wikimedia.org/T170493#3491718 (Nuria) Open>Resolved
[00:33:37] Analytics-Kanban, Patch-For-Review: Use hive dynamic partitioning to split webrequest on tags - https://phabricator.wikimedia.org/T164020#3491721 (Nuria)
[00:33:40] Analytics-Kanban, Patch-For-Review: Create tagging udf - https://phabricator.wikimedia.org/T164021#3491720 (Nuria) Open>Resolved
[00:33:50] Analytics-Dashiki, Analytics-Kanban, MW-1.30-release-notes (WMF-deploy-2017-07-25_(1.30.0-wmf.11)), Patch-For-Review, Wikimedia-log-errors: Warning: JsonConfig: Invalid $wgJsonConfigModels['JsonConfig.Dashiki'] array value, 'class' not found - https://phabricator.wikimedia.org/T166335#3491722 (1...
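A minimal sketch of the mailx pattern worked out above at [00:06:59] (pipe the body on stdin, set the subject with -s), as it might be used from a report script on one of the stat hosts. The report command, file, and recipient below are placeholders, not anything taken from this log.

    #!/bin/bash
    # Generate a report, then mail it with a subject line; the message body is read
    # from stdin, exactly like: echo "hola" | mailx -s "hola" some@some.com
    report=$(mktemp)
    some_report_command > "$report"    # placeholder for whatever produces the report
    mailx -s "Daily report $(date +%F)" someone@example.org < "$report"
    rm -f "$report"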
[00:34:03] Analytics-Cluster, Analytics-Kanban, Operations, Patch-For-Review: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3491724 (Nuria) Open>Resolved
[00:34:12] Analytics-Cluster, Analytics-Kanban, Operations: rack/setup/install replacement stat1006 (stat1003 replacement) - https://phabricator.wikimedia.org/T165366#3491727 (Nuria) Open>Resolved
[00:34:28] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Eventlogging: Ensure that `meta.request_id`'s auto MySQL index is not unique, while `meta.id` should be - https://phabricator.wikimedia.org/T171489#3491731 (Nuria) Open>Resolved
[00:34:49] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Replacement of stat1002 and stat1003 - https://phabricator.wikimedia.org/T152712#3491738 (Nuria)
[00:34:51] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Move statistics::discovery jobs from stat1002 -> stat1005 - https://phabricator.wikimedia.org/T170471#3491737 (Nuria) Open>Resolved
[00:35:00] Analytics-Kanban: Practice with photorec - https://phabricator.wikimedia.org/T171972#3491739 (Nuria) Open>Resolved
[00:35:12] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Replacement of stat1002 and stat1003 - https://phabricator.wikimedia.org/T152712#3430411 (Nuria)
[00:35:14] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Unmount /a partition from 1002 - https://phabricator.wikimedia.org/T171373#3491740 (Nuria) Open>Resolved
[00:36:13] Analytics-Kanban, Patch-For-Review: Create purging script for mediawiki-history data - https://phabricator.wikimedia.org/T162034#3491743 (Nuria) We can merge this change as soon as we have tested it; just noting here that we need the companion puppet change for it to be effective.
[00:44:24] Analytics-Kanban, Analytics-Wikistats, Continuous-Integration-Infrastructure, Release-Engineering-Team (Kanban): Fix Wikistats build in Jenkins - https://phabricator.wikimedia.org/T171599#3491749 (Nuria) mmm.. i think the tests might need to initialize semantic
[01:43:32] Quarry: Gigantic query results cause a SIGKILL and the query status does not update - https://phabricator.wikimedia.org/T172086#3491791 (zhuyifei1999) The current approach of storing the query status in the same process as the query results isn't going to work. SIGKILL cannot be caught, so only the celery master proce...
[09:51:09] Analytics-Kanban, User-Elukey: Archive PageContentSaveComplete in hdfs while we continue collecting data - https://phabricator.wikimedia.org/T170720#3440380 (elukey) It would be really great to progress this task asap so we'll free a ton of space on analytics-store :)
[10:56:59] * elukey lunch!
[12:46:09] FYI just suspended webrequest load to allow hive tasks to drain (need to restart the hive daemons again, last JVM change)
[13:05:49] (CR) Ottomata: [C: 2] Adding new wiki to whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/369573 (owner: Nuria)
[13:05:51] (CR) Ottomata: [V: 2 C: 2] Adding new wiki to whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/369573 (owner: Nuria)
[13:09:25] (PS3) Zhuyifei1999: Remember recent queries filter last used by a user. [analytics/quarry/web] - https://gerrit.wikimedia.org/r/176506 (https://phabricator.wikimedia.org/T76084) (owner: Rtnpro)
[13:09:42] (CR) jerkins-bot: [V: -1] Remember recent queries filter last used by a user.
[analytics/quarry/web] - https://gerrit.wikimedia.org/r/176506 (https://phabricator.wikimedia.org/T76084) (owner: Rtnpro)
[13:20:46] elukey: hiiiii wanna start ops sync early and talk about kafka stuff?
[13:30:16] ottomata: hiiiii! Just produced some messages from Varnishkafka to Kafka via SSL in labs \o/
[13:30:34] there is some ACL weirdness that I need to fix
[13:30:45] but for the moment everything works
[13:30:50] the patch is super straightforward
[13:30:54] only a change to config.c
[13:31:25] NIIIICE!
[13:31:32] that's awesome
[13:31:55] ok for ops sync, whenever you prefer!
[13:32:01] lets do it, got thoughts... :)
[13:32:18] am in bc
[13:33:00] joining
[14:16:16] a-team fyi luca and I are going to try to upgrade druid again
[14:16:17] we have an idea
[14:21:56] ugh, what does that error message mean? (aborting a query in beeline on stat1004):
[14:21:59] Unknown HS2 problem when communicating with Thrift server.
[14:21:59] Error: org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe (Write failed) (state=08S01,code=0)
[14:22:19] !log suspending druid oozie jobs
[14:22:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:22:38] HaeB: my fault sorry, just had to restart hive-server :(
[14:22:52] last jvm setting I promise, will not do it again.. I didn't see your query in yarn
[14:23:28] two actually :/ (another on stat1005)
[14:23:44] HaeB: sorry :(
[14:23:59] !log restart hive-server to pick up JVM Xms4g change
[14:24:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:24:57] elukey: ok, thanks for the explanation! no worries, will restart them (fortunately each should be less than 1h)
[14:25:57] Analytics, Analytics-Cluster, Operations, ops-eqiad, Patch-For-Review: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3493504 (Ottomata) Alright! Luca and I have tested some things, and discussed this migration a little more. We're going t...
[14:34:10] !log beginning druid upgrade to 0.9.2 (take 2 :) )
[14:34:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:47:54] Analytics, Analytics-Cluster, Operations, ops-eqiad, Patch-For-Review: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3493568 (elukey) Note about disk config: we are going for a 12 disk RAID10 partition plus a raid1/10 root one. I can work o...
[14:55:04] milimetric: I think we did it!
[14:56:07] what?! but you said it was impossible
[14:56:13] you did the impossible!!
[15:03:32] Analytics-Kanban, Patch-For-Review: Upgrade Druid to 0.9.2 as a temporary measure - https://phabricator.wikimedia.org/T170590#3493616 (Ottomata) ! I think we did it. Yesterday I was juggling just too much stuff to realize what I had done wrong. https://gerrit.wikimedia.org/r/#/c/355469/ had not yet be...
[15:48:36] milimetric, ping :]
[15:49:09] hey
[15:49:10] cave
[15:49:30] y
[16:17:57] (PS7) Milimetric: Implement Wikistats metrics as Druid queries [analytics/refinery] - https://gerrit.wikimedia.org/r/365806 (https://phabricator.wikimedia.org/T170882)
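Related to the aborted beeline queries at [14:21:56] above: once HiveServer2 is back after a restart, a saved query can be re-run non-interactively with beeline, which covers the "will restart them" case. A hedged sketch only; the JDBC host/port and the .hql file name are placeholders, not the actual analytics endpoint.

    # Re-run a saved HiveQL query after the hive-server restart (host and file are placeholders)
    beeline -u 'jdbc:hive2://hiveserver.example.org:10000/default' -f my_query.hql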
[16:23:49] Analytics, Analytics-Cluster, Operations, ops-eqiad, Patch-For-Review: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3493822 (RobH) @elukey: I'm happy to help with partman, but I want to confirm: These systems have dual 1TB OS disks, which...
[16:26:30] Analytics, Analytics-Cluster, Operations, ops-eqiad, Patch-For-Review: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3493832 (Ottomata) +1 @RobH
[16:27:43] Analytics, Analytics-Cluster, Operations, ops-eqiad, Patch-For-Review: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3493841 (elukey) It seems good to me; for some reason I was under the impression that we preferred sw raid vs hw controlled...
[16:30:31] Analytics, Analytics-Cluster, Operations, ops-eqiad, Patch-For-Review: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3493852 (Ottomata) I'm not sure of a reason to prefer software raid other than ease of management. Likely performance is b...
[16:33:25] Analytics, Android-app-feature-Compilations, Wikipedia-Android-App-Backlog, Reading-Infrastructure-Team-Backlog (Kanban): Determine how to gather top-viewed article lists for use in generating ZIM files - https://phabricator.wikimedia.org/T172296#3493862 (Mholloway)
[16:41:50] ottomata: just tested your suggestion for vk, it works with kafka. :)
[16:42:02] I'll update the code review to just include the new parameters
[16:42:05] so no code change!
[16:42:25] great!
[16:44:17] just tested in labs
[16:46:18] ottomata: https://gerrit.wikimedia.org/r/#/c/369689
[16:46:52] thanks for the review! For some reason I thought that it was better to have a different namespace but this is definitely simpler :)
[16:47:12] if you are ok I'll merge the change to the docs
[16:50:01] Analytics, Analytics-Cluster, Operations, ops-eqiad, Patch-For-Review: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3493916 (RobH) @elukey: So we do prefer sw raid over hw raid when purchasing servers. However, servers in this particular...
[16:50:07] Analytics, Operations, Traffic, Patch-For-Review, User-Elukey: Update Varnishkafka to support TLS encryption/authentication - https://phabricator.wikimedia.org/T165736#3493917 (elukey) After a chat with @Ottomata we realized that applying the correct namespace (`'kafka.'` prefix to all the li...
[16:50:26] Analytics-Kanban, Operations, Traffic, User-Elukey: Update Varnishkafka to support TLS encryption/authentication - https://phabricator.wikimedia.org/T165736#3493918 (elukey)
[16:51:58] +1 elukey
[16:52:17] \o/
[16:52:23] much better than we thought
[16:54:09] mmm better to move the SSL section in conf.example
[16:54:14] looks weird there
[16:58:58] done :)
[16:59:03] ooook going offline people!
[16:59:04] byyyeeee
[16:59:07] * elukey afk!
[17:15:30] ok laters!
[17:30:00] Analytics, Analytics-Wikistats: Broken in Firefox 50 - https://phabricator.wikimedia.org/T172304#3494127 (intracer)
[17:33:16] Analytics, Analytics-Wikistats: WiViVi Broken in Firefox 50 - https://phabricator.wikimedia.org/T172304#3494145 (intracer)
[18:06:45] (PS8) Milimetric: Implement Wikistats metrics as Druid queries [analytics/refinery] - https://gerrit.wikimedia.org/r/365806 (https://phabricator.wikimedia.org/T170882)
[18:14:04] milimetric: does it work? ^
[18:32:34] Analytics, Analytics-Wikistats: WiViVi Broken in Firefox 50 (Linux only) - https://phabricator.wikimedia.org/T172304#3494431 (Erik_Zachte)
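A sketch of the varnishkafka TLS setup discussed between [16:41:50] and [16:50:26] above: because varnishkafka already forwards any "kafka."-prefixed property straight to librdkafka, enabling SSL is a pure configuration change, with no patch to config.c needed. The property names are standard librdkafka SSL settings; the broker name, port, and certificate paths below are placeholders, not the values used in labs.

    # varnishkafka.conf excerpt (illustrative values only)
    kafka.metadata.broker.list     = broker1001.example.net:9093
    kafka.security.protocol        = SSL
    kafka.ssl.ca.location          = /etc/ssl/certs/kafka-ca.pem
    kafka.ssl.certificate.location = /etc/varnishkafka/ssl/client.certificate.pem
    kafka.ssl.key.location         = /etc/varnishkafka/ssl/client.key.pem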
[18:43:25] ottomata: yes, the monthly grouping works and it's much faster, at least 5 times faster.
[18:43:39] awesooome!
[18:43:42] but this one specific thing doesn't work still
[18:43:44] glad we figured that out then
[18:43:45] oh?
[18:44:16] yeah, it's ok, we just have to design some pre-computation around it, worst case. And we asked on their forum
[18:44:30] it's supposed to work, but maybe just in latest
[18:50:53] mforns: should we merge the script that drops mediawiki partitions, or does it still need more testing?
[18:51:08] nuria_, I haven't retested it yet
[18:51:16] mforns: ok, you let me know
[18:51:20] k
[18:59:18] mforns: (re your note at https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Data_retention_and_auto-purging#Black-listed_schemas ) - we will need to increase the event rate of the popups schema in 2 weeks from now
[18:59:29] it currently sends about 5/second on average https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?var-schema=Popups&refresh=5m&orgId=1
[18:59:46] we would aim to stay below 100
[19:16:04] HaeB: so you know, at that rate it is likely mysql tables will grow too fast for the datastore, so the backend will need to be hadoop cc mforns for confirmation (especially since disk doesn't have much space on one of the hosts, we are trying to free space as we speak)
[19:16:38] yes we would really like to avoid it being blacklisted
[19:16:43] that would create a lot of extra work
[19:17:21] (having to query it in hadoop)
[19:19:25] HaeB: would it be less of a pain if tables and partitions automatically showed up in Hive like webrequest does?
[19:20:33] ottomata: yes, that was the first stumbling point when i tried out the instructions for that last year (with mforns)
[19:21:36] it would still mean a totally new workflow, so i don't know if that would be worth it for this case (popups - i.e. redoing everything we did for lower rates on mysql)
[19:21:39] HaeB, I see
[19:21:56] HaeB, is the change going to be permanent or temporary?
[19:22:14] ..but in the long run that would of course be a valuable option and (i hope) also make querying large tables easier
[19:22:41] mforns: temporary - still figuring out the details with olliv, but it might be two weeks or such
[19:22:49] HaeB: https://gerrit.wikimedia.org/r/#/c/346291/ is basically done
[19:22:53] just need to find time to productionize
[19:24:07] HaeB, after 2 weeks the schema would recede back to 5 evts/sec?
[19:24:18] or switched off, yes
[19:24:35] again, need to confirm the details with olliv
[19:25:55] .. ottomata: looks cool and i'm still eager to try this out sometime, but for this case it would mean rewriting all our existing queries for that schema, etc
[19:29:00] mforns: if you think it should work for 2 weeks, we could take that as a first assumption, and CC you on the phabricator task where the sampling rate and experiment length will be determined in detail (so you could still raise concerns before it goes live, if needed)
[19:29:40] HaeB, 100 evts/sec for 2 weeks will be around 120M events
[19:30:04] we've had such experiments recently with the QuickSurveyInitiations schema
[19:30:27] that has around 145M events and 61 GB
[19:31:49] so it's more about the total event number/disk space rather than about event rate/data intake speed?
[19:32:14] HaeB, but that schema is a lot leaner, it has 5 fields, compared to 19 fields in the popups schema
[19:32:33] HaeB, yes, it's more about the space
[19:33:30] I'd guess that 2 weeks of Popups at 100 evt/sec would end up being around ~240GB
[19:36:01] (i know it's nobody's fault here, but i still don't understand why ops didn't plan for more disk space when ordering the new server)
[19:37:19] HaeB, would it be possible to reduce the sampling rate a bit?
[19:38:13] maybe, still need to figure out the details. by how much?
[19:39:21] (FWIW, we did just recently make a mistake with that schema in setting the rate too low... the web team actually had to postpone a deployment because of this, to wait until we got enough data after increasing it)
[19:40:37] Analytics-Cluster, Analytics-Kanban, Operations, ops-eqiad: rack/setup/install druid100[456].eqiad.wmnet - https://phabricator.wikimedia.org/T171626#3494736 (Cmjohnson)
[19:43:39] HaeB, not sure, we should check with elukey who has been working hard to free up EL databases' disks the last couple of weeks
[19:45:54] HaeB, also, is that urgent? We are working on archiving the PageContentSaveComplete schema, which is a really big one. This will free up some space that could compensate for Popups
[19:46:59] oh no, like i said, it would happen in 2 weeks from now (for 2 weeks), around aug 14
[19:47:20] HaeB, ok, I think PageContentSaveComplete will be archived by then
[19:48:27] HaeB, I can create a task so we can sync-up with elukey tomorrow (if you didn't create it already)
[19:48:37] and we can discuss the sampling rate
[19:49:03] mforns: ok, that works too
[19:49:33] CC @ovasileva as well
[19:49:38] k
[19:58:00] Analytics-Kanban: Calculate how much Popups events EL databases can host - https://phabricator.wikimedia.org/T172322#3494799 (mforns)
[20:04:32] Analytics-Kanban: Calculate how much Popups events EL databases can host - https://phabricator.wikimedia.org/T172322#3494843 (ovasileva) In terms of dates, we are planning on running the test from 8/14 to 8/28
[20:04:37] HaeB: FYI, these are still the old servers
[20:04:39] afaik
[20:05:13] i think the hold up is an industry shortage of SSDs
[20:05:27] (that's holding up other hw orders, not 100% sure about this one)
[20:17:56] Analytics-Kanban: Calculate how much Popups events EL databases can host - https://phabricator.wikimedia.org/T172322#3494882 (mforns) My initial impression is that 100 evt/sec (240GB) would cancel out a big part of the efforts to free up disk space in EL databases. Those efforts have been undertaken last weeks t...
[20:19:42] Analytics-Kanban: Calculate how much Popups events EL databases can host - https://phabricator.wikimedia.org/T172322#3494898 (mforns)
[21:20:51] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Update puppet for new Kafka cluster and version - https://phabricator.wikimedia.org/T166162#3495109 (Ottomata) Allriiight! Doing good! I've applied role::kafka::jumbo::broker in labs to 3 nodes (kafka4-*).
[21:21:01] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Update puppet for new Kafka cluster and version - https://phabricator.wikimedia.org/T166162#3495110 (Ottomata)
[21:22:48] Analytics, Analytics-Cluster, Operations, ops-eqiad, Patch-For-Review: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3495117 (Ottomata) FYI: These nodes should be installed with Debian Stretch.
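A back-of-the-envelope version of the sizing discussed between [19:29:40] and [19:33:30] above, assuming the per-row size scales roughly with the field count — a crude assumption, which is why it lands in the same ballpark as, rather than exactly at, the ~240GB guess:

    # 100 events/sec over the 2-week test window
    events=$((100 * 86400 * 14))      # ~121 million events
    # Scale from QuickSurveyInitiations (145M rows, ~61 GB, 5 fields) to Popups (19 fields)
    awk -v n="$events" 'BEGIN {
        gb_per_row = 61 / 145e6       # baseline size per row, in GB
        width      = 19 / 5           # naive field-count scaling
        printf "%d events, roughly %.0f GB\n", n, n * gb_per_row * width
    }'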
[21:57:13] Analytics, Android-app-feature-Compilations, Wikipedia-Android-App-Backlog, Reading-Infrastructure-Team-Backlog (Kanban): Determine how to gather top-viewed article lists for use in generating ZIM files - https://phabricator.wikimedia.org/T172296#3495225 (Mholloway)
[22:13:29] Analytics-Kanban: Calculate how much Popups events EL databases can host - https://phabricator.wikimedia.org/T172322#3495281 (Nuria) Analytics slave: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=dbstore1002&var-network=bond0&panelId=17&fullscreen&from=149938796...
[22:18:34] Analytics, Analytics-Wikistats: WiViVi: Per-person should account for connected percentage? - https://phabricator.wikimedia.org/T172335#3495304 (Krinkle)
[22:19:14] Analytics, Analytics-Wikistats: WiViVi: Per-person should account for connected percentage? - https://phabricator.wikimedia.org/T172335#3495318 (Krinkle)
[22:58:13] (PS6) EBernhardson: UDF for extracting primary full text search request [analytics/refinery/source] - https://gerrit.wikimedia.org/r/327855 (https://phabricator.wikimedia.org/T162054)
[22:58:20] (CR) EBernhardson: "rebased and addressed feedback." (5 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/327855 (https://phabricator.wikimedia.org/T162054) (owner: EBernhardson)