[00:51:27] Analytics, Analytics-EventLogging: Allow (almost?) all EventLogging events to go into MySQL in beta - https://phabricator.wikimedia.org/T208359 (MMiller_WMF) Thanks for the links, @Nuria. I should have been more clear -- it is not //strictly// blocking our ability to test (since our engineers can look a...
[06:13:56] [VOTE] Release Bigtop version 1.3.0
[06:13:59] \o/
[07:49:23] Analytics, Analytics-EventLogging, Operations, Performance-Team, Traffic: Increase EventLogging limit from 2K to 5K - https://phabricator.wikimedia.org/T208282 (ema) p:Triage>Normal
[07:53:40] Morning elukey :)
[07:53:45] o/
[07:54:11] elukey: Have you seen the alert on GeoIP?
[07:55:05] joal: yep I did
[07:55:21] it makes sense, it tries to make an hardlink across two partitions, no bueno
[07:55:36] I didn't think about it
[07:55:41] elukey: I assume you have a better understanding than me of why a "cross-device link" is forbidden
[07:56:06] yeah the hardlinks are doable only within the same partition
[07:56:40] now we are trying to do an hardlink between /usr/share/GeoIP (root part) to /srv/geoip/archive (srv part)
[07:56:48] and the kernel complains
[07:56:58] it is right, the operator (me) was sloppy :)
[07:58:33] too much memcached I assume ;)
[08:00:56] nono I simply always forget about this limitation
[08:00:58] :(
[08:01:11] so now I am wondering what it is the best option
[08:01:49] elukey: I assume you use hardlink to prevent deleting in /usr impact /sv ?
[08:02:02] s|/sv|/srv
[08:02:27] joal: I didn't create the script, it was Andrew/Fran's
[08:02:48] I simply moved the archive dir to /srv to prevent puppet to run for minutes
[08:02:57] hm - I'm no link specialist, but why hard we we might be soft?
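The "cross-device link" failure described above can be illustrated with a short Python sketch. It assumes nothing about the actual GeoIP archive script: the function name `link_or_copy` and its `link` parameter are hypothetical, and falling back to a copy is only one possible reaction, not the fix the channel settled on. The one firm fact it relies on is that a hard link attempted across two filesystems fails with `errno.EXDEV`.

```python
import errno
import os
import shutil


def link_or_copy(src, dst, link=os.link):
    """Hard-link src to dst, copying instead if they sit on different filesystems.

    `link` is injectable only so the cross-device case can be simulated
    without two real partitions (hypothetical helper, not from the GeoIP script).
    """
    try:
        link(src, dst)
        return "linked"
    except OSError as exc:
        # EXDEV is the errno behind "Invalid cross-device link".
        if exc.errno != errno.EXDEV:
            raise
        # A copy uses extra disk space, which is exactly what hard links avoid.
        shutil.copy2(src, dst)
        return "copied"
```

Calling `link_or_copy("/usr/share/GeoIP/x", "/srv/geoip/archive/x")` with those two directories on separate partitions would be expected to take the copy branch; on a single partition it would create a second name for the same inode.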
[08:03:34] I think that it allowed a simpler script
[08:04:00] Ah ok
[08:04:20] in theory if you have A->inode-x and B->inode-x (hardlinks), when you delete A then another link is active and inode-x is not deleted
[08:04:43] so from what I recall from the script, it moves the current dir to archive, and hardlinks it
[08:04:48] something like that
[08:05:02] so we could move /usr/share/GeoIP to /srv/
[08:05:20] or, since the script backs up to HDFS, we could get rid of the "local" copies
[08:06:03] makes sense - hdfs as long term storage is hopefully enough
[08:13:53] so joal, https://gerrit.wikimedia.org/r/470778 might be enough to solve the issue
[08:14:05] but I need to ask to Andrew/Fran if I am missing something
[08:15:17] ok elukey :)
[08:18:36] also no stalls registered for org.apache.hadoop.util.JvmPauseMonitor
[08:18:47] too curious now :D
[08:22:04] Analytics-Kanban: Fix refinery-source jenkins build/release jobs - https://phabricator.wikimedia.org/T208377 (JAllemandou)
[08:22:21] so changes for oozie
[08:22:21] Version table:
[08:22:21] 4.1.0+cdh5.15.1+492-1.cdh5.15.1.p0.4~jessie-cdh5.15.1 500
[08:22:21] 500 http://archive.cloudera.com/cdh5/debian/jessie/amd64/cdh jessie-cdh5/contrib amd64 Packages
[08:22:24] *** 4.1.0+cdh5.10.0+389-1.cdh5.10.0.p0.71~jessie-cdh5.10.0 100
[08:22:25] Analytics-Kanban: Fix refinery-source jenkins build/release jobs - https://phabricator.wikimedia.org/T208377 (JAllemandou) a:JAllemandou
[08:22:27] 100 /var/lib/dpkg/status
[08:22:35] I suppose from the horrible numbering that we stay at 4.1
[08:23:28] I would assume so elukey - But numbre is indeed not nice
[08:23:40] elukey: T208377also
[08:23:45] elukey: T208377 alsosorry
[08:23:46] T208377: Fix refinery-source jenkins build/release jobs - https://phabricator.wikimedia.org/T208377
[08:24:27] yeah
[08:24:43] and we also have to upgrade the jvm soon
[08:26:12] elukey: right
[08:26:49] so it might be probably wise to decouple jvm upgrade and cluster upgrade
[08:32:55] elukey: could be
[08:33:31] elukey: the problem we are experiencing on jenkins is due to surefire, but other libs could also embed the same issue
[08:33:42] yes this is my fear
[08:33:48] elukey: would be interesting for that jvm bump to test in labs first maybe?
[08:34:17] it surely is a good option, but we need to be sure that everything is tested
[08:35:03] elukey: right
[08:42:50] ahhh upgrade-1.1.0-to-1.1.0-cdh5.12.0.mysql.sql
[08:42:56] so hive needs a mysql upgrade
[08:43:21] hm
[08:43:51] 1.1.0 to 1.1.0 needs a SQL upgrade ... maaaaaaaaan! This is good version naming!
[08:44:16] ALTER TABLE VERSION ADD COLUMN SCHEMA_VERSION_V2 VARCHAR(255);
[08:44:16] UPDATE VERSION SET SCHEMA_VERSION='1.1.0', VERSION_COMMENT='Hive release version 1.1.0', SCHEMA_VERSION_V2='1.1.0-cdh5.12.0' where VER_ID=1;
[08:44:19] SELECT 'Finished upgrading MetaStore schema from 1.1.0 to 1.1.0-cdh5.12.0' AS ' ';
[08:44:28] -.-
[08:44:43] ah it does SOURCE 041-HIVE-16556.mysql.sql;
[08:51:55] very minor change though
[08:51:58] they add a table
[08:58:18] Analytics, Analytics-Kanban, User-Elukey: Update to cloudera 5.15 - https://phabricator.wikimedia.org/T204759 (elukey) Missed: https://www.cloudera.com/documentation/enterprise/5-12-x/topics/cdh_ig_hive_upgrade.html#topic_18_2_2 This requires a minor change in the hive db schema, updated the etherpad.
[09:57:30] (PS1) Fdans: Add change_tag to list of mediawiki tables to be dropped [analytics/refinery] - https://gerrit.wikimedia.org/r/470793 (https://phabricator.wikimedia.org/T205940)
[09:58:03] joal: helloo this is the last piece you were mentioning right?
[09:58:33] yessir 1
[09:58:36] Thanks fdans :)
[09:59:36] c'est mon plaisir joal!
[10:15:14] (PS1) Fdans: Deprecate reportcard [analytics/analytics.wikimedia.org] - https://gerrit.wikimedia.org/r/470795 (https://phabricator.wikimedia.org/T203128)
[10:39:18] fdans: o/
[10:39:29] hellooo
[10:40:23] whenever you have time can we chat about the geoip failure and https://gerrit.wikimedia.org/r/470778 ?
[10:40:36] didn't follow why you guys used hard links at the time :(
[10:42:02] elukey: we did the hardlinks to save storage space
[10:42:27] but I suspect it doesn't make much of a difference
[10:42:48] ah ok so it was an optimization, I'll ask andrew if he is ok to remove it
[10:43:01] because I think we made that choice in the assumption that there are repeated files across geoip snapshots
[10:43:02] because now we can't create hard links across partitions (root and /srv)
[10:43:39] elukey: but I think that assumption is false, so yeah, I'd get rid of em
[10:43:45] (but ofc ask andrew)
[10:50:04] ack :)
[10:50:11] joal: so I have cdh 5.15 in labs
[10:54:56] (including druid)
[10:55:10] and I started https://etherpad.wikimedia.org/p/analytics-cdh5.15
[10:55:38] if you want we could test today/tomorrow few things
[10:55:50] tomorrow is holiday
[10:55:51] friday :)
[10:59:28] fdans: ok to merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/470593/ ?
[11:01:40] elukey hmmm I think we gotta wait for the corresponding refinery patch to be deployed, right joal?
[11:10:14] sure sure
[11:19:47] * elukey lunch!
[12:40:16] Analytics, Analytics-Wikimetrics: Get a separate Labs Project to host wikimetrics instances in - https://phabricator.wikimedia.org/T76808 (Krenair) Open>Resolved There appears to be a `wikimetrics` project so I assume this was resolved ages ago.
[12:40:55] Analytics, Analytics-Wikimetrics: Get a separate Labs Project to host wikimetrics instances in - https://phabricator.wikimedia.org/T76808 (Krenair) Looks like it was {T122108}
[13:14:17] Analytics-EventLogging, Analytics-Kanban, EventBus, Core Platform Team Backlog (Later), Services (later): Make schemas use required $schema property with absolute path to the schema - https://phabricator.wikimedia.org/T208361 (Ottomata) We need to be able to identify the schema for an event o...
[13:16:47] Analytics-EventLogging, Analytics-Kanban, EventBus, Core Platform Team Backlog (Later), Services (later): Make schemas use required $schema property with absolute path (not absolute URL) to the schema - https://phabricator.wikimedia.org/T208361 (Ottomata)
[13:22:27] Quarry, Cloud-VPS: Check whether quarry project requires NFS or not - https://phabricator.wikimedia.org/T208411 (Krenair)
[13:34:35] ottomata: o/
[13:34:45] so there is a schema upgrade for hive to do
[13:34:50] for cdh 5.15
[13:34:56] but it is a super simple one, it creates one table
[13:35:10] hive/oozie work fine in labs (cdh 5.15)
[13:35:12] oh ok
[13:35:13] great!
[13:35:35] whenever you have time can you check https://etherpad.wikimedia.org/p/analytics-cdh5.15 to see if anything looks weird/wrong?
[13:36:07] I'll do some tests with joal but I'd say that we can upgrade next week
[13:36:11] if you agree
[13:46:25] Quarry, Cloud-VPS: Check whether quarry project requires NFS or not - https://phabricator.wikimedia.org/T208411 (zhuyifei1999) Open>Invalid Yes {T178520}
[13:47:17] elukey: added bits about notebook100[34]
[13:47:30] q: at the bottom 'ADD THESE STEPS FOR NEXTIME'
[13:47:34] lsof for deleted libs?
[13:47:45] that'd be via cumin to just make sure no procs are still running with old stuff?
[13:47:48] also, stop reportupdater?
[13:47:54] shoudl that be added to stat* cron job stop task
[13:47:55] ?
[13:48:18] yep yep still needed to go through those, I think that adding a cumin command for lsof is good
[13:48:29] also stopping report updater
[13:50:09] addshore: o/ so ::statistics::wmde can be moved over to stat1007?
[13:56:09] Analytics, Analytics-Kanban, Patch-For-Review: Move users from stat1005 to stat1007 - https://phabricator.wikimedia.org/T205846 (elukey)
[13:56:41] bearloga: o/
[13:57:00] bearloga: sorry for the delay but I moved statistics::discovery to stat1007
[13:57:08] published datasets rsynced as well
[13:57:14] and the cron is now only on stat1007
[13:57:27] let me know if everything is ok or if you want me to revert
[13:57:55] Analytics, Analytics-Kanban, Patch-For-Review: Move users from stat1005 to stat1007 - https://phabricator.wikimedia.org/T205846 (elukey)
[13:59:02] Analytics, Analytics-Kanban, Patch-For-Review: Move users from stat1005 to stat1007 - https://phabricator.wikimedia.org/T205846 (elukey)
[13:59:40] Quarry, Cloud-VPS: Check whether quarry project requires NFS or not - https://phabricator.wikimedia.org/T208411 (Krenair) Invalid>Resolved thanks @zhuyifei1999
[14:01:21] Quarry, Cloud-Services, cloud-services-team (Kanban): Migrate 'Quarry' project to eqiad1 - https://phabricator.wikimedia.org/T207677 (Andrew) @zhuyifei1999 or @Framawiki, can one of you announce this downtime to interested parties? Or at least rattle of a list of contacts here so I can do that?
[14:06:04] elukey: should be fine
[14:06:05] doooo ittt!
[14:08:49] ack thanks!
[14:08:58] anything that it needs to be synced over to work?
[14:12:06] Analytics, Analytics-Kanban, Patch-For-Review: Move users from stat1005 to stat1007 - https://phabricator.wikimedia.org/T205846 (elukey)
[14:12:40] Analytics, Analytics-Kanban, Patch-For-Review: Move users from stat1005 to stat1007 - https://phabricator.wikimedia.org/T205846 (elukey) @mpopov all moved, let me know if anything is missing! Cron removed from stat1005 too :)
[14:20:04] elukey: *thinks*
[14:20:23] it should be good to go, i dont believe there is any state
[14:21:02] addshore: so I removed the crons from stat1005, and I am rsyncing /srv/analytics-wmde/*
[14:21:18] the crons will appear soon on stat1007
[14:21:32] I am going to announce today the move to everybody
[14:22:38] ack
[14:23:19] elukey: Just read the etherpad - Sounbds good, except maybe spark
[14:23:46] elukey: We have manual installations of spark (2.3.1) - Do we need to update the CDH ones?
[14:24:24] Also elukey about labs, I was thinking of upgrading javaversion, not CDH :D
[14:25:53] elukey: hey, there seems to be a puppetfail on stat1007, possibly due to ca426ea318558aa661e6c52d1f73273f936c0f68 ?
[14:26:22] Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/srv/analytics-wmde/graphite],File[/srv/analytics-wmde/wdcm]
[14:27:21] joal: elukey i wonder if we can uninstall spark1...
[14:27:22] :)
[14:27:36] ottomata, elukey - Can we bump java on stat1004 yo the latest, for me to test maven issues?
[14:28:03] ottomata: I'd like to, none of our jobs uses spark1 anymore, but I think discovery still have some
[14:28:17] bump java?
[14:28:20] is there a new .deb?
[14:29:14] elukey: okay, installing R packages now and then will test queries & scripts
[14:31:50] wow I got away 5 mins and 10 people ping me :D
[14:31:56] mwahaha :)
[14:32:00] ema: thanks! I am working on it
[14:32:55] ottomata: There must be - See T208377
[14:32:55] joal: not sure if the new version is available, and moritz is out today, so we'll need to wait
[14:32:55] T208377: Fix refinery-source jenkins build/release jobs - https://phabricator.wikimedia.org/T208377
[14:33:02] mwarf
[14:33:08] elukey: <3
[14:33:15] yeah there is one security release that now introduces a specific check for the jars loaded IIUC
[14:34:15] elukey: https://packages.debian.org/stretch/openjdk-8-jdk
[14:34:26] Different version than the one we have on stat1004 for instance
[14:34:49] * joal hides far away before being caught back on not knowing enough about apt
[14:35:04] joal: sure, but I am not sure what is moritz's workflow for these
[14:35:20] I can update on stat1004 if you want though
[14:35:27] but others might see the effects
[14:37:17] rigth elukey - I'm gonna use labs
[14:37:23] will be better
[14:41:18] joal: we can't really skip the generate_json_pageview step in the druid pageviews workflow unless we flatten the user agent map in the table right?
[14:41:46] hm - interesting !
[14:43:43] joal: unless there is a way that the indexer can access hive maps
[14:43:58] fdans: in a meeting now, will discuss in a while
[14:44:10] i was hoping that with the ts change we'd be able to skip the temporary table creation, but there's the ua
[14:44:11] sure!
[14:44:47] Analytics, Analytics-Kanban, Patch-For-Review: Move users from stat1005 to stat1007 - https://phabricator.wikimedia.org/T205846 (elukey)
[14:45:14] Analytics, Analytics-Kanban, Patch-For-Review: Move users from stat1005 to stat1007 - https://phabricator.wikimedia.org/T205846 (elukey)
[14:51:33] ah snap calendar didn't remember to me the meeting for datalake
[14:51:36] just seen it now
[14:51:44] lemme know if you guys need me, sorry!
[14:55:59] heya teammm!
[15:01:26] Quarry, Cloud-Services, cloud-services-team (Kanban): Migrate 'Quarry' project to eqiad1 - https://phabricator.wikimedia.org/T207677 (zhuyifei1999) I added a [[https://quarry.wmflabs.org/|maintenance message]]: ``` MAINTENANCE_MSG: 'Quarrt will be down for maintenance on Monday, November 5 at 5 PM UT...
[15:04:01] Analytics, Analytics-Kanban, Patch-For-Review: Move users from stat1005 to stat1007 - https://phabricator.wikimedia.org/T205846 (elukey)
[15:18:22] Analytics, Operations, ops-eqiad: Degraded RAID on aqs1006 - https://phabricator.wikimedia.org/T206915 (Cmjohnson) @elukey the new disk arrived, I am happy to swap it whenever you're ready. it's the first disk on the server and you will need manually replace it in raid since it's SW raid. ping w...
[15:21:34] hey joal, elukey, fdans, do you want me to mention anything specific in scrum of scrums?
[15:21:59] mforns: yes! stat1005 users are moved to stat1007 0 T205846
[15:22:00] T205846: Move users from stat1005 to stat1007 - https://phabricator.wikimedia.org/T205846
[15:22:10] mforns: nothing my side :)
[15:22:13] the '0' should have been '-' sorry
[15:22:30] fdans, elukey OK!
[15:22:33] thanks
[15:22:37] a-team: there is now a scary DO NOT USE THIS SERVER on stat1005 when you log in
[15:22:40] can you check?
[15:22:47] ok
[15:23:03] hahahahaha
[15:23:18] it works
[15:23:19] elukey: it is BIg :)
[15:23:21] :]
[15:23:32] i copied the same one of the deployment servers
[15:23:39] joal, sth to mention on SoS from your side?
[15:23:44] (you can get it if you ssh to deploy2001)
[15:24:16] Analytics, Analytics-Kanban, Patch-For-Review: Move users from stat1005 to stat1007 - https://phabricator.wikimedia.org/T205846 (elukey)
[15:31:31] joal you got a couple min to talk now?
[15:31:43] still in meeting
[15:35:50] sorryyyy
[15:39:56] (CR) Nuria: [C: 2] Add change_tag to list of mediawiki tables to be dropped [analytics/refinery] - https://gerrit.wikimedia.org/r/470793 (https://phabricator.wikimedia.org/T205940) (owner: Fdans)
[15:42:04] (CR) Nuria: [V: 2 C: 2] Deprecate reportcard [analytics/analytics.wikimedia.org] - https://gerrit.wikimedia.org/r/470795 (https://phabricator.wikimedia.org/T203128) (owner: Fdans)
[15:42:23] elukey: can i just check when exactly youll move the wmde puppet stuff over?
[15:42:23] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (Cmjohnson)
[15:42:40] just so I know when to send a mail to our internal mailing list to keep an eye out for anything off happening
[15:43:03] or is it done already? (it might be now i look at the ticket)
[15:46:15] addshore: already done
[15:47:50] fdans: I have time !
[15:47:58] joal: cave?
[15:48:22] OMW !
[15:50:11] Analytics, Analytics-Kanban, Operations, netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (Ottomata) Just had a great meeting with @chasemp, @faidon, @JAllemandou and @nuria. The main action item (after Nuria h...
[15:50:22] elukey: amazing
[15:53:25] Analytics, Analytics-Kanban, Operations, netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (chasemp) My notes from the 2018-10-31 meeting: ```https://phabricator.wikimedia.org/T207321#4691776 * hosts that push...
[16:00:55] ping milimetric
[16:01:59] I’m sick in bed Nuria, sorry
[16:04:58] get well soon milimetric!