[01:42:17] 10Analytics, 10Research: Recommend the best format to release public data lake as a dump - https://phabricator.wikimedia.org/T224459 (10leila)
[05:46:35] (03PS2) 10Elukey: mobile_apps: move uniques daily/monthly oozie coords to hive2 actions [analytics/refinery] - 10https://gerrit.wikimedia.org/r/528708 (https://phabricator.wikimedia.org/T227257)
[05:46:40] (03CR) 10Elukey: [C: 03+2] mobile_apps: move uniques daily/monthly oozie coords to hive2 actions [analytics/refinery] - 10https://gerrit.wikimedia.org/r/528708 (https://phabricator.wikimedia.org/T227257) (owner: 10Elukey)
[05:46:57] (03CR) 10Elukey: [V: 03+2 C: 03+2] mobile_apps: move uniques daily/monthly oozie coords to hive2 actions [analytics/refinery] - 10https://gerrit.wikimedia.org/r/528708 (https://phabricator.wikimedia.org/T227257) (owner: 10Elukey)
[05:47:05] (03PS3) 10Elukey: pageview: move druid oozie coordinators to hive2 actions [analytics/refinery] - 10https://gerrit.wikimedia.org/r/528714 (https://phabricator.wikimedia.org/T227257)
[05:47:44] (03CR) 10Elukey: [V: 03+2 C: 03+2] pageview: move druid oozie coordinators to hive2 actions [analytics/refinery] - 10https://gerrit.wikimedia.org/r/528714 (https://phabricator.wikimedia.org/T227257) (owner: 10Elukey)
[06:39:05] dr0ptp4kt: o/
[06:39:15] I noticed that at around the time https://yarn.wikimedia.org/proxy/application_1564562750409_37275/ started
[06:39:38] (~19:55 UTC) we had an increase of files on HDFS for an hour
[06:39:43] up to +3M files
[06:40:44] sadly I don't have enough hdfs action logs to track down who created all those files
[06:40:51] but your spark shell kinda matches
[06:43:44] I took the liberty of stopping it since I am investigating a heap pressure increase on the HDFS namenodes; the shell seemed to be a test, but I want to know if it caused temporary files to be created
[06:44:04] sorry for the intrusion into your work ;(
[06:59:58] (wasn't able to decrease the number of files, back to square one)
[10:13:29] a-team: I have switched one HDFS namenode to the Java G1 GC
[10:13:36] and failed over the master to it
[10:13:38] as a test
[10:13:42] hi elukey - i'm up
[10:13:50] hey
[10:14:08] * dr0ptp4kt scratches head
[10:14:14] sorry for the ping, I thought/hoped to solve an issue by stopping a spark shell that you created, but failed
[10:14:30] I was checking if it was responsible for the creation of the 3M files on HDFS
[10:14:31] here's what i recall from yesterday...
[10:15:32] i tried running the 'with hive via spark' example on https://wikitech.wikimedia.org/wiki/SWAP
[10:15:36] that errored out
[10:15:45] then,
[10:16:48] i only did the following, from chelsy's notebook at https://analytics.wikimedia.org/datasets/one-off/English%20Wikipedia%20Page%20Views%20by%20Topics.html#Top-50-articles-read-in-March-2019-on-English-Wikipedia
[10:18:22] yeah nothing problematic on that notebook, sometimes in the past we had a problem with spark queries creating a ton of temporary files
[10:18:25] https://www.irccloud.com/pastebin/Mv5dLW1r/
[10:18:38] but not this case, after stopping your spark session nothing changed
[10:18:43] ah
[10:18:47] i see
[10:18:47] it only matched on timing, unfortunately
[10:19:05] i remember reading about that on a mailing list? i was having a chuckle, because i know i've melted things down once or twice
[10:19:38] we still don't have a good way to limit/throttle these events, but we have good monitoring :)
[10:19:46] thanks a lot for following up! sorry for the ping
[10:21:46] thank you for the heads up! i'll try to be careful as i do some bigger jobs here in the nearish future
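For context, a minimal sketch of how a file-count spike like this can be narrowed down with the standard Hadoop CLI (the paths below are illustrative assumptions, not the directories actually inspected):

    # Per-path totals; output columns are DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME.
    sudo -u hdfs hdfs dfs -count /tmp /user/hive/warehouse

    # Rank subdirectories by file count (column 2) to find the offender.
    sudo -u hdfs hdfs dfs -count '/tmp/*' | sort -rn -k2,2 | head -20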
[10:31:12] a-team: going afk for lunch + errand, if anything happens with the namenodes and for some reason you can't reach me
[10:31:15] sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
[10:31:18] should do the trick
[10:31:35] (an-master1001's namenode is still running with the old CMS GC, and currently set as standby)
[10:31:40] * elukey afk!
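Before (and after) running that failover, it is worth confirming which namenode is actually active; a short sketch using the standard HA admin command, with the service IDs from the line above:

    # Should report "active" for one namenode and "standby" for the other.
    sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
    sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet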
[12:44:48] Amir1: o/
[12:44:50] you there?
[12:44:56] I just saw the ping in #operations sorr
[12:44:57] *sorry
[12:48:02] elukey: meeting now but I will get back to you
[12:48:54] super
[13:08:15] elukey: Ok. It seems we have a problem with wmde analytics not being able to connect to wdqs
[13:09:15] Amir1: yep yep, let's try to see what's wrong
[13:09:23] can you tell me what call is failing?
[13:09:36] if it is traffic from stat1007 to wdqs then it should work
[13:09:45] the firewall was only for traffic to stat1007
[13:10:35] but wdqs1005.eqiad.wmnet times out from stat1007
[13:10:42] try curl there
[13:11:21] to port 8888?
[13:11:43] if so it works for me
[13:11:51] curl wdqs1005.eqiad.wmnet:8888
[13:12:11] do you have an http_proxy setting in your env?
[13:12:15] or similar
[13:12:31] because curl picks it up IIRC
[13:14:37] Amir1: --^
[13:15:00] oh, let me check
[13:15:50] yeah, it was the port
[13:15:52] facepalm
[13:16:13] all right, nothing seems broken then :)
[13:16:31] Amir1: did you see the last updates in https://phabricator.wikimedia.org/T176875 ?
[13:16:45] in theory eventually we should get to a point where you guys can use the LVS endpoint
[13:16:50] and not hardcode hosts
[13:17:04] let me know if it is still something that you guys would like
[13:17:23] that would be amazing
[13:17:26] Thanks!
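Since curl silently honors proxy environment variables, elukey's question above is a common gotcha worth ruling out explicitly; a generic sketch (host and port taken from the log, the rest is an assumption about the shell environment):

    # Is a proxy configured in this shell?
    env | grep -i _proxy

    # Bypass any configured proxy for a direct connectivity test.
    curl --noproxy '*' http://wdqs1005.eqiad.wmnet:8888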
[13:20:58] (03PS1) 10Ladsgroup: Fix port when connecting to WDQS [analytics/wmde/toolkit-analyzer] - 10https://gerrit.wikimedia.org/r/529094 (https://phabricator.wikimedia.org/T214894)
[13:21:02] ack then, if you have time please comment on the task :)
[13:30:06] (03CR) 10Alaa Sarhan: [C: 03+2] Fix port when connecting to WDQS [analytics/wmde/toolkit-analyzer] - 10https://gerrit.wikimedia.org/r/529094 (https://phabricator.wikimedia.org/T214894) (owner: 10Ladsgroup)
[13:31:29] (03Merged) 10jenkins-bot: Fix port when connecting to WDQS [analytics/wmde/toolkit-analyzer] - 10https://gerrit.wikimedia.org/r/529094 (https://phabricator.wikimedia.org/T214894) (owner: 10Ladsgroup)
[13:34:59] (03PS1) 10Ladsgroup: New build [analytics/wmde/toolkit-analyzer-build] - 10https://gerrit.wikimedia.org/r/529097 (https://phabricator.wikimedia.org/T214894)
[13:35:36] need to step afk for ~30 mins (workers at home)
[13:36:12] (03CR) 10Ladsgroup: [C: 03+2] "It's a bin file, there's no way to review this." [analytics/wmde/toolkit-analyzer-build] - 10https://gerrit.wikimedia.org/r/529097 (https://phabricator.wikimedia.org/T214894) (owner: 10Ladsgroup)
[13:36:23] (03Merged) 10jenkins-bot: New build [analytics/wmde/toolkit-analyzer-build] - 10https://gerrit.wikimedia.org/r/529097 (https://phabricator.wikimedia.org/T214894) (owner: 10Ladsgroup)
[13:39:46] (03PS1) 10Ladsgroup: New build [analytics/wmde/toolkit-analyzer-build] - 10https://gerrit.wikimedia.org/r/529098 (https://phabricator.wikimedia.org/T214894)
[13:39:55] (03CR) 10Ladsgroup: [C: 03+2] New build [analytics/wmde/toolkit-analyzer-build] - 10https://gerrit.wikimedia.org/r/529098 (https://phabricator.wikimedia.org/T214894) (owner: 10Ladsgroup)
[13:40:01] (03Merged) 10jenkins-bot: New build [analytics/wmde/toolkit-analyzer-build] - 10https://gerrit.wikimedia.org/r/529098 (https://phabricator.wikimedia.org/T214894) (owner: 10Ladsgroup)
[13:41:26] elukey: one quick request, can you run puppet agent on stat1007?
[13:44:12] Amir1: sure
[13:45:24] done!
[13:47:03] Thanks!
[13:55:22] 10Analytics, 10Analytics-Kanban: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (10elukey) Marcel and I discussed this; we already index IPs in webrequest_sampled_128, so it shouldn't be a huge problem, but we'll have to try with one day of data first (...
[13:58:06] hey team :D
[13:58:18] o/
[13:58:55] mforns: should we meet today or can we skip to next week? (I suppose you'll be on duty right?)
[13:59:26] elukey, yes, let's do ops week next week :]
[13:59:40] and we can brief it on monday
[14:00:52] hm, by the way elukey, we do not overlap a lot of hours, so next week we'll be working alone 50-60% of the time, is that going to be a problem?
[14:01:17] if so, I can try to start working earlier
[14:03:27] nono it is super fine
[14:03:38] I'll try to check after dinner if anything is needed
[14:03:40] no problem
[14:12:28] 10Analytics, 10Analytics-Kanban: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (10mforns) > Is that in general or for longer retention time? Maybe we can store aggregated data (without source/dest IP) for long term storage? > We can also use IP prefixes inst...
[14:17:02] a-team can you visit https://stats.wikimedia.org/v2/#/all-projects and see if it works fine for yall?
[14:17:34] from my limited understanding yes, are you seeing something weird/unexpected?
[14:17:41] fdans, hm, for me it's super broken
[14:17:46] I'm getting a 404 on the stylesheet
[14:17:49] aah! fonts are messed up on mobile
[14:17:59] what the hell
[14:17:59] Must be a bad deploy but how... my bad
[14:18:02] Looking
[14:18:21] but when? I was browsing wikistats yesterday without a problem
[14:19:18] I deployed yesterday fdans
[14:19:26] but I checked it and it was fine
[14:19:38] it's fine on desktop right?
[14:19:53] milimetric: nope, css is 404ing
[14:20:01] milimetric, I'm looking from desktop, and it seems I see the mobile version
[14:20:18] so weird, if I look from my desktop it's fine
[14:20:40] I'm wondering if it's a cache thing
[14:20:40] I use chrome from ubuntu
[14:20:48] oh wait
[14:20:48] like, we're requesting an out of date css hash
[14:20:54] fdans: https://stats.wikimedia.org/v2/main.bundle.6bb1aa806f695a0bf1c1.css
[14:21:07] yeah that's different
[14:21:09] what the hell
[14:21:15] oh so it's caching the index
[14:21:18] firefox is broken as well
[14:21:45] milimetric: mforns I'm getting a 404 on https://stats.wikimedia.org/v2/main.bundle.e1622cb13ffa9b090e25.css
[14:21:47] I think what's happening is this is the first time we're deploying the new semantic update maybe
[14:21:53] yeah, that's the old bundle
[14:21:55] I've flushed cache and hard reloaded
[14:22:07] something's still hanging on to that old html
[14:22:26] 10Analytics, 10Analytics-Kanban: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (10ayounsi) Nullifying them is fine. Depending on how costly they are, we could consider getting rid of other fields as well after 90 days (and aggregating the data through the re...
[14:22:38] if you look at the index.html deployed, it references the CSS I showed
[14:22:44] milimetric: the site version at the bottom says 2.6.5
[14:23:06] yeah, that makes no sense, I deployed 2.6.6
[14:23:31] milimetric: it seems like a server propagation thing
[14:23:44] I am getting a 404 too on https://stats.wikimedia.org/v2/main.bundle.e1622cb13ffa9b090e25.css
[14:23:51] which would make sense if we were hosting it on multiple servers :)
[14:23:59] me too 404
[14:24:13] I'm getting that on my mobile device, but not on desktop
[14:24:17] wanna batcave and talk about this?
[14:24:18] on desktop I have the latest
[14:24:20] sure
[14:24:24] omw
[14:30:18] elukey: could varnish be hanging on to an old version of this?
[14:31:35] https://github.com/wikimedia/puppet/blob/fd430ec2680dd8f25717dbb5926c671cdf579188/modules/statistics/manifests/sites/stats.pp
[14:32:38] * dsaez is attending a great tech talk by milimetric
[14:32:39] milimetric: old version of this --> ? puppet link or the stats one? (sorry didn't get it)
[14:32:58] elukey: to the cave
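A sketch of how to confirm from the command line that a cache is serving a stale index.html, which is what the symptoms above point to (the header names are the common Varnish ones, assumed rather than verified for this site):

    # Inspect cache-related response headers on the index.
    curl -sI https://stats.wikimedia.org/v2/ | grep -iE '^(age|cache-control|x-cache)'

    # Pull out the CSS bundle hash the served index references; if it differs
    # from the freshly deployed bundle, a cached index.html is the culprit.
    curl -s https://stats.wikimedia.org/v2/ | grep -o 'main\.bundle\.[0-9a-f]*\.css'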
[14:59:04] 10Analytics: Tune Wikistats 2 Varnish caching - https://phabricator.wikimedia.org/T230136 (10Milimetric)
[14:59:51] 10Analytics, 10Analytics-Kanban: Tune Wikistats 2 Varnish caching - https://phabricator.wikimedia.org/T230136 (10Milimetric) p:05Triage→03High a:03Milimetric
[15:20:09] 10Analytics, 10Product-Analytics: Add page protection status to MediaWiki history tables - https://phabricator.wikimedia.org/T230044 (10Milimetric) p:05Triage→03High
[15:21:26] 10Analytics, 10Analytics-EventLogging: Port reportupdater queries that use MySQL log eventlogging database to Hive event database - https://phabricator.wikimedia.org/T229862 (10Milimetric) p:05Triage→03High
[15:23:37] 10Analytics, 10Anti-Harassment (The Letter Song), 10Patch-For-Review: Instrument Special:Mute - https://phabricator.wikimedia.org/T224958 (10Milimetric) We just want to be involved if you want to whitelist data to be kept more than 90 days. Other than that, you don't need any approval from us. We're happy...
[15:30:33] elukey, I'm going to delete the dump files I created yesterday, will that trigger an alarm as well?
[15:40:09] elukey: one question about restarting the cassandra bundle
[15:40:50] I just wanted to double check that we shouldn't push this to the beginning of the next month, which is when we usually restart the cassandra bundle (cc mforns ?)
[15:42:37] fdans, yea... we can restart the job on the 1st of sept no?
[15:44:18] mforns elukey yeah I'm asking because every time I've touched the cassandra loading bundle I've had to wait until the 1st to restart the whole thing, but I'm not sure if that applies here
[15:46:24] mforns: nono it doesn't trigger any alarm, it will just free heap space :)
[15:46:58] fdans: yes we can definitely wait, the point is that we have to recompute 8 days at this point
[15:47:08] (too many 'point' sorry)
[15:47:15] it is not a huge deal, but we can wait
[15:47:28] let's add a note to the etherpad
[15:47:37] that's a good point elukey
[15:47:40] :P
[15:55:13] thank you, adding note to the etherpad's header
[16:02:53] !log restarting edit_hourly
[16:02:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:08:07] !log restarting oozie coordinator mobile_apps-uniques-monthly-coord
[16:08:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:09:56] !log restarting oozie coordinator mobile_apps-uniques-daily-coord
[16:09:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:13:05] !log restarting oozie coordinator pageview-druid-hourly-coord
[16:13:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:14:30] !log restarting oozie coordinator pageview-druid-daily-coord
[16:14:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:15:38] !log restarting oozie coordinator pageview-druid-monthly-coord
[16:15:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:24:22] (03PS1) 10Fdans: Update changelog for 0.0.97 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/529115
[16:25:26] (03CR) 10Fdans: [V: 03+2 C: 03+2] Update changelog for 0.0.97 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/529115 (owner: 10Fdans)
[16:25:57] !log releasing refinery-source 0.0.97 to Maven
[16:25:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:33:57] (03Merged) 10jenkins-bot: Update changelog for 0.0.97 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/529115 (owner: 10Fdans)
[16:38:43] !log updating jars
[16:38:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:40:10] !log deploying refinery
[16:40:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:43:35] elukey: I'm getting deployment shenanigans
[16:43:55] fdans: tell me all about it
[16:43:59] https://www.irccloud.com/pastebin/ane6GVH2/
[16:45:11] deploy-local failed: {u'full_cmd': u'/usr/bin/git fat pull', u'stderr': u'\nrsync: link_stat "/git-fat/fce897afd0d4d8bc0082ff451c20f23369e0cfcc" (in archiva) failed: No such file or directory
[16:45:15] this is kinda weird
[16:50:18] fdans: it might be a messed up state in stat1007
[16:50:28] I tried to do git fat pull on my local version on stat1004 and it works
[16:50:47] i see
[16:51:10] elukey: soooo what should i do?
[16:53:00] fdans: you can try to debug it on stat1007 with me :)
[16:53:16] is /git-fat/fce897afd0d4d8bc0082ff451c20f23369e0cfcc a specific jar or something new?
[16:54:37] elukey: yeah the only heavy files are the new jars
[16:56:04] so git fat pull on stat1007 returns nothing to download
[16:56:09] but the jar is indeed not there
[16:56:22] I am not sure how to find the match between filename and git-fat sha
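One plausible way to map a git-fat sha back to a filename, sketched from git-fat's stub format (files that haven't been pulled are small placeholders embedding the object sha; the artifacts/ path is an assumption, not verified against this repo):

    # In a checkout where the fat files are still stubs:
    grep -rl 'fce897afd0d4d8bc0082ff451c20f23369e0cfcc' artifacts/

    # Or search the repo history for the commit that introduced the stub.
    git log --all -S 'fce897afd0d4d8bc0082ff451c20f23369e0cfcc' --oneline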
[16:58:08] 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, 10CPT Initiatives (Modern Event Platform (TEC2)), and 2 others: Fix revision-score event production in change-prop after migration of revision-create to eventgate-main - https://phabricator.wikimedia.org/T228688 (10Pchelolo) 05Open→03Resolved Fi...
[16:58:14] 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, 10CPT Initiatives (Modern Event Platform (TEC2)), and 4 others: Modern Event Platform: Stream Intake Service: Migrate eventlogging-service-eventbus events to eventgate-main - https://phabricator.wikimedia.org/T211248 (10Pchelolo)
[16:59:20] elukey: I can do bc if you want to brainbounce
[17:01:39] fdans: let's abort the deployment (without rollback)
[17:02:36] elukey: yeah about that... I may have lost the ssh connection and I forgot to tmux
[17:02:49] so I guess the deployment was aborted without rollback?
[17:03:49] not cleanly https://tools.wmflabs.org/sal/production (it shows only your start..)
[17:05:07] I am checking the status of scap on deploy1001 but I don't see any way to specifically abort anything in progress
[17:05:11] well I guess it stopped :)
[17:07:13] elukey: yeah I'm sorry, I remembered about using tmux the moment I pressed enter, never happened before
[17:07:20] no problem
[17:07:27] I am cleaning up stat1007
[17:07:36] in a bit we should be ready to re-deploy
[17:08:29] awesome
[17:11:29] elukey@stat1007:/srv/deployment/analytics$ find -name fce897afd0d4d8bc0082ff451c20f23369e0cfcc
[17:11:32] ./refinery-cache/revs/cef01d355c5c4576b90d0f95311e8eb66a58fe58/.git/fat/objects/fce897afd0d4d8bc0082ff451c20f23369e0cfcc
[17:11:35] we should be good
[17:11:40] fdans: can you try another deploy?
[17:12:00] elukey: ON IT
[17:13:15] elukey: oh there's this thing
[17:13:16] 17:12:52 deploy failed: Failed to acquire lock "/var/lock/scap.analytics_refinery.lock"; owner is "fdans"; reason is "deploying analytics refinery"
[17:14:49] ahahahha
[17:15:15] you can in theory remove it
[17:15:24] it is owned by you
[17:15:28] that should unblock you
[17:15:40] elukey: luca I know nothing
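For the record, a stale scap lock like the one above is just a file owned by the deployer, so clearing it is a one-liner on the deployment host (path taken verbatim from the error message):

    # Confirm ownership, then remove the stale lock as its owner.
    ls -l /var/lock/scap.analytics_refinery.lock
    rm /var/lock/scap.analytics_refinery.lock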
[17:17:55] elukey: canary fine, moving on
[17:25:37] fdans: not a blocker but I can see in scap deploy-log
[17:25:37] 17:17:21 [stat1004.eqiad.wmnet] Using deprecated git_fat config, swap to git_binary_manager
[17:25:47] so we might need to upgrade our config
[17:26:38] elukey: this thing is taking ages
[17:27:44] yeah
[17:29:45] (03PS1) 10Elukey: Move to git_binary_manager: git-fat [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/529126
[17:29:51] it took a long time for me too last week
[17:33:34] done!
[17:33:38] !log scap deploy of refinery done
[17:33:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:34:04] going afk for a bit, checking later
[17:34:07] is it ok fdans?
yes
[17:46:31] I remember the time when I pressed enter on this command, Barack Obama was still president
[17:47:06] omfg finally
[17:47:15] !log refinery deploy successful
[17:47:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:47:32] a-team
[17:47:37] https://usercontent.irccloud-cdn.com/file/O1BmzEX2/giphy-1.gif
[17:48:00] O.o
[17:48:00] THE TRAIN HAS ARRIVED AT ITS DESTINATION
[17:48:01] xDDD
[17:48:04] I'm done for the day
[17:48:08] bai
[18:39:14] (03CR) 10Thcipriani: [C: 03+1] "LGTM! Thanks for this!" [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/529126 (owner: 10Elukey)
[18:58:26] (03CR) 10Elukey: [V: 03+2 C: 03+2] Move to git_binary_manager: git-fat [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/529126 (owner: 10Elukey)
[19:00:14] (03CR) 10Elukey: [V: 03+2 C: 03+2] "> LGTM! Thanks for this!" [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/529126 (owner: 10Elukey)
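The merged change above presumably replaces the deprecated git_fat setting with the new key named in the deployment warning; a sketch of how to verify it in the scap repo checkout (file path and key format assumed, not confirmed against the actual patch):

    # Expect the deprecated "git_fat" setting to be gone in favor of
    # "git_binary_manager: git-fat".
    grep -nE 'git_fat|git_binary_manager' scap/scap.cfg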