[02:30:14] Analytics-Kanban, EventBus, Wikimedia-Stream: Public Event Streams - https://phabricator.wikimedia.org/T130651#2658074 (Ottomata) Hm, true. But you know, it would be pretty trivial to have both options at different URLs. I just hacked this together in service-template-node...it works. Haven't gott...
[06:09:05] Analytics-Kanban: Varnishkafka should auto-reconnect to abandoned VSM - https://phabricator.wikimedia.org/T138747#2658204 (elukey) p:Low>Normal
[06:41:15] joal: o/
[06:41:28] if you are ok I'd stop camus and oozie bundles
[06:41:35] as prep step for the reboots
[06:45:04] !log stopped camus on analytics1027 and suspended webrequest-load-bundle via Hue (prep step for reboots)
[06:45:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[06:45:16] let me know if it is enough :)
[06:46:05] * elukey commutes to the office
[07:11:17] * elukey back
[07:12:23] mmm the spark shell app master issue in Yarn UI seems to be related to a 200 with CL 0
[07:15:32] but then it goes to 503
[07:15:42] and I don't see anything like that in the apache logs
[07:42:05] now I can still see stuff on stat1002/3 running in cron and screen for different people
[07:44:36] neilpquinn, halfak - I can see screen sessions and cron jobs with your username on stat1003
[07:44:52] (probably you are not awake but it is worth a try)
[07:46:51] and there is an apache running on stat1002?
[07:48:31] same thing on 1003
[07:52:59] sorry to everybody, just rebooted stat100[23] :/
[07:54:07] Analytics-Tech-community-metrics, Developer-Relations (Jul-Sep-2016): Allow AKlapper to access https://wikimedia.biterg.io/edit/ - https://phabricator.wikimedia.org/T144704#2658301 (Lcanasdiaz) a:Dicortazar>Lcanasdiaz
[08:11:33] Starting the reboot of the Hadoop cluster
[08:52:24] !log varnishkafka 1.0.12 installed in cache:upload esams
[08:52:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[08:56:54] Hi elukey
[08:57:10] No alerts from what I see on Hadoop, sounds good
[08:58:29] hello :)
[08:58:37] I am restarting 104* atm
[08:59:03] I can see some jobs killed in yarn but could it be due to the stat restarts?
[08:59:18] or should those have kept running independently?
[08:59:33] (user jobs, mostly milimetric and mforns)
[09:14:06] !log varnishkafka 1.0.12 installed in cache:upload codfw
[09:14:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[09:32:35] joal: I think that we'd need to reboot 1027 and 1003 as well
[09:32:49] so stopping oozie and hive* daemons
[09:39:09] !log rebooted analytics1027
[09:39:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[09:43:39] ah nice hue does not come up
[09:43:41] grrr
[09:47:52] I always forget about the apache instance that comes up and steals port 8888 from hue
[09:47:55] grrr
[09:47:59] need to nuke it
[09:49:24] !log suspended all oozie bundles as prep step to reboot analytics1003
[09:49:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[09:51:54] !log executed aptitude remove apache2 on analytics1027 (we use nginx in front of hue; apache steals port 8888 from hue so it does not start)
[09:51:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[09:58:50] so user 'west1' in hadoop is creating tables with hive
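The recurring Hue problem above is a plain port conflict: a stray apache2 binds port 8888 before Hue can. As an illustrative sketch (not anything actually deployed on analytics1027), a pre-start check that reports whether a TCP port is still free:

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if we can bind the TCP port, i.e. no other daemon holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # SO_REUSEADDR lets us rebind a port left in TIME_WAIT,
        # but still fails if another process is actively listening.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False
```

A startup wrapper could refuse to launch Hue (or log which process to nuke) when `port_is_free(8888)` is False, instead of letting Hue fail silently.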
[09:58:58] is he in this chat by any chance?
[09:59:05] because I'd need to restart hive and oozie
[10:04:07] !log rebooted analytics1003 (oozie, hive-metastore and hive-server2 daemons affected)
[10:04:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[10:04:24] last ones are 100[12]
[10:11:38] ah snap the mysql database
[10:11:51] IIRC the last time we asked Jaime to fix it
[10:12:55] The MariaDB server is running with the --read-only option so it cannot execute this statement
[10:14:54] elukey: I think there is a SQL statement that unlocks it (like SET something = true;)
[10:15:14] yeah but I don't have it written anywhere
[10:15:23] neither do I :(
[10:18:10] I stopped oozie as we did last time
[10:18:13] maybe hive too
[10:21:20] PROBLEM - Oozie Server on analytics1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.catalina.startup.Bootstrap
[10:23:03] yeah I know oozie, but it is better that you stay down
[10:23:27] Jaime is in a meeting, I'll ping him later on
[10:25:27] so SET GLOBAL read_only = 0; should do the trick but I am not sure if we'd need to run other commands or not
[10:25:31] very ignorant
[10:25:55] elukey: IIRC it's the only thing
[10:28:27] we can try
[10:28:50] !log setting global read_only = 0 on the analytics1003 mariadb instance
[10:28:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[10:29:35] all good :)
[10:32:00] now 1001 and 1002 are the only ones left to reboot
[10:32:36] RECOVERY - Oozie Server on analytics1003 is OK: PROCS OK: 1 process with command name java, args org.apache.catalina.startup.Bootstrap
[10:32:36] joal: would you mind quickly checking that all is good?
[10:33:52] elukey: eqi1004
[10:33:55] ops :)
[10:35:41] elukey: Can't connect to hive from stat1004
[10:36:16] elukey: hue looks ok from oozie perspective
[10:36:29] this one sounds familiar
[10:36:39] is it a deja vu from the past reboots?
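The fix the channel half-remembers is indeed `SET GLOBAL read_only = 0;`. That check-then-clear step could be wrapped in a small helper; this is a hypothetical sketch, not the procedure actually used, written against any DB-API 2.0 cursor (e.g. one from pymysql):

```python
def ensure_writable(cursor):
    """Check @@GLOBAL.read_only on a MariaDB/MySQL server and clear it if set.

    `cursor` is any DB-API 2.0 cursor (e.g. pymysql.connect(...).cursor()).
    Returns True if the server is writable afterwards.
    """
    cursor.execute("SELECT @@GLOBAL.read_only")
    (read_only,) = cursor.fetchone()
    if read_only:
        # Requires the SUPER privilege; note this is NOT persistent --
        # the flag comes back on restart unless my.cnf is also changed,
        # which is exactly the follow-up the channel files later.
        cursor.execute("SET GLOBAL read_only = 0")
        cursor.execute("SELECT @@GLOBAL.read_only")
        (read_only,) = cursor.fetchone()
    return not read_only
```

The re-read after the `SET` confirms the change took effect rather than assuming it did.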
[10:36:49] I can't recall elukey
[10:36:53] hm
[10:37:09] so hive uses the metastore right?
[10:37:10] elukey: have you restarted camus?
[10:37:11] on 1003
[10:37:18] correct elukey
[10:37:19] joal: nope, it is stopped
[10:37:23] k
[10:37:36] (trying to check only what is needed :)
[10:38:50] joal: now is it better?
[10:39:17] yes elukey, solved :)
[10:39:22] elukey: What was it?
[10:39:51] I gently restarted the hive metastore :D
[10:39:59] :d
[10:40:01] I think it was upset for some reason
[10:40:02] :D
[10:40:04] Great
[10:40:16] elukey: Possibly the mysql lock thing
[10:40:17] NOOOWWWW
[10:40:23] 100[12] ?
[10:40:36] no prob for me, I don't have anything there
[10:40:40] reboot 1002, wait, check, failover, reboot, etc..
[10:40:47] elukey: hadoop nodes have been restarted or not yet?
[10:41:50] elukey: If yes, I'd rather restart camus and oozie
[10:43:02] yes all restarted
[10:43:21] so I just stopped hdfs and yarn masters on 1002, and checked that 1001 is still the boss
[10:43:28] rebooting 1002 should be a no-op
[10:44:00] Ahhhh elukey, analytics100[12], not stat100[12] !!!
[10:44:03] I get it :)
[10:44:13] sorry
[10:44:51] okok!
:D Maybe I wasn't clear earlier on, sorry
[10:45:17] Nah, just me not making links :)
[10:45:23] :)
[10:45:32] so 1002 is up and running, yarn/hdfs are bootstrapping
[10:45:46] k
[10:46:57] good, both daemons ready
[10:50:13] so now I'll wait 15 minutes JUST TO BE SURE
[10:50:18] and I'll failover 1001
[10:51:03] elukey: if ok, I'd rather wait 5 minutes, hadoop is gently delayed in the mean time
[10:51:13] sure
[10:51:15] :-P
[10:51:18] Thanks ;)
[10:51:28] I tried to do everything as fast as possible today :(
[10:51:48] elukey: That's absolutely ok, I'm sorry if I came across as pressuring
[10:52:01] nono I got your point :)
[10:52:07] elukey: I'll shut up next time ;)
[10:52:14] 15 minutes is not big
[10:52:29] nono 5 are enough, ETOOPARANOIA
[10:52:34] :D
[10:52:55] in the meantime the new vk is deployed in upload esams and codfw
[10:53:02] seems to be working fine
[10:53:10] awesome, you rock :)
[10:53:11] I'll deploy to ulsfo and eqiad this afternoon
[10:53:29] Hope that this change will help
[10:55:09] !log Failover from analytics1001 to analytics1002 as prep step for 1001's reboot
[10:55:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[10:57:32] !log rebooted analytics1001
[10:57:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[11:00:44] joal: everything done
[11:00:52] Great !
[11:01:07] I am wondering if we should keep 1002 as master for a couple of hours, and then failover again this evening
[11:01:13] just to avoid messing up too much
[11:01:20] what do you think?
[11:01:40] I have no real opinion
[11:02:04] elukey: only 'issue-ish' thing is yarn.w.o broken ;)
[11:02:06] as expected
[11:02:34] ahhhhh snap!
[11:02:40] I forgot about that
[11:02:49] elukey: no big deal :)
[11:02:58] I can monitor via CLI
[11:03:22] ok let's do it this way, we can tunnel to 1002 for a bit
[11:03:30] sounds good elukey
[11:03:59] all right re-enabling everything
[11:04:03] oozie and camus
[11:04:10] k
[11:04:58] !log re-enabling oozie and camus after cluster reboots
[11:05:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[11:10:19] oozie started to complain about SLA miss, it is back :)
[11:10:43] so follow-ups: remove the read only at boot
[11:13:56] elukey: We should automatically do that yeah
[11:18:23] I am going afk for a bit to have lunch, please let me know on hangouts if anything goes on fire
[11:35:18] hi team!
[11:35:26] Hey mforns
[11:36:21] mforns: currently testing, but I might have found a way through restores :)
[11:36:35] joal, oh!
[11:36:36] cool
[11:37:22] joal, do you want to explain? :]
[11:37:29] I can mforns !
[11:37:38] batcave?
[11:37:43] sure, OMW !
[12:28:41] mforns: I have a difficult case :(
[12:28:54] joal, wanna cave?
[12:29:00] yup
[12:29:03] omw
[12:47:55] Analytics-Kanban, EventBus, Wikimedia-Stream: Public Event Streams - https://phabricator.wikimedia.org/T130651#2658705 (mobrovac) I'd be +1 for using SSE here alongside WebSockets. The URI structure could be straightforward for this - `/{topic}{/partition}{/offset}` in its simplest form. To add filte...
[12:48:39] Analytics-Kanban, EventBus, Services, Wikimedia-Stream, User-mobrovac: Public Event Streams - https://phabricator.wikimedia.org/T130651#2658707 (mobrovac)
[12:50:52] !log varnishkafka 1.0.12 installed in cache:upload ulsfo and eqiad
[12:50:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[12:51:01] we are missing misc, maps and text
[12:51:23] I'll try to do maps/misc today if I have time
[12:51:26] and text tomorrow
[12:51:32] but upload is covered
[13:12:50] !log restarted oozie jobs for 2016-9-22-6
[13:12:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[13:16:07] !log set read_only = false (on startup) for the analytics1003 mariadb instance
[13:16:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[13:16:51] !log previous comment was meant to be read as "set a permanent read only = false"
[13:16:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[13:47:51] Analytics-Dashiki, Analytics-Kanban, Need-volunteer: Vital-signs layout is broken - https://phabricator.wikimedia.org/T118846#2658817 (Milimetric) Todos for Dashiki, collected while doing this refactor [] refactor code and style for out-of-service component [] organize the components folder into log...
[13:48:30] (I'm around, working on dashiki semantic upgrade)
[13:48:49] (hello semantic Dan :)
[13:50:07] hi Luca, I thought you were taking today off?
[13:58:47] nope, I'll take a couple of hours tomorrow to catch my flight
[13:59:24] Did I tell you guys a different thing? :O
[13:59:53] elukey, just saw your ping. What happened with stat1003?
[14:00:06] halfak: it was rebooted for a kernel security update
[14:00:29] Damn. This was unusually bad timing. I lost a lot of work.
[14:00:45] :(
[14:01:19] really sorry halfak, I noticed the screen session and the crons but I wasn't sure about them
[14:01:37] yesterday I sent an email to analytics, research and engineering about the reboots
[14:01:58] elukey, gotcha. I should have seen and protested then. No worries.
[14:02:08] halfak: is there a better way to notify? Sorry you lost work
[14:02:18] yeah I was about to ask the same
[14:02:49] Not sure. Let me look at the email.
[14:03:08] maybe like [REBOOTING] in the subject line? and people can filter for it?
[14:03:34] elukey: related-ish question: the replication of data to eventlogging seems to have stopped yesterday
[14:03:34] I don't see an email to research
[14:03:58] Maybe it didn't go to the right email address.
[14:04:24] ottomata: Hi !
[14:04:33] ottomata: Would you have a minute for me?
[14:04:42] halfak: it's [Analytics] Upcoming reboots of stat and Hadoop hosts due to Kernel upgrades
[14:04:57] it went to wiki-research-l (forwarded by me) and analytics-l, then to engineering separately
[14:05:05] milimetric, ahh. It went to wiki-research-l!
[14:05:19] should we hit the internal one instead?
[14:05:24] That might be the confusion. Yeah. I think so.
[14:05:39] Still I should have seen it on the analytics lists.
[14:05:50] Maybe we can have a direct ping list for those machines.
[14:06:17] It's sort of like, e.g., you wanted to reboot ORES VMs and contacted me via labs-l
[14:06:41] yeah, it's hard without a smaller specific list
[14:06:43] I think there's only a few of us who regularly use stat100(2|3) for long term jobs.
[14:07:04] Maybe I can make one and suggest people put their contact on wikitech
[14:07:24] Analytics-Kanban, EventBus, Services, Wikimedia-Stream, User-mobrovac: Public Event Streams - https://phabricator.wikimedia.org/T130651#2658860 (Ottomata) Actually, by both I didn't mean websockets and SSE. I meant HTTP with SSE at one endpoint, and then HTTP on a non SSE endpoint, on which...
[14:07:33] joal: hiya sure!
[14:07:44] oh we gotta do some puppet thing!
[14:07:51] ottomata: batcave?
[14:07:51] so I see two options: 1. make a new public list for notices like this or 2. craft subject lines such that people can set up filters and send a note about the convention
[14:07:56] or IRC?
[14:08:05] sho
[14:08:09] bc
[14:08:31] elukey: what I was going to ask is, do you know if that replication script runs on one of those stat boxes and needs to be restarted manually?
[14:08:40] (I'll ask in -databases too)
[14:09:10] anyway, milimetric & elukey, thanks for the chat. I'll think some more and come back with a proposal.
[14:09:28] FWIW, I see missing this as my fault and something I should work on fixing.
[14:10:55] I'll be available to chat about your proposal when you're ready :)
[14:11:24] milimetric: mmmmm not that I know of
[14:11:25] checking
[14:11:40] when did it stop?
[14:11:59] it looks like some tables are about 6-7 hours behind and one of them was 24 hours behind
[14:14:23] ah milimetric it seems to be running via role::mariadb::analytics::custom_repl_slave, which IIRC is one of the analytics nodes
[14:14:26] checking
[14:14:58] no I was wrong :D
[14:17:17] come to -databases, elukey, jaime's looking at it
[14:18:52] Analytics-Kanban, EventBus, Services, Wikimedia-Stream, User-mobrovac: Public Event Streams - https://phabricator.wikimedia.org/T130651#2658903 (mobrovac) To me, the most compelling reason for using WS are browsers. I share your feelings: > But, I'm a backend engineer, and I betcha there are...
[14:42:51] Analytics-Kanban, EventBus, Services, Wikimedia-Stream, User-mobrovac: Public Event Streams - https://phabricator.wikimedia.org/T130651#2659104 (Nuria) > I'm a backend engineer, and I betcha there are lots of folks who want to consume a stream in a browser. Answer is yes. Even if it is for "s...
[14:59:35] elukey: Do you mind putting back an1001 as yarn master?
[14:59:51] it causes troubles accessing UIs (even with tunnels)
[15:01:02] mforns: milimetric standduppp
[15:06:00] joal: sure :)
[15:08:52] Analytics-Kanban, EventBus, Services, Wikimedia-Stream, User-mobrovac: Public Event Streams - https://phabricator.wikimedia.org/T130651#2659235 (Ottomata) > Why the latter? If we exclude for a moment browser-based solutions from the discussion, what would be the benefits in using the latter s...
[15:27:18] Analytics: Kill limn1 - https://phabricator.wikimedia.org/T146308#2659277 (Nuria)
[15:30:44] !log analytics1001 is back as Yarn/HDFS master
[15:30:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[15:31:27] joal: done :)
[15:32:33] thanks elukey :)
[15:41:14] joal: [@analytics1027:/mnt/hdfs/user/hive] 2 $ ls
[15:41:15] hive-site.xml warehouse
[15:41:47] Thanks a million ottomata !!!
[15:42:04] Will test, then submit CRs for refinery :)
[15:42:07] ottomata: --^
[15:42:30] k cool
[16:00:25] gotta run out to apt. bbl
[16:16:21] Analytics: Replacing standard edit metrics in dashiki with data from new edit data depot - https://phabricator.wikimedia.org/T143924#2659403 (Nuria)
[16:18:24] Analytics: Replacing standard edit metrics in dashiki with data from new edit data depot - https://phabricator.wikimedia.org/T143924#2659406 (Nuria)
[17:13:31] Analytics-Kanban: Productionize Pivot UI - https://phabricator.wikimedia.org/T138262#2659496 (elukey)
[17:13:51] Analytics, Research-and-Data, Research-collaborations, Research-management, Patch-For-Review: Oozie job to extract data for WDQS research - https://phabricator.wikimedia.org/T146064#2659497 (leila) @nuria re your comment on https://gerrit.wikimedia.org/r/#/c/311964/ about the IP address: it's...
[17:15:41] Analytics-Kanban: Upload the final version of the pivot repo and load test - https://phabricator.wikimedia.org/T146389#2659499 (elukey)
[17:16:30] Analytics-Kanban: Upload the final version of the pivot repo and load test - https://phabricator.wikimedia.org/T146389#2659499 (Nuria) Siege would be of help here: https://wikitech.wikimedia.org/wiki/Analytics/Siege
[17:17:05] Analytics-Kanban: Upload the final version of the pivot repo and load test - https://phabricator.wikimedia.org/T146389#2659521 (Milimetric)
[17:26:35] joal: you wanted to brain-bounce now? isn't it late?
[17:27:01] I have 1/2h before my wife comes and shouts at me ;)
[17:27:02] let's go for it !
[17:27:34] milimetric: --^
[17:27:40] milimetric: to the batcave !
[17:27:59] omw
[17:28:30] mforns: talking with milimetric in the cave about weird pages
[17:28:38] mforns: just if you feel like it :)
[17:28:52] joal, there's this meeting in 2 minutes though
[17:29:01] oh yeah !
[17:29:03] no problemo
[17:31:36] nuria_: ping meeting
[17:31:41] nvm :)
[18:01:03] Analytics-Cluster, Operations, ops-eqiad: decom titanium - https://phabricator.wikimedia.org/T145666#2659725 (Dzahn) alright, handing it over to ops-eqiad now. the server has been shutdown, public IP removed from DNS, removed from puppet/icinga etc. please go ahead with physical decom steps, disk...
[18:01:19] Analytics-Cluster, Operations, ops-eqiad: decom titanium - https://phabricator.wikimedia.org/T145666#2659739 (Dzahn) a:Dzahn>None
[18:02:42] Analytics-Cluster, Operations, ops-eqiad: decom titanium - https://phabricator.wikimedia.org/T145666#2637714 (Dzahn) location: eqiad row B, B4 @ 13
[18:09:14] milimetric: do you have context about future fields and events yall might need for the mediawiki history project?
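Siege, referenced above for the Pivot load test, hammers a host with concurrent HTTP requests and reports throughput. As a rough illustration of the same idea (not a replacement for siege, and with a deliberately injectable `fetch` so it never needs the network), a miniature concurrent load harness:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_load(urls, fetch, concurrency=4):
    """Fire fetch(url) over urls with a thread pool, siege-style.

    Returns (results, elapsed_seconds). `fetch` is injectable so the
    same harness works with urllib, requests, or a stub in tests;
    results come back in the same order as `urls`.
    """
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(fetch, urls))
    return results, time.monotonic() - start
```

A real run would pass something like `lambda u: urllib.request.urlopen(u).status` and a list of AQS-style URLs; requests-per-second is then `len(urls) / elapsed`.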
[18:09:23] there is a patch from collab team that is adding some fields
[18:09:30] want to make sure they make sense from yall's perspective
[18:16:55] https://gerrit.wikimedia.org/r/#/c/312274/1
[20:15:34] ottomata if you are there.. could you please install siege on aqs1001?
[20:15:52] ottomata: i want to run load tests on the old hosts to sanity check numbers
[20:16:05] can do!
[20:16:10] ottomata: maythnaks sir
[20:16:13] *thanks
[20:16:30] done.
[20:18:34] wow
[20:18:36] k
[21:14:30] Analytics-Kanban, EventBus, Services, Wikimedia-Stream, User-mobrovac: Public Event Streams - https://phabricator.wikimedia.org/T130651#2660509 (Afandian) I see a lot of browser-based discussion here. I'd just like to contribute my voice as an existing consumer of the existing RC Stream and h...
[21:17:19] Analytics-Kanban, EventBus, Services, Wikimedia-Stream, User-mobrovac: Public Event Streams - https://phabricator.wikimedia.org/T130651#2660511 (Ottomata) @Afandian thanks for chiming in. Could you comment on what is easier for you in Java / Python? socket.io vs a more simple streamed HTTP...
[21:42:06] Quarry: it would be useful to run the same Quarry query conveniently in several database - https://phabricator.wikimedia.org/T95582#1195035 (Quiddity) This feature would be incredibly helpful, IIUC. I have two tasks that require checking things across all our projects, and I don't know how else to do it, oth...
[21:47:28] Analytics-Kanban, EventBus, Services, Wikimedia-Stream, User-mobrovac: Public Event Streams - https://phabricator.wikimedia.org/T130651#2660674 (GWicke) On the Java SSE client front, I see these two options: - https://github.com/michaelklishin/eventsource-netty5 - https://jersey.java.net/api...
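For the SSE-vs-plain-streamed-HTTP debate running through T130651 above: the SSE wire format is just line-oriented text (`event:`/`data:` fields, blank-line delimited), which is why a non-browser consumer needs very little client machinery. A minimal parser sketch, simplified relative to the full EventSource spec (ignores `id:`/`retry:` fields and uses `strip()` rather than the spec's single-leading-space rule):

```python
def parse_sse(lines):
    """Parse an iterable of SSE text lines into a list of events.

    Each event is a dict with 'event' (default 'message') and 'data'
    (multiple data: lines joined with newlines). Comment lines
    (starting with ':') serve as keep-alives and are skipped.
    """
    events, current = [], {"event": "message", "data": []}
    for line in lines:
        if line == "":  # blank line dispatches the accumulated event
            if current["data"]:
                events.append({"event": current["event"],
                               "data": "\n".join(current["data"])})
            current = {"event": "message", "data": []}
        elif line.startswith(":"):
            continue  # comment / keep-alive
        elif line.startswith("event:"):
            current["event"] = line[len("event:"):].strip()
        elif line.startswith("data:"):
            current["data"].append(line[len("data:"):].strip())
    return events
```

In practice the lines would come from a streamed HTTP response (e.g. `response.iter_lines(decode_unicode=True)` with requests) against a hypothetical endpoint shaped like the `/{topic}{/partition}{/offset}` URI structure proposed earlier in the thread.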
[22:39:16] (PS1) Milimetric: Upgrade to semantic 2 everywhere [analytics/dashiki] - https://gerrit.wikimedia.org/r/312430 (https://phabricator.wikimedia.org/T118846)
[22:39:28] (PS2) Milimetric: [WIP] Upgrade to semantic 2 everywhere [analytics/dashiki] - https://gerrit.wikimedia.org/r/312430 (https://phabricator.wikimedia.org/T118846)
[23:25:24] (CR) Nuria: "Let me know when you think you are ready and I will test layout in chrome/ff/safari" [analytics/dashiki] - https://gerrit.wikimedia.org/r/312430 (https://phabricator.wikimedia.org/T118846) (owner: Milimetric)
[23:27:18] (CR) Nuria: [WIP] Upgrade to semantic 2 everywhere (1 comment) [analytics/dashiki] - https://gerrit.wikimedia.org/r/312430 (https://phabricator.wikimedia.org/T118846) (owner: Milimetric)
[23:43:28] Quarry: it would be useful to run the same Quarry query conveniently in several database - https://phabricator.wikimedia.org/T95582#1195035 (yuvipanda) @Quiddity having some form of official resources dedicated to it might be helpful. I unfortunately don't think I'll have any bandwidth to be able to look at...