[05:28:25] (CR) Joal: "One last small round of things - A bizarre check, a naming discussion and an idea for later patch." (6 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/647092 (https://phabricator.wikimedia.org/T266872) (owner: Ottomata)
[08:37:05] going to the dentist, will be back in a few :)
[08:52:40] elu_key: snap! :P
[10:04:14] back!
[10:05:55] https://issues.apache.org/jira/browse/HIVE-8337
[10:06:11] so from our version, I think that perms for hive are inherited from the parent dir
[10:06:25] joal, elukey: another question about java builds. I'm cleaning up the release process and I see that maven-release-docker always sends email but update-jars-docker does not send email on failure
[10:06:59] context: https://github.com/wikimedia/integration-config/blob/master/jjb/analytics.yaml#L74 and https://github.com/wikimedia/integration-config/blob/master/jjb/analytics.yaml#L153
[10:07:24] is that on purpose? Or could I extract a common macro for both (and aligning the way emails are sent)
[10:08:08] gehel: I don't think it is on purpose, for me it is super fine to extract a macro
[10:08:25] cool! that's easier than adding parameters all over!
[10:18:47] gehel: also keep in mind that my authority on these things is close to zero, so wait for other opinions as well :D
[10:19:17] as an SRE, don't you have final authority on everything?
[10:20:07] gehel: they let me believe that I have some authority, like you do with crazy people, nothing more :D
[10:30:28] France works with crazy people. Let me know if I should ask her to steal some medication
[11:33:00] * elukey afk!
[12:54:33] hi elukey, how are you?
[12:57:00] Hi gehel :) You were right in considering elukey has authority on everything :)
[12:58:16] I've just realized that my venvs stop working on stat1006, and 1007. I get a fatal error: Fatal Python error: Py_Initialize: Unable to get the locale encoding
[13:52:02] dsaez: o/ We have upgraded the hosts to Debian 10 (that has python3.7) so if you haven't accessed them in a while I think that you need to re-create them
[13:52:55] thx elukey. So, new venvs?
[13:53:22] dsaez: yep! https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter#Resetting_user_virtualenvs
[13:53:30] it should work afterwards
[13:53:32] great thx
[13:53:33] if not lemme know
[14:23:47] (CR) Ottomata: Refine using PERMISSIVE mode and log more info about corrupt records (6 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/647092 (https://phabricator.wikimedia.org/T266872) (owner: Ottomata)
[14:23:49] (PS7) Ottomata: Refine using PERMISSIVE mode and log more info about corrupt records [analytics/refinery/source] - https://gerrit.wikimedia.org/r/647092 (https://phabricator.wikimedia.org/T266872)
[14:36:15] Analytics, Operations, ops-eqiad: Degraded RAID on an-coord1002 - https://phabricator.wikimedia.org/T270768 (Volans) p:Triage→Medium
[15:31:35] (PS2) Milimetric: [WIP] Add log-entry create schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/651635 (https://phabricator.wikimedia.org/T263055)
[15:37:45] (CR) jerkins-bot: [V: -1] [WIP] Add log-entry create schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/651635 (https://phabricator.wikimedia.org/T263055) (owner: Milimetric)
[15:41:15] (PS3) Milimetric: [WIP] Add log-entry create schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/651635 (https://phabricator.wikimedia.org/T263055)
[15:42:28] (CR) jerkins-bot: [V: -1] [WIP] Add log-entry create schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/651635 (https://phabricator.wikimedia.org/T263055) (owner: Milimetric)
[15:42:44] Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production - https://phabricator.wikimedia.org/T120242 (Ottomata)
[15:43:20] Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production - https://phabricator.wikimedia.org/T120242 (Ottomata)
[15:46:38] (PS4) Milimetric: [WIP] Add log-entry create schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/651635 (https://phabricator.wikimedia.org/T263055)
[15:53:00] !log point analytics-hive.eqiad.wmnet back at an-coord1001 - T268028 T270768
[15:53:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:53:05] T270768: Degraded RAID on an-coord1002 - https://phabricator.wikimedia.org/T270768
[15:53:05] T268028: Move oozie's hive2 actions to analytics-hive.eqiad.wmnet - https://phabricator.wikimedia.org/T268028
[15:55:18] Analytics, Analytics-Kanban, Patch-For-Review: Move oozie's hive2 actions to analytics-hive.eqiad.wmnet - https://phabricator.wikimedia.org/T268028 (Ottomata) Nice, I just did my first failover too (due to T270768). Back at an-coord1001 now. :)
[15:58:20] Analytics, Operations, ops-eqiad, Patch-For-Review: Degraded RAID on an-coord1002 - https://phabricator.wikimedia.org/T270768 (Ottomata) This node should now be in standby mode and should be safe to take offline at any time. As it is in standby, I believe it should be fine to wait until after th...
[16:37:27] (PS1) Neil P. Quinn-WMF: Set up and document deployment strategy for jobs [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/651794 (https://phabricator.wikimedia.org/T261953)
[16:55:17] Analytics-Clusters, DC-Ops, Operations, ops-eqiad: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (Cmjohnson)
[16:57:27] Analytics-Clusters, DC-Ops, Operations, ops-eqiad: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (Cmjohnson)
[16:58:25] Analytics-Clusters, DC-Ops, Operations, ops-eqiad: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (Cmjohnson) These are racked, need bios setup
[17:02:02] a-team, some of us are on pa sync-up, will be a couple mins late to standup!
[17:03:27] a-team https://meet.google.com/knt-efmf-bzd?authuser=1
[17:03:36] standup with grant!
[17:04:20] ottomata: ^
[17:24:18] Analytics, Operations, ops-eqiad, Patch-For-Review: Degraded RAID on an-coord1002 - https://phabricator.wikimedia.org/T270768 (elukey) Please ping Analytics before shutting down the host since there is a database running on it, so I'd prefer to do things gracefully and stop replication from an-co...
[17:54:27] Analytics-Clusters, DC-Ops, Operations, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (elukey)
[18:01:57] ottomata: would you have some time for me to talk about kafka and TLS ?
[18:11:58] joal: o/
[18:12:18] elukey: shouldn't you be gone?>
[18:12:19] so we got spaces in the racks for 18 an-worker nodes, that means ~864T
[18:12:25] yes yes in theory
[18:12:28] YEAH
[18:12:28] :)
[18:12:35] without the masters though
[18:12:40] that is the big issue
[18:13:01] for masters do we go 1G?
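For reference, the capacity figure quoted above (18 an-worker nodes, ~864T) can be sanity-checked with a few lines of Python. This is only a sketch: the per-node size is derived from those two numbers rather than stated anywhere in the log, and the constant and function names are made up for illustration.

```python
# Figures taken from the chat: 18 an-worker nodes are quoted as ~864T of raw space.
TOTAL_NODES = 18
TOTAL_RAW_TB = 864


def per_node_tb(total_tb: float = TOTAL_RAW_TB, nodes: int = TOTAL_NODES) -> float:
    """Raw space per node implied by the quoted totals (derived, not stated in the log)."""
    return total_tb / nodes


def raw_tb_after_removing(nodes_removed: int) -> float:
    """Raw space left if some of the nodes are set aside for another role."""
    return (TOTAL_NODES - nodes_removed) * per_node_tb()


if __name__ == "__main__":
    print(per_node_tb())
    print(raw_tb_after_removing(2))
```

Calling `raw_tb_after_removing(2)` models setting two of the 18 nodes aside, which is the case the conversation turns to next.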
[18:13:17] if we remove two nodes for the master, it will be -96T (12*4*2)
[18:13:50] elukey: In any case, with that we can start copying - It'll rebalance when more nodes are added
[18:14:21] we could add two of them to 1g and ask the dcops team to relocate later on, it could be an option
[18:14:49] it is not a bad idea, I'll follow up in january about this
[18:14:55] Cool
[18:15:16] anyway, I wanted to check if 16/18 nodes were ok, and it seems that as starter they might
[18:15:20] okok
[18:15:28] I am trying to keep a maximum of 5 nodes for each rack
[18:15:49] so if a switch goes down or if there is a power failure, the blast radius is that
[18:15:57] I think our calculation said less than 800Tb - So yes we're ok - Not stretch but ok :)
[18:16:13] but we could speed up the racking if we allowed say more nodes on the same rack, like 6/7, but I don't like it al ot
[18:16:16] *a lot
[18:16:22] Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production - https://phabricator.wikimedia.org/T120242 (Ottomata)
[18:16:43] I've seen that elukey - It's wise for sure
[18:18:29] super I also wanted to double check it with you :)
[18:18:56] It's all good elukey - We can start copying early next year!
[18:20:11] joal: today I made a bold statement and an-coord1002 reacted breaking a disk, let's keep it quiet :D
[18:20:28] :D :D :D
[18:20:38] jokes aside, I really hope so!
[18:21:24] * joal doesn't want to jinx anything :)
[18:21:51] also elukey: I have a working gobblin with kafka client 1 - Now It's about adding tls :)
[18:25:44] ah nice! Do you need settings?
[18:26:10] I do! I pinged ottomata, I didn't want to ping you :(
[18:26:19] Jumbo should be configured to allow clients to connect via TLS
[18:26:34] lemme pull some from another service
[18:29:03] kafka.security.protocol=SSL
[18:29:03] kafka.ssl.ca.location=/var/lib/puppet/ssl/certs/ca.pem
[18:29:03] kafka.ssl.cipher.suites=ECDHE-ECDSA-AES256-GCM-SHA384
[18:29:03] kafka.ssl.curves.list=P-256
[18:29:04] kafka.ssl.sigalgs.list=ECDSA+SHA256
[18:29:07] joal: --^
[18:29:24] this is what we set for kafkatee for example
[18:29:33] Ack! testing
[18:30:22] kafka.ssl.ca.location=/etc/ssl/certs/Puppet_Internal_CA.pem might also be needed
[18:30:33] joal: on what node are you testing ?
[18:31:07] elukey: I'm on stat1008
[18:31:43] elukey: java.lang.IllegalArgumentException: Unsupported CipherSuite: ECDHE-ECDSA-AES256-GCM-SHA384
[18:32:01] * elukey plays sad_trombone.wav
[18:32:29] now I recall that Valentin had to file a pull request to librdkafka to support extra bits
[18:32:53] maybe try to remove it and see if it works, we can follow up with upstream in case
[18:37:13] joal: --^
[18:38:33] Nope I have not managed to make it work without :(
[18:38:36] elukey: --^
[18:38:46] different error?
[18:38:59] elukey: however I have seen that format is different for java from what I read (underscores for instance)
[18:39:09] yes elukey, different error
[18:40:13] ahh I didn't know the different format, that one is librdkafka-related
[18:41:07] I think so elukey
[18:44:27] elukey: I got a working suite
[18:44:59] Analytics-Clusters, DC-Ops, Operations, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (elukey) >>! In T260445#6706896, @Cmjohnson wrote: > @elukey that will work, I will add 1 to B4 and 2 to C2. Thanks! Followed up with Chris on IRC,...
[18:45:41] nice!
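The format difference mentioned above is most likely the gap between OpenSSL-style suite names (the form used in the kafkatee/librdkafka settings pasted earlier) and the IANA names with underscores that Java's TLS stack expects. A minimal lookup sketch follows. The log does not record which suite ended up working, so the table only lists a few well-known standard equivalents, and `to_jsse_name` is a hypothetical helper, not part of any Kafka or Gobblin API.

```python
# OpenSSL-style names (as used by librdkafka) mapped to the IANA names Java uses.
# Only a handful of standard equivalents are listed; extend as needed.
OPENSSL_TO_JSSE = {
    "ECDHE-ECDSA-AES256-GCM-SHA384": "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
    "ECDHE-RSA-AES256-GCM-SHA384": "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
    "ECDHE-ECDSA-AES128-GCM-SHA256": "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
    "ECDHE-RSA-AES128-GCM-SHA256": "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
}


def to_jsse_name(openssl_name: str) -> str:
    """Translate an OpenSSL cipher suite name to its Java-style name (hypothetical helper)."""
    try:
        return OPENSSL_TO_JSSE[openssl_name]
    except KeyError:
        raise ValueError(f"no known Java equivalent for {openssl_name!r}") from None


if __name__ == "__main__":
    # The suite from the pasted settings, in the form a Java client would accept.
    print(to_jsse_name("ECDHE-ECDSA-AES256-GCM-SHA384"))
```

Passing the OpenSSL form straight to a Java client is what produces the `Unsupported CipherSuite` exception quoted above.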
[18:46:40] elukey: error is even more bizarre now :(
[18:47:04] org.apache.kafka.common.network.SslTransportLayer - Failed to send SSL Close message
[18:47:07] java.io.IOException: Unexpected status returned by SSLEngine.wrap, expected CLOSED, received OK. Will not send close message to peer.
[18:47:31] anyone: I was looking at some of the Hive queries used for some of the other Oozie jobs, and stumbled upon this https://github.com/wikimedia/analytics-refinery/blob/master/oozie/cassandra/monthly/pageview_top_bycountry.hql#L34 and was wondering if Line 34 is necessary, because I couldn't find any references to `rn` elsewhere in the query
[18:49:49] Hi lexnasser - I think that's a leftover from https://github.com/wikimedia/analytics-refinery/blob/master/oozie/cassandra/monthly/pageview_top_articles.hql#L35
[18:50:15] lexnasser: the query for per-country has been copied away from the one from article, and in that original one rn is used :)
[18:50:41] I think in your use case rn will be useful lexnasser, I invite you to look at the query I pasted :)
[18:51:48] joal: I see, thanks for the explanation! I'll refer to that query you sent
[18:51:53] \o/
[18:53:04] joal: what ports did you use for the kafka brokers?
[18:53:09] 9093 is the TLS one
[18:53:14] AH!
[18:53:32] * joal facepalms and hides
[18:54:24] ahahhah nono I hope it doesn't lead to a more cryptic exception, this one was really ahrd
[18:54:27] *hard
[18:55:19] elukey: actually with correct port the job doesn't need any setting - Only encryption on :)
[18:56:06] joal: so it works??
[18:57:16] ok - on port 9093 job fails if I comment source.kafka.security.protocol=SSL, works otherwise :)
[18:57:24] elukey: this is a SUCCESS :)
[18:57:34] elukey: not from hadoop yet, but from stat1008
[18:58:45] ohhhh really nice!
[18:59:06] I'd really like to see stricter options enforced etc.. but if they don't work we can follow up with upstream
[18:59:08] Will try to make it work from hadoop
[18:59:09] this is really great
[18:59:33] elukey: stricter you mean the various SSL options you gave me?
[18:59:38] yes exactly
[18:59:44] I think the cipher-suite works
[18:59:47] Will test
[18:59:51] * elukey dances
[19:00:27] Indeed elukey :)
[19:01:52] elukey: I don't see options for curves or sigalgs in the list :(
[19:02:09] And also elukey, I don't see any ca location
[19:09:39] joal: then it is probably working because the jvms in our infra by default accept the puppet CA, otherwise it would fail for sure
[19:10:14] for the sigalgs I think that we'll need to file a github issue, but very minor :)
[19:10:20] really happy that works!
[19:10:44] So am I elukey! The little gift before holidays :)
[19:10:50] definitely :)
[19:10:54] going to dinner! o/
[19:12:58] Bye elukey - Enjoy holidays :)
[20:10:02] Gone for tonight team - enjoy your holidays :)
[20:12:08] me too leaving now, have fun these free days yall! hugs :]
[20:28:34] laters all! <3
[20:29:07] oh joal sorry i missed your ping! was in meetings
[21:18:00] Signing off! Have a happy end of the year all
[21:36:57] (PS1) Milimetric: Add noc.wikimedia to the whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/651844
[21:38:33] (CR) Milimetric: [V: +2 C: +2] Add noc.wikimedia to the whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/651844 (owner: Milimetric)
[21:43:43] happy end of the year raz zi!
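As a recap of the Kafka TLS thread above, the two things that mattered were pointing the client at the brokers' TLS port (9093) and setting `source.kafka.security.protocol=SSL`. The sketch below renders such settings and flags brokers on the wrong port. The assumptions: the broker hostnames are placeholders, `ssl.cipher.suites` is a standard Kafka Java client key but pairing it with the `source.kafka.` prefix is a guess, and all function names are hypothetical.

```python
TLS_PORT = 9093  # from the chat: "9093 is the TLS one"


def tls_client_properties(cipher_suites=(), prefix="source.kafka."):
    """Build the TLS-related client properties as a dict.

    Only security.protocol=SSL is confirmed by the log; the cipher suite key
    is optional and its use with this prefix is an assumption.
    """
    props = {f"{prefix}security.protocol": "SSL"}
    if cipher_suites:
        props[f"{prefix}ssl.cipher.suites"] = ",".join(cipher_suites)
    return props


def brokers_not_on_tls_port(brokers):
    """Return the broker addresses whose port is not the TLS port."""
    return [b for b in brokers if b.rsplit(":", 1)[-1] != str(TLS_PORT)]


def render_properties(props):
    """Render a dict as sorted key=value lines, as in a .properties file."""
    return "\n".join(f"{key}={value}" for key, value in sorted(props.items()))


if __name__ == "__main__":
    brokers = ["broker1.example:9092", "broker2.example:9093"]  # placeholder hosts
    print(brokers_not_on_tls_port(brokers))
    print(render_properties(tls_client_properties()))
```

Checking the port first is cheap and would have avoided the `SSLEngine.wrap` error quoted above, which gave no hint that the client was talking TLS to a non-TLS port.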