[07:00:23] Analytics-Cluster, Analytics-Kanban, Operations, ops-eqiad: analytics1049.eqiad.wmnet disk failure - https://phabricator.wikimedia.org/T137273#2408707 (elukey) @Cmjohnson the server is showing another disk failure, it seems that we are not lucky with this one: ``` [1600281.300136] EXT4-fs error...
[07:01:24] analytics1049 has another failed disk :/
[07:08:50] stopped it completely to remove it from the cluster, the phab task has been updated :)
[08:38:26] Hi elukey
[08:38:32] Thanks for taking care of the disks
[08:38:53] a-team: Due to disk failure, some oozie jobs failed yesterday; everything has been relaunched
[08:39:07] thanks!
[08:39:38] elukey: any news on network for cassandra?
[08:41:45] nope, need to chat with Alex
[08:41:56] k
[08:41:58] :)
[08:42:00] Thanks
[09:29:16] Analytics: Varnishkafka should auto-reconnect to abandoned VSM - https://phabricator.wikimedia.org/T138747#2408827 (elukey)
[09:33:03] Analytics: Varnishkafka should auto-reconnect to abandoned VSM - https://phabricator.wikimedia.org/T138747#2408842 (elukey)
[09:57:45] Analytics: Evaluate alternatives to varnishkafka: varnishevents - https://phabricator.wikimedia.org/T138426#2408902 (elukey)
[10:52:33] Analytics-Tech-community-metrics, Developer-Relations: Deployment of Gerrit Delays panel for engineering - https://phabricator.wikimedia.org/T138752#2408971 (Lcanasdiaz)
[10:57:13] joal: https://phabricator.wikimedia.org/T138609#2408949 - you are free to test :)
[10:58:02] elukey@analytics1030:~$ telnet aqs1006.eqiad.wmnet 9160
[10:58:02] Trying 10.64.48.146...
[10:58:03] telnet: Unable to connect to remote host: Connection refused
[10:58:33] that is better elukey@analytics1030:~$ telnet aqs1006-a.eqiad.wmnet 9160
[10:58:36] Trying 10.64.48.148...
[10:58:38] \o/
[10:58:40] Connected to aqs1006-a.eqiad.wmnet.
[10:58:42] YAY !
[10:58:46] Thanks elukey :)
[10:58:49] Testing again !
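Note: the telnet checks above (connection refused on aqs1006, connected on aqs1006-a, both on port 9160) can be scripted. The following is only a minimal sketch of a TCP reachability probe; the helper name `port_is_open` and the 3-second timeout are assumptions for illustration, not something used in the channel. Like telnet, it only tells you whether a TCP connection can be established from the machine you run it on, nothing about the Cassandra protocol behind the port.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper: it mirrors what `telnet host port` tells you.
    Refused connections, timeouts and DNS failures all map to False,
    because they are all subclasses of OSError.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

From a Hadoop worker one could call, for instance, `port_is_open("aqs1006-a.eqiad.wmnet", 9160)`. A timeout rather than an immediate refusal usually hints at a firewall or ACL dropping packets, which a boolean result hides, so this is a first check rather than a diagnosis.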
[11:00:56] Analytics: Varnishkafka should auto-reconnect to abandoned VSM - https://phabricator.wikimedia.org/T138747#2408992 (ema) p:Triage>Normal
[11:01:44] Analytics: Evaluate alternatives to varnishkafka: varnishevents - https://phabricator.wikimedia.org/T138426#2408994 (ema) p:Triage>Normal
[11:03:00] Wow elukey, huge drop in bytes from aqs100[123] for the last 20 mins
[11:03:04] elukey: any idea?
[11:04:11] arf, my bad elukey, was because of loading not happening at expected time
[11:04:16] elukey: sorry for false alarm
[11:05:21] ah yes I was wondering looking at https://grafana.wikimedia.org/dashboard/db/aqs-cassandra-system
[11:05:24] goooood
[11:05:43] joal: https://wikitech.wikimedia.org/wiki/User:Elukey/Ops/AQS_Settings#Proposal
[11:05:56] I am writing this document to recap everything
[11:06:05] then I'll send it to ops for review
[11:06:05] elukey: looks like I still have issues:(
[11:06:10] should be completed today
[11:06:33] awesome :)
[11:06:45] joal: expected, whatever we do together can't be resolved in less than a week :P
[11:06:54] huhu
[11:07:07] do we get a different error now?
[11:07:10] or is it the same?
[11:07:15] no error yet
[11:07:19] waiting
[11:07:42] ah sorry I thought "looks like I still have issues:(" was related to AQS loading
[11:08:07] yes it is, but error takes some time to show up (looks similar to the previous one)
[11:09:41] actually elukey yes, error is different: connection timed out
[11:23:09] ah snap
[11:23:26] elukey: investigating more
[11:23:28] theoretically the connection should work, timeout is a bit weird
[11:23:30] checking logs
[11:23:39] great
[11:23:40] is there a specific host ?
[11:23:47] or instance
[11:23:55] elukey: It will inform me if I know that connection happened :)
[11:23:57] elukey: any
[11:24:26] aqs1005-a as my first example, but each had the same error
[11:25:56] could it be something client specific?
[11:26:05] That's what I'm after
[11:26:23] super
[11:28:49] joal: maybe write_request_timeout_in_ms=2000 ?
[11:30:52] mm maybe not
[11:31:05] didn't find any particular error msg in the logs
[11:31:23] elukey: trying again with another spec
[11:34:12] not related, but from 2.1 datastax docs
[11:34:13] The compaction strategy DateTieredCompactionStrategy precludes using read repair, because of the way timestamps are checked for DTCS compaction. In this case, you must set read_repair_chance to zero. For other compaction strategies, read repair should be enabled with a read_repair_chance value of 0.2 being typical.
[11:34:36] https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesReadRepair.html
[11:35:15] Analytics, Revision-Slider, TCB-Team, WMDE-Analytics-Engineering, and 2 others: Data need: Explore range of article revision comparisons - https://phabricator.wikimedia.org/T134861#2409042 (WMDE-leszek)
[12:00:27] Maaaaaan
[12:01:10] elukey: I need to apologize, seems that 9160 is not the only one needed
[12:01:43] elukey: storage port is another one :(
[12:01:45] 7000
[12:06:01] hi team!!
[12:06:08] Hey mforns !
[12:06:24] how are you doing guys :]
[12:06:39] not bad !
[12:06:41] you?
[12:06:50] good :]
[12:07:43] * mforns catches up with email
[12:19:59] mforns: helloooo
[12:20:05] hi!
[12:20:06] how are you?
[12:20:09] goood :]
[12:20:11] you?
[12:20:15] goooood
[12:20:23] joal: ouch also 7000
[12:20:25] do you guys have news from madhuvishy?
[12:21:01] mforns: last news she had a passport
[12:21:13] aha I saw that, cool
[12:21:25] elukey: Actually doc is very obscure on which ports for what
[12:21:48] But from the code, looks like 7000 is used :(
[12:22:06] and I can't find 9160 used in code
[12:22:22] So, looks like I shouldn't have trusted docs :(
[12:22:26] elukey: --^
[12:22:48] * joal is tired of cassandra fight - Hopefully cassandra is even more
[12:23:33] joal: we can have a look at dropped packets for a query if you want to have a definitive answer
[12:23:56] moritzm: that'd be great
[12:24:18] moritzm: only issue is that it comes from hadoop, so changing hosts
[12:24:41] But I can tell you once the job has started
[12:26:00] moritzm: would that work?
[12:31:20] sure, but the logging rules need to be set up on the receiving hosts, if that one is non-volatile, that's no problem
[12:31:51] the weird thing is that port 7000 seems used for internode communications https://docs.datastax.com/en/cassandra/2.1/cassandra/security/secureFireWall_r.html
[12:31:56] moritzm: issue is that there seem to be ACLs in the middle as well
[12:32:04] elukey: I know !!!
[12:32:37] elukey: I promise it's the last time I deal with a goddess !
[12:33:01] maybe it is a trick to let the node believe that the sstables are coming from the cluster
[12:33:34] I am interested in the logging rules too
[12:33:44] (I guess that those will be iptables ones, right?)
[12:33:45] elukey: possible
[12:33:54] which hosts are we talking about here?
[12:34:02] port 7000 is already allowed in cassandra
[12:34:08] aqs100[456]-[a]
[12:34:16] aqs100[456]-[ab] sorry
[12:34:28] let me have a look, I think I know the problem
[12:34:34] moritzm: with data coming from analytics network?
[12:38:25] it is traffic from analytics* (hadoop) to cassandra port 7000
[12:38:32] not sure if allowed by net ACL
[12:38:39] and also by ferm
[12:38:43] so, in role::aqs port 7000 is allowed for cassandra_hosts_ferm
[12:38:51] which is generated from Hiera:
[12:38:55] cassandra::seeds
[12:39:48] but cassandra::seeds for aqs only contains the names of the actual aqs hosts
[12:39:50] moritzm: can't connect from an analytics machine
[12:40:11] but not the IP addresses/hostnames assigned to the -a and -b instances
[12:40:27] so this might be the problem
[12:40:40] k
[12:41:02] can we have a specific example, from where does the query originate and which hosts did it reach?
[12:42:35] moritzm: I tried to telnet from analytics1048 to aqs1004-a on port 7000
[12:50:25] so that won't work with the current rules, access to 7000 on aqs1004 is only allowed from the IP addresses of the Cassandra instances, not from anywhere else
[12:50:40] so aqs100[4-6]-[ab] at the moment
[12:51:05] which software package instead of telnet would access port 7000 from the hadoop cluster?
[12:54:49] port 7000 is documented for inter-node communication only, though, so I'm wondering whether we actually need this?
[12:55:31] moritzm: I'm trying to bulk load SStables into cassandra
[12:55:47] same doubts that I have
[12:55:53] feels wrong to use 7000
[12:56:11] but I am sure that it is something to trick the nodes to accept sstables
[12:56:37] urandom: Hi! Any thoughts about port 7000 used for SSTables bulk load?
[12:59:48] (brb)
[13:17:06] Analytics-Tech-community-metrics: Mediawiki support to be added to GrimoireLab - https://phabricator.wikimedia.org/T138007#2409146 (Lcanasdiaz) The Mediawiki support for Perceval is being finished this week.
[13:18:09] Analytics-Tech-community-metrics, Developer-Relations: Deployment of Demography panel - https://phabricator.wikimedia.org/T138757#2409147 (Lcanasdiaz)
[13:24:58] Hey mforns I got my passport this morning. Just got US visa approved.
They said they'll call tomorrow morning and I can go to the embassy and get it
[13:29:12] madhuvishy: \o/
[13:31:50] madhuvishy, cooooool...
[13:31:54] finally
[13:32:45] did you make it to Esino Lario?
[13:43:57] mforns: yeah! :) I did
[13:44:01] On wednesday
[13:44:07] Came to milan this morning
[13:44:38] elukey: :)
[14:22:03] a-team: first draft of what we discussed in Berlin - https://wikitech.wikimedia.org/wiki/User:Elukey/Ops/AQS_Settings
[14:22:15] probably still full of typos and things to add
[14:22:19] early comments are welcome :)
[14:22:59] elukey, at first glance looks amazing
[14:25:43] thanksss!! I still need to learn how to place those images since I am a wikitext newbie
[14:33:00] elukey: context?
[14:33:09] Hi urandom
[14:33:14] hi!
[14:33:34] I'm having trouble bulk feeding cassandra
[14:33:39] * urandom is reading the backlog
[14:33:48] ports to be used is one of the issues
[14:36:28] urandom: o/
[14:38:28] the bulk loading uses streaming
[14:38:39] (sorry my internets is janky)
[14:39:06] but yeah, 7000 sounds right
[14:39:16] mwarf
[14:39:24] is this bad?
[14:39:35] * joal apologizes to elukey for having noticed the wrong port :(
[14:40:30] urandom: doc on streaming is not really proficient, so I've been fighting a bit on how it works, and asked elukey (and ops) to open 9160
[14:41:00] mmm
[14:41:03] urandom: It's just about me asking too many things, nothing really wrong
[14:41:22] bulk loading is an advanced subject i guess
[14:41:31] urandom: we allow traffic coming from the other nodes of the AQS cluster to port 700
[14:41:34] *7000
[14:41:46] meanwhile we should allow traffic from hadoop to enable this
[14:41:53] not a big deal but it felt weird
[14:41:54] :)
[14:42:12] you're right urandom, and all the doc is about using sstableloader tool, not the java classes
[14:42:28] Thanks for the confirmation urandom :)
[14:42:42] the hadoop node(s) streaming data are technically nodes for the duration of the stream
[14:42:51] Cassandra nodes that is
[14:43:11] from a messaging perspective anyway
[14:43:12] elukey: Do you think it's possible to get port 7000 opened?
[14:43:16] is that bulk loading one time thing or an ongoing task?
[14:43:19] makes sense urandom
[14:43:44] moritzm: currently one shots, we are learning, but in the end if it works as expected, could become ongoing
[14:44:59] will that require access from any hadoop node or will that happen from a designated one?
[14:45:48] moritzm: any, as always with hadoop, we can't know which of the nodes will be designated to execute the task
[14:46:01] ok
[15:37:13] elukey, moritzm: sorry to bother again, but I'm not sure of what I can expect from you guys :)
[15:37:44] Are you opening the 7000 ports? if yes, when?
[15:41:35] One thing that was proposed before was to create a hiera config for the hadoop nodes only, to allow only those ones rather than the whole analytics network in ferm
[15:41:38] might be an option
[15:44:53] Would work for me elukey :)
[15:45:38] I think we should wait for moritzm's advice, since we risk opening big security holes if we keep opening ports to achieve a goal :)
[15:46:00] ok elukey :)
[15:46:15] elukey: yeah, that sounds good, but we need it tweakable via Hiera so that it only applies to the 7000 port of aqs, not to cassandra as used in restbase
[15:46:17] I'll be completely fine to close those holes once we find the working ones !
[15:48:10] moritzm: would something like https://gerrit.wikimedia.org/r/#/c/295907/3/manifests/role/aqs.pp be ok? Maybe not with the whole analytics network but only the hadoop nodes
[15:56:08] yeah, let's rather limit to the hadoop nodes
[16:33:29] all right, going offline team!
[16:33:33] see you tomorroW1
[16:33:35] w!
[16:33:36] :)
[16:47:28] Bye elukey
[19:55:19] bye a-team!!
[19:55:26] Good night mforns
[19:55:36] night joal1
[19:55:38] !
[21:32:16] (PS5) Addshore: Add WikidataArticlePlaceholderMetrics [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295896 (https://phabricator.wikimedia.org/T138500)
[21:33:23] (CR) Addshore: "@joal So AFAIK this is now correct." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295896 (https://phabricator.wikimedia.org/T138500) (owner: Addshore)