[00:11:23] re: the iOS page at 22:00 again... we did serve some 429s but not many. I think the threshold is a bit too low, actually
[00:11:51] also it looks like it's still a UA version from before they started smearing the requests over time... I'll ask about that tomorrow, although I suspect the App Store is just being slow to approve
[06:05:18] <_joe_> cdanis: it's usually semi-unpredictable
[06:07:52] <_joe_> (app store approval times)
[10:36:19] elukey, klausman: cron spam to root@ from prometheus-amd-rocm-stats on an-worker1101
[10:36:42] Looking into it
[10:36:47] cheers
[10:38:38] should stop spamming now
[10:57:19] <_joe_> klausman: nothing kill -9 $(pgrep exim) can't solve, right?
[10:57:58] more like iptables -I OUTPUT 1 -m tcp -p tcp --dport 25 -j DROP
[10:59:10] <_joe_> I think that for added fun you could just drop 50% of the packets
[10:59:34] nah, give it 1% for maximum evilness
[11:00:07] <_joe_> well the goal is to prevent emails from being sent
[11:00:20] I always wanted to play this prank: pick a machine a lot of people are doing development on (like a big lab server at a uni). Add a cronjob that sleeps a random number of seconds, checks the process list for anything whose argv[0] starts with "./", and sends it a SIGSEGV, SIGILL or SIGFPE
[11:00:32] ah, and 1% may not be too noticeable with TCP retransmission
[11:00:48] <_joe_> ahahahahahah klausman
[11:01:10] <_joe_> now I'm envious I didn't think about it when I was at university
[11:01:43] <_joe_> that's /really/ evil
[11:02:23] Yeah, even if I had the chance, I would probably not do it.
[11:02:33] Maybe on April 1st
[11:03:02] <_joe_> klausman: but the most evil thing I've seen was a student who didn't want to use a VCS, so he would copy his programs from the uni to home on floppies (yes, I'm old)
[11:03:49] <_joe_> one time he wanted to work on a project, so he copied the files already present on the uni workstation to his home computer, recompiled everything at home, then brought the code back and copied it back
[11:03:54] <_joe_> including the .o files
[11:04:12] <_joe_> recompiled the executable on the workstation, and left for the day
[11:04:47] <_joe_> imagine me the next day, trying to run this program (which would perform fits of spectrum data) and finding all kinds of weird runtime errors
[11:05:08] <_joe_> because his home machine used gcc 2.x vs gcc 3.x or something
[11:05:17] _joe_: ow :)
[11:05:25] <_joe_> I realized it after like 1 hour in gdb
[11:07:00] <_joe_> the gist being: involuntary evil is usually more vicious :P
[11:11:52] Yeah. Hopefully he learned his lesson :)
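(For the record, the "drop 1% of the packets" variant joked about around 10:59 would look roughly like the sketch below. It uses the standard iptables "statistic" match in random mode; it is illustration only, not something anyone intends to run.)

    # Sketch only: probabilistically drop ~1% of outbound SMTP packets,
    # as joked about above. Do not run this on a real mail host.
    iptables -I OUTPUT 1 -p tcp --dport 25 \
        -m statistic --mode random --probability 0.01 -j DROP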
[13:37:10] is there documentation on how we handle third-party debian packages? e.g. ones that come from a github repo's releases
[13:40:29] <_joe_> kormat: you mean upstream has a debian directory?
[13:40:41] _joe_: i mean upstream produces binary .debs
[13:40:47] <_joe_> we don't use them
[13:40:54] in this case https://github.com/openark/orchestrator/releases/tag/v3.2.3
[13:41:18] <_joe_> but I'll let moritzm make the argument in my place :)
[13:41:37] _joe_: https://wikitech.wikimedia.org/wiki/APT_repository#Repository_Structure implies that at least _some_ binary debs are synced in
[13:41:56] <_joe_> kormat: from other apt repositories IIRC
[13:42:04] ah i see
[13:42:20] <_joe_> other stuff is usually rebuilt, but again, I'm far from being authoritative on the topic
[13:42:31] <_joe_> we've got people with much more experience than me
[13:42:45] * _joe_ looks at all the debian developers in the room
[13:44:15] * ema leaves the room
[13:53:10] <_joe_> coward
[13:53:23] not the authority on this by any means either, but I asked the same question and was told we don't use third-party debs. so I rebuilt (and have been rebuilding) the package as well
[13:57:24] so we'll sync in third-party debs from third-party apt repos, but we won't upload third-party debs that don't come from apt repos?
[13:59:15] Analytics imports from an upstream Apt repo using reprepro. But we don't hand-import individual .debs (I think that's what you're asking about)
[13:59:28] klausman: 👍
[14:06:02] <_joe_> kormat: correct, that was what I was saying
[14:06:16] <_joe_> the reason is we can trust gpg sign... oh, wait
[14:06:17] _joe_: that seems.. arbitrary?
[14:06:39] <_joe_> kormat: not really. packages in an apt repo need to be signed with a key you trust
[14:06:50] in this case upstream uses a dockerfile, a bunch of build scripts, and fpm to create the packages
[14:07:09] <_joe_> oh, god. why is everything like this
[14:11:00] so building on deneb might not even work
[14:11:25] you might need something similar to envoy's build
[14:11:27] (and would probably not be worth much anyway)
[14:11:36] cdanis: *braces self* oh?
[14:11:49] a WMCS VM
[14:11:54] ...
[14:12:51] <_joe_> kormat: discuss it with the security people
[14:13:17] <_joe_> I think this is not worth the effort of not importing binary debs, but others might have dissenting opinions
[14:13:48] it feels very make-work'y
[14:14:10] if the github account is not trustable, building our own binaries solves nothing
[14:14:32] <_joe_> kormat: it's a bit subtler than that, but again, it's cost/rewards as usual
[14:14:54] <_joe_> I'd import those binary debs, if I had to choose. But again, let's hear from others
[14:15:10] ah. i misread your double-double negative
[14:15:14] i'll open a task.
[14:15:15] Yeah, importing them would save lots of time
[14:17:58] https://phabricator.wikimedia.org/T266023 for anyone who wants to popcorn-gallery
[14:25:57] godog: o/
[14:26:04] > The canonical url is https://thanos-swift.discovery.wmnet
[14:26:41] godog: my knowledge of swift is covered in spider webs. To reach it from the analytics VLAN, we need to add rules to the network ACLs
[14:26:46] for which we need IPs (I think)
[14:26:57] to write to swift, do we just need that discovery IP?
[14:27:01] or do we need all the nodes?
[14:28:04] i guess since this is just for testing
[14:28:14] we can just use the eqiad thanos swift cluster? (assuming there are 2?)
[14:29:21] ottomata: I believe you should add both thanos-swift.svc.codfw.wmnet. and thanos-swift.svc.eqiad.wmnet.
-- traffic might need to go cross-cluster in the event of maintenance on the eqiad thanos-swift
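(A minimal sketch of the ACL prep step discussed above, "for which we need IPs": resolving the two service names cdanis mentions to the addresses that would go into the analytics network ACLs. The service names are taken from the discussion; dig is just one way to do the lookup.)

    # Sketch: resolve the two thanos-swift service names mentioned above
    # to the IPs needed for the analytics network ACL rules.
    for svc in thanos-swift.svc.eqiad.wmnet thanos-swift.svc.codfw.wmnet; do
        printf '%s -> %s\n' "$svc" "$(dig +short "$svc")"
    done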
[14:32:30] ottomata: and yeah, as usual with LVS services, you only need to add the service IP
[14:33:08] ottomata: ah yes of course, what cdanis said :)
[14:34:21] ottomata: thanos-swift is a little different in the sense that there's one multi-site cluster, but yes, still two service IPs
[14:38:08] godog: and the client only needs to talk to those IPs?
[14:38:18] no parallel node uploads or anything?
[14:38:44] oh c danis just said yes ok
[14:39:51] oh, godog ports?
[14:39:56] just 80?
[14:40:10] ottomata: 443 only
[14:40:12] ok
[14:40:27] I'll update the task
[14:40:35] just did too...
[14:40:43] haha! even better
[14:41:04] <_joe_> ottomata: you should add listeners to envoy, and use it to connect to thanos-swift I guess
[14:41:10] <_joe_> if you're running in k8s
[14:41:41] _joe_: naw this is for search to test storing flink snapshots and checkpoints in swift
[14:41:47] they're currently running in hadoop
[14:41:52] so from analytics vlan -> swift
[14:42:12] https://phabricator.wikimedia.org/T246004
[17:37:33] Can someone help me sanity check something real quick? I'm trying to figure out why https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=search.svc.eqiad.wmnet&service=ElasticSearch+shard+size+check+-+9243 went critical
[17:37:48] it looks like in the puppet repo the code for that alert lives here: https://github.com/wikimedia/puppet/blob/9b48f28e08d50417c2029481d9fb8a0897c6f1ea/modules/icinga/manifests/monitor/elasticsearch/base_checks.pp#L35-L41
[17:38:10] Yet the alert is on port `9243` whereas only `9200` is defined here: https://github.com/wikimedia/puppet/blob/9b48f28e08d50417c2029481d9fb8a0897c6f1ea/modules/icinga/manifests/monitor/elasticsearch/base_checks.pp#L6
[17:39:09] That would seem to imply there's another codepath hiding somewhere, but when I ripgrep for a portion of the alert description, `ElasticSearch shard size check`, only that one file pops up
[17:40:21] ryankemper: so, the [9200] in that file is only a default -- it seems likely that base_checks is being instantiated in multiple places
[17:40:42] Thanks, yeah I just happened to stumble across `modules/profile/manifests/elasticsearch/monitor/base_checks.pp` which appears to override it
[17:41:16] (https://github.com/wikimedia/puppet/blob/9b48f28e08d50417c2029481d9fb8a0897c6f1ea/modules/profile/manifests/elasticsearch/monitor/base_checks.pp#L9-L11)
[17:42:03] Okay, so if this alert fired, that must mean that `lookup('profile::elasticsearch::monitor::shard_size_warning'` is actually returning a value, so it's not using the default instantiated later in this line: https://github.com/wikimedia/puppet/blob/9b48f28e08d50417c2029481d9fb8a0897c6f1ea/modules/profile/manifests/elasticsearch/monitor/base_checks.pp#L2
[17:42:27] Okay I think I can take it from here, thanks for taking a look
[17:42:37] I see that being overridden in hiera, but only for some of the logstash roles
[17:42:49] Hmm
[17:43:53] When I deployed it yesterday I re-forced checks on all the warnings that were firing at the time, and those warnings went away, which told me the new warning threshold took
[17:44:26] So that's why I was pretty surprised when the critical fired today... unless the logstash role is used for some of the cirrus instances too
[18:01:12] ryankemper: it's not the case that puppet is disabled on those hosts, right?
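(A minimal sketch of one way to answer that question on an individual host. agent_disabled_lockfile is a standard Puppet agent setting; whether these particular hosts use the default location, or some wrapper tooling instead, is an assumption here.)

    # Sketch: check whether the Puppet agent has been administratively
    # disabled on a host by looking for the agent's disable lockfile.
    lockfile=$(sudo puppet agent --configprint agent_disabled_lockfile)
    if sudo test -e "$lockfile"; then
        echo "puppet agent is disabled: $(sudo cat "$lockfile")"
    else
        echo "puppet agent is enabled"
    fi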
[18:03:16] cdanis: I'm not aware of that, but that's an interesting theory because it could explain the behavior
[18:03:25] How do I figure out which actual host ran this check? https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=search.svc.eqiad.wmnet&service=ElasticSearch+shard+size+check+-+9243
[18:03:35] It just lists `search.svc.eqiad.wmnet` as the host in that case
[18:03:58] But that hostname doesn't expose an ssh port, so I'm not sure what the real host is
[18:05:03] Would it be any host that requires `icinga::monitor::elasticsearch::base_checks` or something?
[18:07:42] so, there are actually two things packed into that question, one being 'where do the queries to that virtual service IP go', and the other being, 'how does icinga know about that as a host'
[18:08:00] the short answer in both cases is "via a pile of stuff instantiated in Puppet"
[18:09:03] there's an... interesting connection between port numbers and various service names here
[18:09:21] so the easiest way to answer the former, fully, is looking at lvs1017.eqiad.wmnet `sudo ipvsadm -L`
[18:09:30] TCP search.svc.eqiad.wmnet:9243 wrr
[18:09:32] -> elastic1044.eqiad.wmnet:9243 Route 10 0 0
[18:09:34] -> elastic1045.eqiad.wmnet:9243 Route 10 0 0
[18:09:36] -> elastic1053.eqiad.wmnet:9243 Route 10 0 0
[18:09:38] [...]
[18:11:27] so I haven't looked at how that monitoring script is written yet, but it likely hits a random one of that pool of machines
[18:12:03] as for the second question, it's pretty common to provide a service IP name as a 'host' to icinga, and it's likely that the checks in this case are instantiated on the icinga host directly via an include
[18:13:04] although I haven't actually found where that happens yet
[18:16:25] it might be that all the nodes include ::profile::elasticsearch::monitor::base_checks (likely in the role), and then that creates a bunch of duplicated exported resources for icinga
[18:19:11] thanks, the above was quite helpful
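(A rough way to reproduce by hand what a shard size check would be looking at, using the service endpoint discussed above. The _cat/shards call is standard Elasticsearch; the exact query and thresholds used by the actual Icinga plugin are assumptions here, as is TLS on the 9243 endpoint.)

    # Sketch: list the largest shards behind search.svc.eqiad.wmnet:9243
    # (the LVS service from the ipvsadm output above); any backend in the
    # pool answers for the whole cluster, so the service name is enough.
    curl -s 'https://search.svc.eqiad.wmnet:9243/_cat/shards?h=index,shard,store&bytes=gb' \
        | sort -k3 -n -r | head -20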