[00:11:23] re: the iOS page at 22:00 again... we did serve some 429s but not many. I think the threshold is a bit too low, actually
[00:11:51] also it looks like it's still a UA version from before they started smearing the requests over time... I'll ask about that tomorrow, although I suspect the App Store is just being slow to approve
[06:05:18] <_joe_> cdanis: it's usually semi-unpredictable
[06:07:52] <_joe_> (app store approval times)
[10:36:19] elukey, klausman: cron spam to root@ from prometheus-amd-rocm-stats on an-worker1101
[10:36:42] Looking into it
[10:36:47] cheers
[10:38:38] should stop spamming now
[10:57:19] <_joe_> klausman: nothing kill -9 $(pgrep exim) can't solve, right?
[10:57:58] more like iptables -I OUTPUT 1 -m tcp -p tcp --dport 25 -j DROP
[10:59:10] <_joe_> I think that for added fun you could just drop 50% of the packets
[10:59:34] nah, give it 1% for maximum evilness
[11:00:07] <_joe_> well the goal is to prevent emails from being sent
[11:00:20] I always wanted to play this prank: pick a machine a lot of people are doing development on (like a big lab server at a uni). Add a cronjob that sleeps a random number of seconds, checks the process list for anything whose argv[0] starts with "./", and sends it a SIGSEGV, SIGILL or SIGFPE
[11:00:32] ah, and 1% may not be too noticeable with TCP retransmission
[11:00:48] <_joe_> ahahahahahah klausman
[11:01:10] <_joe_> now I'm envious I didn't think about it when I was at university
[11:01:43] <_joe_> that's /really/ evil
[11:02:23] Yeah, even if I had the chance, I would probably not do it.
[11:02:33] Maybe on April 1st
[11:03:02] <_joe_> klausman: but the most evil thing I've seen was a student who didn't want to use a VCS, so he would copy his programs from the uni to home on floppies (yes, I'm old)
[11:03:49] <_joe_> one time he wanted to work on a project, so he copied the files already present on the uni workstation to his home computer, recompiled everything at home, then brought the code back and copied it back
[11:03:54] <_joe_> including the .o files
[11:04:12] <_joe_> recompiled the executable on the workstation, and left for the day
[11:04:47] <_joe_> imagine me the next day, trying to run this program (which would perform fits of spectrum data) and finding all kinds of weird runtime errors
[11:05:08] <_joe_> because his home machine used gcc 2.x vs gcc 3.x or something
[11:05:17] _joe_: ow :)
[11:05:25] <_joe_> I realized it after like 1 hour in gdb
[11:07:00] <_joe_> the gist being: involuntary evil is usually more vicious :P
[11:11:52] Yeah. Hopefully he learned his lesson :)
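(For the record, the "drop 1% of the packets" variant joked about around 10:59 would look roughly like the sketch below. It uses the standard iptables "statistic" match in random mode; it is illustration only, not something anyone intends to run.)

    # Sketch only: probabilistically drop ~1% of outbound SMTP packets,
    # as joked about above. Do not run this on a real mail host.
    iptables -I OUTPUT 1 -p tcp --dport 25 \
        -m statistic --mode random --probability 0.01 -j DROP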
[13:37:10] is there documentation on how we handle third-party debian packages? e.g. ones that come from a github repo's releases
[13:40:29] <_joe_> kormat: you mean upstream has a debian directory?
[13:40:41] _joe_: i mean upstream produces binary .debs
[13:40:47] <_joe_> we don't use them
[13:40:54] in this case https://github.com/openark/orchestrator/releases/tag/v3.2.3
[13:41:18] <_joe_> but I'll let moritzm make the argument in my place :)
[13:41:37] _joe_: https://wikitech.wikimedia.org/wiki/APT_repository#Repository_Structure implies that at least _some_ binary debs are synced in
[13:41:56] <_joe_> kormat: from other apt repositories IIRC
[13:42:04] ah i see
[13:42:20] <_joe_> other stuff is usually rebuilt, but again, I'm far from being authoritative on the topic
[13:42:31] <_joe_> we've got people with much more experience than me
[13:42:45] * _joe_ looks at all the debian developers in the room
[13:44:15] * ema leaves the room
[13:53:10] <_joe_> coward
[13:53:23] not the authority on this by any means either, but I asked the same question and was told we don't use third-party debs. so I rebuilt (and have been rebuilding) the package as well
[13:57:24] so we'll sync in third-party debs from third-party apt repos, but we won't upload third-party debs that don't come from apt repos?
[13:59:15] Analytics imports from an upstream Apt repo using reprepro. But we don't hand-import individual .debs (I think that's what you're asking about)
[13:59:28] klausman: 👍
[14:06:02] <_joe_> kormat: correct, that was what I was saying
[14:06:16] <_joe_> the reason is we can trust gpg sign... oh, wait
[14:06:17] _joe_: that seems.. arbitrary?
[14:06:39] <_joe_> kormat: not really. packages in an apt repo need to be signed with a key you trust
[14:06:50] in this case upstream uses a dockerfile, a bunch of build scripts, and fpm to create the packages
[14:07:09] <_joe_> oh, god. why is everything like this
[14:11:00] so building on deneb might not even work
[14:11:25] you might need something similar to envoy's build
[14:11:27] (and would probably not be worth much anyway)
[14:11:36] cdanis: *braces self* oh?
[14:11:49] a WMCS VM
[14:11:54] ...
[14:12:51] <_joe_> kormat: discuss it with the security people
[14:13:17] <_joe_> I think this is not worth the effort of not importing binary debs, but others might have dissenting opinions
[14:13:48] it feels very make-work'y
[14:14:10] if the github account is not trustable, building our own binaries solves nothing
[14:14:32] <_joe_> kormat: it's a bit subtler than that, but again, it's cost/rewards as usual
[14:14:54] <_joe_> I'd import those binary debs, if I had to choose. But again, let's hear from others
[14:15:10] ah. i misread your double-double negative
[14:15:14] i'll open a task.
[14:15:15] Yeah, importing them would save lots of time
[14:17:58] https://phabricator.wikimedia.org/T266023 for anyone who wants to popcorn-gallery
[14:25:57] godog: o/
[14:26:04] > The canonical url is https://thanos-swift.discovery.wmnet
[14:26:41] godog: my knowledge of swift is covered in spider webs. To reach it from the analytics VLAN, we need to add rules to the network ACLs
[14:26:46] for which we need IPs (I think)
[14:26:57] to write to swift, do we just need that discovery IP?
[14:27:01] or do we need all the nodes?
[14:28:04] i guess since this is just for testing
[14:28:14] we can just use the eqiad thanos swift cluster? (assuming there are 2?)
[14:29:21] ottomata: I believe you should add both thanos-swift.svc.codfw.wmnet. and thanos-swift.svc.eqiad.wmnet.
-- traffic might need to go cross-cluster in the event of maintenance on the eqiad thanos-swift
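(A minimal sketch of the ACL prep step discussed above, "for which we need IPs": resolving the two service names cdanis mentions to the addresses that would go into the analytics network ACLs. The service names are taken from the discussion; dig is just one way to do the lookup.)

    # Sketch: resolve the two thanos-swift service names mentioned above
    # to the IPs needed for the analytics network ACL rules.
    for svc in thanos-swift.svc.eqiad.wmnet thanos-swift.svc.codfw.wmnet; do
        printf '%s -> %s\n' "$svc" "$(dig +short "$svc")"
    done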
[14:32:30] ottomata: and yeah, as usual with LVS services, you only need to add the service IP
[14:33:08] ottomata: ah yes of course, what cdanis said :)
[14:34:21] ottomata: thanos-swift is a little different in the sense that there's one multi-site cluster, but yes, still two service IPs
[14:38:08] godog: and the client only needs to talk to those IPs?
[14:38:18] no parallel node uploads or anything?
[14:38:44] oh c danis just said yes ok
[14:39:51] oh, godog ports?
[14:39:56] just 80?
[14:40:10] ottomata: 443 only
[14:40:12] ok
[14:40:27] I'll update the task
[14:40:35] just did too...
[14:40:43] haha! even better
[14:41:04] <_joe_> ottomata: you should add listeners to envoy, and use it to connect to thanos-swift I guess
[14:41:10] <_joe_> if you're running in k8s
[14:41:41] _joe_: naw this is for search to test storing flink snapshots and checkpoints in swift
[14:41:47] they're currently running in hadoop
[14:41:52] so from analytics vlan -> swift
[14:42:12] https://phabricator.wikimedia.org/T246004
[17:37:33] Can someone help me sanity check something real quick? I'm trying to figure out why https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=search.svc.eqiad.wmnet&service=ElasticSearch+shard+size+check+-+9243 went critical
[17:37:48] it looks like in the puppet repo the code for that alert lives here: https://github.com/wikimedia/puppet/blob/9b48f28e08d50417c2029481d9fb8a0897c6f1ea/modules/icinga/manifests/monitor/elasticsearch/base_checks.pp#L35-L41
[17:38:10] Yet the alert is on port `9243` whereas only `9200` is defined here: https://github.com/wikimedia/puppet/blob/9b48f28e08d50417c2029481d9fb8a0897c6f1ea/modules/icinga/manifests/monitor/elasticsearch/base_checks.pp#L6
[17:39:09] That would seem to imply there's another codepath hiding somewhere, but when I ripgrep for a portion of the alert description, `ElasticSearch shard size check`, only that one file pops up
[17:40:21] ryankemper: so, the [9200] in that file is only a default -- it seems likely that base_checks is being instantiated in multiple places
[17:40:42] Thanks, yeah I just happened to stumble across `modules/profile/manifests/elasticsearch/monitor/base_checks.pp` which appears to override it
[17:41:16] (https://github.com/wikimedia/puppet/blob/9b48f28e08d50417c2029481d9fb8a0897c6f1ea/modules/profile/manifests/elasticsearch/monitor/base_checks.pp#L9-L11)
[17:42:03] Okay, so if this alert fired, that must mean that `lookup('profile::elasticsearch::monitor::shard_size_warning'` is actually returning a value, so it's not using the default instantiated later in this line: https://github.com/wikimedia/puppet/blob/9b48f28e08d50417c2029481d9fb8a0897c6f1ea/modules/profile/manifests/elasticsearch/monitor/base_checks.pp#L2
[17:42:27] Okay I think I can take it from here, thanks for taking a look
[17:42:37] I see that being overridden in hiera, but only for some of the logstash roles
[17:42:49] Hmm
[17:43:53] When I deployed it yesterday I re-forced checks on all the warnings that were firing at the time, and those warnings went away, which told me the new warning threshold took
[17:44:26] So that's why I was pretty surprised when the critical fired today... unless the logstash role is used for some of the cirrus instances too
[18:01:12] ryankemper: it's not the case that puppet is disabled on those hosts, right?
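(A minimal sketch of one way to answer that question on an individual host. agent_disabled_lockfile is a standard Puppet agent setting; whether these particular hosts use the default location, or some wrapper tooling instead, is an assumption here.)

    # Sketch: check whether the Puppet agent has been administratively
    # disabled on a host by looking for the agent's disable lockfile.
    lockfile=$(sudo puppet agent --configprint agent_disabled_lockfile)
    if sudo test -e "$lockfile"; then
        echo "puppet agent is disabled: $(sudo cat "$lockfile")"
    else
        echo "puppet agent is enabled"
    fi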
[18:03:16] cdanis: I'm not aware of that, but that's an interesting theory because it could explain the behavior
[18:03:25] How do I figure out which actual host ran this check? https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=search.svc.eqiad.wmnet&service=ElasticSearch+shard+size+check+-+9243
[18:03:35] It just lists `search.svc.eqiad.wmnet` as the host in that case
[18:03:58] But that hostname doesn't expose an ssh port, so I'm not sure what the real host is
[18:05:03] Would it be any host that requires `icinga::monitor::elasticsearch::base_checks` or something?
[18:07:42] so, there are actually two things packed into that question, one being 'where do the queries to that virtual service IP go', and the other being, 'how does icinga know about that as a host'
[18:08:00] the short answer in both cases is "via a pile of stuff instantiated in Puppet"
[18:09:03] there's an... interesting connection between port numbers and various service names here
[18:09:21] so the easiest way to answer the former, fully, is looking at lvs1017.eqiad.wmnet `sudo ipvsadm -L`
[18:09:30] TCP search.svc.eqiad.wmnet:9243 wrr
[18:09:32] -> elastic1044.eqiad.wmnet:9243 Route 10 0 0
[18:09:34] -> elastic1045.eqiad.wmnet:9243 Route 10 0 0
[18:09:36] -> elastic1053.eqiad.wmnet:9243 Route 10 0 0
[18:09:38] [...]
[18:11:27] so I haven't looked at how that monitoring script is written yet, but it likely hits a random one of that pool of machines
[18:12:03] as for the second question, it's pretty common to provide a service IP name as a 'host' to icinga, and it's likely that the checks in this case are instantiated on the icinga host directly via an include
[18:13:04] although I haven't actually found where that happens yet
[18:16:25] it might be that all the nodes include ::profile::elasticsearch::monitor::base_checks (likely in the role), and then that creates a bunch of duplicated exported resources for icinga
[18:19:11] thanks, the above was quite helpful
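(A rough way to reproduce by hand what a shard size check would be looking at, using the service endpoint discussed above. The _cat/shards call is standard Elasticsearch; the exact query and thresholds used by the actual Icinga plugin are assumptions here, as is TLS on the 9243 endpoint.)

    # Sketch: list the largest shards behind search.svc.eqiad.wmnet:9243
    # (the LVS service from the ipvsadm output above); any backend in the
    # pool answers for the whole cluster, so the service name is enough.
    curl -s 'https://search.svc.eqiad.wmnet:9243/_cat/shards?h=index,shard,store&bytes=gb' \
        | sort -k3 -n -r | head -20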