[08:47:01] vgutierrez, it turns out I have 5 different installs of cryptography on my system - 3 from pip (2.7, 3.5, 3.6) and 2 from apt (2.7 and 3.x). wonder if that caused that weird import issue
[08:47:56] dunno, but still.. so far we only need the ocsp part of x509 in ocsp.py
[08:47:56] so..
[08:48:03] it doesn't hurt :)
[08:48:31] yeah
[09:53:23] 10Traffic, 10Wikimedia-Apache-configuration, 10DNS, 10Operations, 10Patch-For-Review: Sundown aliases `minnan` and `zh-cfr` for `nan`/`zh-min-nan` - https://phabricator.wikimedia.org/T230382 (10Peachey88) Have there been any checks on the usage of these aliases yet?
[10:31:26] hi traffic, could someone familiar with the lvs::monitoring class review https://gerrit.wikimedia.org/r/c/operations/puppet/+/529362
[10:31:36] how I got to this conclusion is described in https://phabricator.wikimedia.org/T229621#5405106
[14:18:20] that seems like an odd way to fix the monitoring, but maybe it's the only way?
[14:18:50] I also wonder what effect all the duplicate addresses will have on the rest of the config that derives stuff from that data (pybal config, and loopback interface config on LVS and the service hosts)
[14:20:51] yeah, looking at the last PCC output on the patch, it seems to multiply out all the defined services in pybal
[14:21:02] (all the services are defined for all of the other services' ports)
[14:27:26] vgutierrez, bblack: can you spare a few minutes to look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/529362 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/529053
[14:27:33] ?
[14:29:19] onimisionipe: see my comments just above - I don't think that approach is going to work right for the functional (non-monitoring) config.
[14:30:24] Oh.. I didn't follow before. sorry
[14:31:39] there was another way, but that involved some hacks: https://gerrit.wikimedia.org/r/c/operations/puppet/+/528885
[14:33:00] remind me, because I've been away almost a whole week and don't recall where we left things or know what may have changed since...
[14:33:17] with what's currently already deployed, what are the actual issues?
[14:34:33] the issue is that cloudelastic is configured with IPv4 and IPv6 and exposes six (6) different ports, all of which are on/via LVS, and there's a need to monitor these ports
[14:36:10] our lvs::monitor_http_https has some trouble handling this case, as it deviates from the normal scenarios
[14:36:20] does it ring a bell?
[14:41:49] bblack: thanks for the review, I was worried that it would conflict with some other code path :(
[14:42:33] so, reviewing mentally out loud here: we should end up with two kinds of monitoring, I assume:
[14:43:02] 1) icinga checks of the LVS IP on all of the ports for all of the 6 service variants, for IPv4 and IPv6 (12 total checks)
[14:43:22] 2) icinga checks on the cloudelastic100[1-4] nodes for all of the ports, but on their own per-machine IPv4 and IPv6
[14:44:25] I'm guessing (2) either already worked easily or we haven't addressed it yet, and I think for (1), at least initially, we ended up with icinga checking only IPv4 and 1/6 ports heh.
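[editor's note] For context on (1) above: below is a rough, illustrative Puppet sketch of what per-port checks against the LVS service IP could look like if declared directly with the monitoring::service define and the check_ssl_on_port command that comes up later in this log, rather than via lvs::monitor_http_https. It is not taken from any of the patches under review; the port numbers are placeholders (the six real cloudelastic ports aren't listed in this log) and the parameter names are assumptions.

    # Three placeholder ports shown; cloudelastic actually exposes six.
    # This covers only the IPv4 half of the 12 checks counted above;
    # the IPv6 half would need its own *_ipv6 monitoring host (sketched later in this log).
    ['9243', '9443', '9643'].each |String $port| {
      monitoring::service { "cloudelastic-https-${port}":
        description   => "cloudelastic HTTPS on port ${port}",
        host          => 'cloudelastic.wikimedia.org',    # the LVS service name (IPv4)
        check_command => "check_ssl_on_port!cloudelastic.wikimedia.org!${port}",
      }
    }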
[14:46:12] checking up on (2) right quick because it seems like the simpler problem
[14:47:07] yeah, (2) seems configured correctly in icinga
[14:47:08] I don't think we have (2) in place
[14:47:13] it's there
[14:48:21] I only see SSL checks on those ports
[14:48:22] and (1) seems to still be in the state I think I last saw it: there's only a single check defined for IPv4 and 1/6 ports, and on top of that the IPv4 address it's trying to contact is the one belonging to the icinga1001 machine heh
[14:48:44] yes, SSL checks are what we need, right?
[14:49:32] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=cloudelastic1003
[14:50:15] (those report the certificate status in the detailed info, but they get that info by connecting to the appropriate port, so they're checking the protocol level as well)
[14:50:37] close but not really. the LVS checks would guarantee the HTTPS ports are OK
[14:50:47] yes.. correct
[14:51:18] that's all I'm really looking for in both cases - the HTTPS check, I think
[14:53:29] here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/elasticsearch/manifests/tlsproxy.pp#42
[14:55:01] maybe we should add the check port there as well
[14:55:06] it is there
[14:55:15] oh, to the title you mean?
[14:55:31] not the title, that check is check_ssl_on_port
[14:55:50] check_command => "${check_command}!${server_name}!${tls_port}",
[14:56:05] it's there, I was looking at the final output in icinga1001's config files too, they really do monitor the right ports
[14:56:14] Ok
[14:56:57] I have a meeting for the next hour, but I'll jump back on this afterwards and see what I can figure out
[14:57:06] the check does not check the service (it kinda does, but that's not the intention) and I bet the frequency is low
[14:57:10] Ok
[14:57:53] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/528885 seems like it's probably cleaner (given all the general problems in LVS monitoring puppetization as a background), assuming it actually works as intended.
[14:58:05] will poke at it when I get back
[14:59:22] frequency is perfect!
[15:00:17] Ok. Thanks!
[15:11:22] ok, my 1-hour meeting slot ended quickly!
[15:12:00] onimisionipe: ok, so frequency on the host-level checks is ok?
[15:13:13] yep
[15:13:40] frequency is OK for the SSL checks. It matches the frequency of checks for LVS
[15:18:38] gehel: while I agree that the approach isn't ideal in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/528885/ , I think this is the closest to a sane factoring we can get without first fixing deep stuff in the LVS puppetization itself, which is likely to be a perilous process given all the other things relying on the current hacky crap.
[15:19:11] gehel: if you're ok with it, can you pull your C-2 and we can give it a try and see if it makes the monitoring checks look sane in practice?
[15:24:37] bblack: jbond42 had something that looked like a better solution, lemme find the CR
[15:24:52] https://gerrit.wikimedia.org/r/c/operations/puppet/+/529362
[15:25:34] some writing on what he investigated: https://phabricator.wikimedia.org/T229621#5405106
[15:26:10] gehel: yeah, I've seen it, but it creates problems for all the non-monitoring stuff those IPs define (e.g. it creates 72 new pybal-configured LVS services that shouldn't exist on the LVS hosts).
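[editor's note] A hedged reconstruction of the per-host check discussed above, based on modules/elasticsearch/manifests/tlsproxy.pp (linked at 14:53:29). Only the check_command line is quoted verbatim in the log (14:55:50); the define's signature, the check-command variable, and the other resource attributes are assumptions added for illustration, not the real manifest.

    define elasticsearch::tlsproxy (
      String  $server_name = $title,
      Integer $tls_port    = 9243,      # placeholder default, not the real port
    ) {
      # the check named at 14:55:31 above
      $check_command = 'check_ssl_on_port'

      monitoring::service { "elasticsearch-tls-${title}":
        description   => "TLS on ${server_name}:${tls_port}",
        check_command => "${check_command}!${server_name}!${tls_port}",   # quoted at 14:55:50
      }
    }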
[15:26:18] gehel: it seems that my CR is not compatible with how other parts of puppet use that config
[15:26:36] yeah, that's the part I wasn't sure about, all those changes in pybal.conf
[15:27:03] Ok, I removed my +2
[15:27:14] want me to merge it and see what breaks?
[15:27:45] gehel: I can take care of it, I'm already staring at all the related bits
[15:27:53] bblack: thanks!
[15:28:04] I just didn't want to unilaterally override a -2 :)
[15:28:19] he, he, he...
[15:28:23] :)
[15:28:54] * onimisionipe crosses his fingers
[15:35:00] bblack: fyi, I'm going to depool eqsin for ~1h max (cr2-eqsin upgrade) - https://gerrit.wikimedia.org/r/c/operations/dns/+/529966
[15:35:07] XioNoX: ack, ok
[15:41:02] onimisionipe: seems to have improved the situation at least. We now have 6 monitored services on cloudelastic.wikimedia.org in icinga, using the right IPv4 and the right port numbers.
[15:41:36] onimisionipe: but, of course, it's IPv4-only. We should probably add IPv6 checks to the same place and make it even uglier.
[15:42:10] lol
[15:42:21] I think IPv4 is ok for now
[15:42:35] ok
[15:43:30] if you want to go down that road, what we did for other public services was basically make another nagios monitoring::host definition for a name like cloudelastic.wikimedia.org_ipv6 with the v6 IP and attach all the same services to that as well
[15:44:51] the new checks came up green!
[15:44:54] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=cloudelastic.wikimedia.org
[15:53:40] I'm not sure I want to
[15:54:03] I'll leave this for now and see if there's a need for IPv6 (my opinion)
[15:54:10] and yay! to green checks
[16:21:54] 10Traffic, 10MediaWiki-API, 10Operations, 10Wikidata, and 2 others: wikidata.org handles GET MWAPI requests, but silently fails on POST - https://phabricator.wikimedia.org/T230051 (10Anomie) There's nothing to do with #MediaWiki-API here, or with MediaWiki at all. Any request to wikidata.org is served a 30...
[18:59:31] 10Traffic, 10Operations, 10SRE-Access-Requests: SRE Onboarding for Sukhbir Singh - https://phabricator.wikimedia.org/T229860 (10CDanis) Is this done?
[19:12:19] 10Traffic, 10Operations, 10SRE-Access-Requests: SRE Onboarding for Sukhbir Singh - https://phabricator.wikimedia.org/T229860 (10BBlack) 05Open→03Resolved Looks like it to me :)
[19:35:46] ^ great :)
[19:45:17] XioNoX: wanted to call out this planned maintenance in AMS that asks we disable lasers before work begins -- https://groups.google.com/a/wikimedia.org/forum/#!topic/ops-maintenance/vEs6fPW04ms
[19:45:22] XioNoX: should I file a ticket about that?
[19:47:30] cdanis: yeah, go ahead; it means cutting 2 of our 3 links to knams, so I think we should stop advertising our prefixes from there during the work
[19:47:44] k will do, and will assign to you
[19:47:52] thanks!
[19:54:55] I also noticed we have overlapping maintenances on different links from Zayo and from CenturyLink: https://groups.google.com/a/wikimedia.org/d/msg/ops-maintenance/6Kl-_ml5gXc/VJvR9bGHEgAJ vs https://groups.google.com/a/wikimedia.org/d/msg/ops-maintenance/q2wgFNt3-U0/eRcQ1avoFQAJ
[19:55:22] the CenturyLink one is an IAD/AMS link, not sure about the Zayo one except it looks DFW-ish
[20:01:54] also XioNoX did you see this link down report from Equinix DA1? https://groups.google.com/a/wikimedia.org/d/msg/ops-maintenance/jGuPVjKzW1M/Fk_Q5H4fFgAJ
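[editor's note] A minimal sketch of the IPv6 approach bblack describes at 15:43:30 above: a second nagios/icinga monitoring::host entry named cloudelastic.wikimedia.org_ipv6 carrying the v6 address, with the same service checks attached to it. The parameter names, the documentation-range v6 address, and the port are placeholders/assumptions, not values taken from the repo.

    monitoring::host { 'cloudelastic.wikimedia.org_ipv6':
      ip_address => '2001:db8::1',    # placeholder; the real LVS service v6 address would go here
    }

    # the same per-port checks as on the IPv4 host, re-attached to the _ipv6 entry
    monitoring::service { 'cloudelastic-https-9243_ipv6':    # placeholder port
      description   => 'cloudelastic HTTPS on port 9243 (IPv6)',
      host          => 'cloudelastic.wikimedia.org_ipv6',
      check_command => 'check_ssl_on_port!cloudelastic.wikimedia.org!9243',
    }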
[20:03:09] oh, both of these must be your cr maintenances this morning
[20:03:17] nevermind :)
[20:05:16] cdanis: yeah, the 2nd one is eqiad-codfw so the overlap isn't an issue
[20:05:44] that was my suspicion -- thanks!
[20:08:27] 10Traffic, 10netops, 10Operations: Aug 28th: turn off knams lasers & stop advertising prefixes in advance of Relined PA-988002 maintenance - https://phabricator.wikimedia.org/T230448 (10CDanis)
[20:37:16] I'd create a google calendar event for it if my google calendar could load :)
[20:37:43] I tried to reboot, change browsers and change connection, no luck..
[20:41:04] 10Traffic, 10Operations, 10Goal, 10Patch-For-Review, 10Performance-Team (Radar): Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Ladsgroup)
[20:41:32] 10Traffic, 10netops, 10Operations: Aug 28th: turn off knams lasers & stop advertising prefixes in advance of Relined PA-988002 maintenance - https://phabricator.wikimedia.org/T230448 (10ayounsi) That seems to actually be one circuit terminating in two ports on each side: cr2-knams:xe-0/0/3 to asw-esams:xe-0...
[20:42:06] ugh
[20:42:14] your storage backend might just be hosed :\
[20:42:20] OIT could maybe file a ticket on your behalf
[20:42:27] btw there's a calendar event in the maintenance calendar
[20:44:45] yeah, I'll reach out to OIT
[20:45:15] I added an extra event in my calendar to remember to disable the laser 2h before the work
[20:46:29] 10Traffic, 10netops, 10Operations: Aug 28th: turn off 1/3 esams-knams lasers in advance of Relined PA-988002 maintenance - https://phabricator.wikimedia.org/T230448 (10ayounsi)
[21:23:27] 10Traffic, 10netops, 10Operations: Aug 28th: turn off 1/3 esams-knams lasers in advance of Relined PA-988002 maintenance - https://phabricator.wikimedia.org/T230448 (10CDanis)