[09:24:04] so yeah the issue with lib/libvcc/generate.py is that if we add a patch changing its contents (like transaction_timeout) then we'll end up with NOGIT in the varnishabi-strict- package name [09:24:55] instead of varnishabi-strict-$(git show -s --pretty=format:%h) [09:28:22] varnishabi-strict-$gitrevnum is used by old-school vmods that need to depend on a specific version of varnish (mbleah, we do have package versions for that!) [09:28:44] so yeah tl;dr I've added a lintian override and merged [10:18:35] pinkunicorn upgraded, I've set transaction_timeout to 0.0001 and got some nice 503s :) [10:40:28] 5.1.3-1wm2 available on apt.w.o for cache_misc upgrades [10:44:37] moritzm: ok to start LVS upgrades/reboots too? [10:56:34] yeah, sounds good. dist-upgrade should also be fine there [10:56:49] ack [12:49:23] 10Traffic, 10Operations, 10Wikidata, 10wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#3744190 (10Dzahn) 05Open>03stalled [12:58:37] 10Traffic, 10Operations, 10Wikidata, 10wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#3744205 (10Dzahn) @Ladsgroup Hi, i saw your IRC ping and continued working on this (see above). Though.. now we'll have to talk about the cer... [13:13:13] XioNoX: could you take a peek at some router ACL stuff for me later (not urgent): crN-eqiad has terms named phab-git-ssh to allow port 22 to git-ssh.eqiad.wikimedia.org, in the border-in4, labs-in4, border-in6 sets (seems like it should have labs-in6 too?)... 
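[The NOGIT fallback discussed at the top of this log can be illustrated with a small sketch. This is a hypothetical re-implementation, not varnish's actual build glue: the real logic lives in the build scripts, and `abi_suffix` is an invented helper name. It just shows the shape of the fallback: use the git short hash when available, else NOGIT.]

```python
# Hypothetical sketch of the ABI-suffix fallback discussed above (NOT the
# actual varnish build code): embed the git short rev into the strict-ABI
# identifier, falling back to "NOGIT" when the revision can't be determined,
# e.g. when building from a patched tarball outside a git checkout.
import subprocess

def abi_suffix(srcdir="."):
    """Return 'varnishabi-strict-<rev>', with NOGIT as the fallback."""
    try:
        rev = subprocess.check_output(
            ["git", "show", "-s", "--pretty=format:%h"],
            cwd=srcdir, stderr=subprocess.DEVNULL).decode().strip()
    except (subprocess.CalledProcessError, OSError):
        rev = ""
    return "varnishabi-strict-%s" % (rev or "NOGIT")
```

[Building from a directory with no usable git metadata takes the NOGIT branch, which is what tripped the lintian check on the patched package.]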
[13:13:43] XioNoX: and then relatedly, we need the same setup on crX-codfw for the git-ssh.codfw.wikimedia.org IPs (new) [14:05:17] 10Traffic, 10Operations: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456#3744448 (10BBlack) Well there's two different actions to get through here: First is upgrade tlsproxy hosts to `1.13.6-2+wmf1` (but still on existing `nginx-full` packages) - seamless, shouldn't require any depooling.... [14:32:32] ema: when you get a sec can you stare at https://gerrit.wikimedia.org/r/#/c/389739/ a bit and see if you can catch stupid mistakes (e.g. dumb wrong-numbering between forward+reverse, or accidental use of "ulsfo" in place of "eqsin" when copypasting, etc)? [14:33:15] bblack: sure [14:33:28] exciting patch! [14:36:23] (and yeah, lvs5 don't have ipv6 for their main hostnames. this is just matching existing practice, separate issue to fix up globally at some point) [14:37:51] bblack: https://gerrit.wikimedia.org/r/#/c/389739/3/geo-maps line 214 [14:38:02] shouldn't that line list eqsin first? [14:38:56] heh yeah I just pushed a diff for that as you were saying it :) [14:39:07] nice :) [14:51:17] always fascinating to stare at the geoip file [14:51:33] VN => [ulsfo, codfw, eqiad, [-esams],-]{+esams, eqsin],+} # Viet Nam [14:52:06] is codfw really to be preferred to esams? [14:52:30] we've never gotten into that level of detail I don't think (but I could be wrong) [14:52:42] as in, I think everything with ulsfo-first has codfw-second [14:53:00] oh ok [14:53:18] ideally we should be mapping that out per-country based on latency [14:53:28] some would be [ulsfo, codfw, ...] and some [ulsfo, esams, ...] 
and such [14:54:02] or we could get rid of most of the map except where overrides are warranted and let the automatic-distance-mapping mode do it for us [14:54:24] one of those back-burner things that doesn't have a ticket is to compare how close at least our first choice is to the auto mode [14:55:02] as the complexity of this rises, I think auto-mode (perhaps with some enhancements?) is eventually going to become necessary [14:55:14] it gets really hard to manually calculate an optimal map for all countries + 5x DCs [14:55:20] (it's not easy with even 4) [14:57:25] but distance is probably a reasonable loose proxy for latency anyways, except for some edge cases [14:58:04] for the long long term, we really need something closer to https://phabricator.wikimedia.org/T94697 [15:02:33] and when they announce new cables it makes it even more complicated, like yesterday's announcement that can be summarized in this image: https://techcentral.co.za/wp-content/uploads/2017/11/seaborn-1078.jpg [15:05:33] bblack: patch looks reasonable! [15:08:52] thanks! [15:10:50] woot, now "ping cp5001.eqsin.wmnet" fails in a completely different way than it did yesterday! [15:12:13] :) [15:19:07] 10Traffic, 10Operations, 10Phabricator: Please create a phame blog for the Traffic team - https://phabricator.wikimedia.org/T180041#3744710 (10ema) [15:19:16] 10Traffic, 10Operations, 10Phabricator: Please create a phame blog for the Traffic team - https://phabricator.wikimedia.org/T180041#3744725 (10ema) p:05Triage>03Normal [15:21:54] 10Traffic, 10Operations, 10Patch-For-Review: Allocate address space for Singapore (APNIC) - https://phabricator.wikimedia.org/T156256#3744763 (10BBlack) Status updates? >>! In T156256#3699583, @faidon wrote: > Things pending: > - RPKI, @ayounsi has sent the extra ToS to legal for review and they said they m... 
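[On the automatic-distance-mapping idea from the geo-map discussion above: a toy sketch of ranking datacenters for a client location by great-circle distance. This is not gdnsd's actual auto mode, and the DC coordinates below are rough approximations added for illustration.]

```python
# Toy sketch (not the real auto-mapping mode): rank datacenters for a client
# (lat, lon) by great-circle distance. DC coordinates are rough guesses.
from math import radians, sin, cos, asin, sqrt

DCS = {
    "eqiad": (38.95, -77.45),   # Ashburn (approx.)
    "codfw": (32.90, -96.75),   # Dallas (approx.)
    "ulsfo": (37.61, -122.39),  # San Francisco (approx.)
    "esams": (52.31, 4.77),     # Amsterdam (approx.)
    "eqsin": (1.35, 103.99),    # Singapore (approx.)
}

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in km."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def dc_preference(client):
    """Return DC names sorted nearest-first for a (lat, lon) client."""
    return sorted(DCS, key=lambda dc: haversine_km(client, DCS[dc]))
```

[For a point in Vietnam this puts eqsin first, matching the new geo-maps entry; as noted in the discussion, distance is only a loose proxy for latency, and undersea cable routes can make the real answer quite different.]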
[15:27:32] bblack: RPKI is done and just pinged mark about his account [16:02:24] 10Traffic, 10Operations: Network hardware configuration for Asia Cache DC - https://phabricator.wikimedia.org/T162684#3744898 (10faidon) [16:02:27] 10Traffic, 10Operations, 10Patch-For-Review: Configuration for Asia Cache DC hosts - https://phabricator.wikimedia.org/T156027#3744899 (10faidon) [16:02:30] 10Traffic, 10Operations, 10Patch-For-Review: Allocate address space for Singapore (APNIC) - https://phabricator.wikimedia.org/T156256#3744896 (10faidon) 05Open>03Resolved RPKI is all done as far as I know. @mark said he'll create his account later, if at all. I think we can resolve. [16:10:41] all LVSs upgraded and rebooted [16:10:46] moritzm: ^ [16:12:42] \o/ [16:15:06] ~30 cache nodes left, I'm going through text/upload interleaved with ~10 minutes spacing [16:15:12] a bit slow maybe as an approach :) [16:15:23] great! [16:16:22] out of curiosity what do you do for rolling restarts? [16:16:30] (collective you) [16:16:51] I can't tell you in public because vo.lans is listening [16:17:05] lol [16:17:29] we do scary hacky things that cut corners, for the most part [16:17:54] eheheh [16:17:57] see? 
[16:18:06] :-P [16:18:27] heheh I'm sure it is somewhat better than what I'm doing, namely ssh -> reboot [16:18:46] godog: well, it's that but in a while loop [16:18:51] so mostly it's my fault, I need to find the time to work on the "spinoff" from switchdc, to convert into a piece of cake a bunch of calls to conftool and cumin and such [16:18:54] I know with salt at one point I developed a habit of using "at" for the reboot part, so that it didn't fail/hang salt [16:19:17] I think last time around I still did that with cumin, but probably not necc [16:19:43] as in: "echo reboot | at now + 1 minute" [16:19:48] godog: you can also use cumin with batch 1 sleep as needed and using the 'reboot-host' script [16:19:51] https://wikitech.wikimedia.org/wiki/Cumin#Reboot [16:20:00] to detach and postpone the actual reboot from the ssh command execution -> connection-end [16:20:26] the reboot-host does: nohup reboot &> /dev/null & exit [16:20:36] neat thanks bblack / volans [16:20:40] that is basically the same but without the need to wait 1 min ;) [16:21:08] yeah, assuming you win the race, but I think most hosts can't shut down faster than the shell can exit :) [16:21:08] I personally have a script that sets icinga downtime, does the needful on the box, reboots it and then waits for the host to come back online before exiting [16:21:44] we also have some depool->repool magic, but it has bitten us before, too, I think it was probably ill-considered [16:22:02] * volans will try harder to find the time for that spinoff... 
is really needed [16:22:36] basically we have a custom systemd service unit file called "traffic-pool", the core of which does: [16:22:39] [Service] [16:22:41] Type=oneshot [16:22:44] RemainAfterExit=yes [16:22:46] ExecStart=/bin/sh -c 'if test -f /var/lib/traffic-pool/pool-once; then rm -f /var/lib/traffic-pool/pool-once; sleep 45; /usr/local/bin/pool; fi' [16:22:49] ExecStop=/usr/local/bin/depool ; /bin/sleep 45 [16:23:09] and that service has an After= dependency on the nginx + varnish services. [16:23:26] so on shutdown, it's automatic that the host will depool itself and sleep 45s before stopping nginx+varnishd [16:23:51] and on startup, it will auto-repool itself if you touched /var/lib/traffic-pool/pool-once before shutdown [16:24:24] that latter bit... seemed like a nice idea to wrap simple reboots back when we did it [16:24:45] but it can fail spectacularly, because sometimes hosts don't come back from reboot and have a hardware issue [16:25:19] so the host you meant to quickly-reboot is now dead for days/weeks until hardware is replaced, at which point it self-auto-repooled immediately on startup, but has been missing its critical cronjobs and puppet runs ever since and thus is likely in a broken state :/ [16:25:46] it could check the mtime of the /var/lib/traffic-pool/pool-once file [16:25:52] yeah [16:25:56] if older than 1h skip [16:26:25] well, assuming the time is sane, right? [16:26:30] but yeah putting all this logic into the systemd file can lead to funny behaviours [16:26:34] ofc! :D [16:26:38] if the mainboard was replaced it might not be until some ntp clock-step we're not depending on or something [16:28:26] I can see the incident report now: "replacement mainboard arrived two weeks later with the hardware clock set two weeks behind realtime, thus bringing it into service with expired OCSP stapling" [16:29:13] it's probably better just to have pooling always be a manual step [16:29:29] well "manual" as in not from systemd/cron/whatever. 
maybe automated from cumin :) [16:31:16] lol. nice failure [16:33:36] adding higher level criteria for healthchecks might help too with safer rolling reboots [16:34:45] we have those at the icinga level, but there's no safety-valve for "icinga checks must come back ok before repooling" other than human eyes [16:35:16] but the stuff that looks at our TLS termination per-cache-host also checks things like OCSP Stapling freshness (one of the fallouts of cron not running forever) [16:36:17] SSL OK - OCSP staple validity for en.wikipedia.org has 319690 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2018-11-22 07:59:59 +0000 (expires in 378 days) [16:36:22] e.g. ^ [16:37:53] I think there are probably fundamental problems with that whole pattern of how we monitor and feedback these cases, but it can be hard to move away from the patterns that industry tooling is expecting to operate under [16:38:01] (if that makes any sense heh) [16:39:12] like, I really think puppetization should automatically include deploying some health-checking of everything it deploys. Some of it automatically-generated checks based on the files/services/etc deployed. Some of it just a differently factored version of how we couple services + icinga checks in puppetization today. [16:39:38] but all the per-host checks that are running via icinga, they should all run locally, independently of icinga. [16:40:16] icinga should just be monitoring the global public states (e.g. checks on https://en.wikipedia.org/, and doing one NRPE check per server that pulls back the overall OK-or-not stats of all that detailed local checking) [16:40:36] which can also just be run locally to see what's wrong, and can be hooked into local dependencies. [16:41:07] e.g. 
if the health-checking of sane varnishd behavior on port 3128 doesn't work, that should be a dependency for the nginx service even getting to start [16:41:20] (because the puppetization says nginx depends on varnish in this case) [16:41:28] and ditto for automatic pooling/depooling, etc [16:42:21] (but without having icinga or anything remote in the feedback loop here) [16:42:37] heh in the case above I was thinking of checks as performed by pybal, but indeed it'd be the same logic [16:43:59] the main thing going against icinga's grain I could spot is that a single nrpe/service per host means downtime/silencing can happen only on that single service [16:44:28] yeah the silence/ack stuff would have to be Elsewhere [16:44:42] in whatever this host-local-level monitoring/health/feedback system is [16:45:08] the global monitoring that happens from a remote host just gets a summary state per-host, and does external-viewpoint checks [16:45:19] (of whole services, not individual nodes) [16:45:36] but it's a big break from standard patterns [16:45:50] systemd isn't well-equipped for it, and there's no real infrastructure or design for this kind of thing that I'm aware of. [16:46:46] the pybal-vs-icinga thing is an interesting intersection of it all [16:47:09] we have basic checks in pybal. they could check a lot more, or they could rely on this same host-level check->output for most of it. I'm not sure where you draw the line there. [16:49:06] yeah pulling info from prometheus [16:49:34] tl;dr of this half-baked idea is that there should be a local-health-check infrastructure on hosts. one that can be remotely queried for overall state (like a one-per-host NRPE check) by things like icinga and pybal. but also decomposes locally so that individual lower-level healthchecks are integrated into the service system (e.g. systemd dependencies). 
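[A minimal sketch of the tl;dr idea above: run all host-local checks, let locally-controlled acks suppress individual results, and roll everything up into one worst-state summary that a remote poller like icinga or pybal could fetch. The check names and the ack mechanism here are illustrative, not an existing tool.]

```python
# Sketch of a host-local check rollup (hypothetical, no such tool exists):
# individual check results are aggregated into a single NRPE-style
# (state, message) pair; acked checks are excluded from the rollup, so
# silencing is controlled locally rather than in icinga.
OK, WARNING, CRITICAL = 0, 1, 2

def summarize(results, acked=()):
    """results: {check_name: (state, message)}. Returns the worst
    non-acked state plus a combined message for remote pollers."""
    worst, messages = OK, []
    for name, (state, msg) in sorted(results.items()):
        if name in acked:
            continue  # local ack/silence: drop from the host-level summary
        if state > worst:
            worst = state
        if state != OK:
            messages.append("%s: %s" % (name, msg))
    return worst, "; ".join(messages) or "all checks OK"
```

[Check names like "varnish-be-3128" or "ocsp-staple" would be auto-generated from the puppetization per the idea above; the same results dict could also drive local systemd dependencies.]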
[16:49:55] and ack/silence/whatever can be controlled locally and affects that whole-host-level output back to upstream monitors like pybal/icinga [16:50:54] and puppetization should mostly-automagically do a lot of the basic healthcheck configuration there. if puppet deploys a thing, it should be an automatic consequence that the host also monitors that thing at least to some basic degree (and able to customize for deeper checks) [16:51:16] * volans in a meeting, reading backlog in a bit [16:51:52] mark: intriguing! how'd that work? [16:51:56] the edge-case is where you draw the line on e.g. pybal trusting the host's own status of itself. [16:52:32] the host may say it's all-ok on its local checks, but a ferm rule prevents pybal/LVS from actually being able to reach the service supposedly being offered on the network. [16:53:29] yeah for that we'd be hitting depool threshold [16:53:41] I'd tend to think you just validate basic service-port-reachable on top of querying the host-local state. but maybe that leaves a lot of other unexplored edge-cases where a local check of service on the remote-facing IP vs actual-remote-check results can differ, I donno. [16:55:39] for that matter, that reminds me that even today our pybal service healthchecks aren't actually testing the thing they should be heh [16:55:50] it's something I think about occasionally and then bury my head in the sand and forget [16:56:31] since we're speculating, the pybal healthcheck talks to the service port and e.g. on /healthcheck we're serving the health of whatever is relevant, dependent services etc [16:56:32] godog: haven't thought it through yet but seems possible i'd say? [16:57:55] mark: yup seems definitely possible [16:59:54] sorry I got interrupted a sec there in my rambling.. re "aren't actually testing the thing they should be": [17:00:30] we're checking from pybal that e.g. 
the text-lb service responds on cp1065.eqiad.wmnet's IP, but the actual forwarded traffic is to the LVS service IP, on cp1065's macaddr [17:00:55] so if the LVS service IP was missing in the local address set on cp1065, the healthcheck would succeed but forwarded traffic from LVS would fail [17:04:43] bblack: if you want there is always that small thing I wrote a while ago for testing the LVS vip [17:04:51] ;) [17:04:56] hehe [17:05:32] well what we really want, is for most of pybal's checks to have an option to use the VIP as the destination, which is tricky... [17:05:47] I'm not even sure, with the "right" code in pybal, if it can even be made to work in our current setup [17:06:23] in the example scenario, lvs1001 and cp1065 both have 208.80.154.224 configured on their loopback interfaces [17:06:35] that is what my small thing does, TCP and "very simple" HTTP checks on the VIP [17:06:59] and we want pybal to test over a TCP connection to cp1065, by explicitly connecting to 208.80.154.224 using cp1065.eqiad.wmnet's macaddr from ARP as the destination MAC at the ethernet level [17:07:02] could be used as an external check from pybal point of view and ofc as a secondary check [17:07:20] although it's IPv4 only... [17:07:46] was done as a holiday project during christmas/new year's eve :D [17:08:00] I just don't know if it's possible to do that from the LVS host, as it will be rather confusing to the LVS machine's network stack that the IP exists locally, too. [17:08:16] where's the tool? [17:08:59] * volans adds a bunch of disclaimers about his C code... [17:09:01] https://github.com/volans-/raw-socket-checkers [17:09:12] * volans not a C developer ;) [17:10:06] nice! [17:10:26] yeah with raw sockets you can [17:10:36] so basically, ideally we port the same concepts into pybal checks in python. 
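[The core of the raw-socket idea above is building frames where the ethernet destination is the realserver's MAC while the IP destination stays the shared VIP. A sketch of just the header construction in python; actually sending such a frame needs an AF_PACKET socket and root, as in volans' raw-socket-checkers, and a full check would also need a TCP header and a reply parser.]

```python
# Sketch only: build an Ethernet II + IPv4 header where the frame is
# addressed to the realserver's MAC (learned via ARP) but the IP
# destination is the VIP, mimicking what LVS-DR forwarding looks like
# on the wire. Sending it requires a raw AF_PACKET socket and root.
import struct

ETH_P_IP = 0x0800

def ipv4_checksum(header: bytes) -> int:
    """Standard ones'-complement sum over 16-bit big-endian words."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_frame(dst_mac, src_mac, src_ip, vip, payload=b""):
    """dst_mac/src_mac: 6-byte MACs; src_ip/vip: 4-byte packed IPv4."""
    eth = struct.pack("!6s6sH", dst_mac, src_mac, ETH_P_IP)
    ver_ihl, tos, ttl, proto = 0x45, 0, 64, 6  # IPv4, IHL=5, proto=TCP
    total_len = 20 + len(payload)
    ip = struct.pack("!BBHHHBBH4s4s", ver_ihl, tos, total_len,
                     0, 0, ttl, proto, 0, src_ip, vip)
    # fill in the header checksum (computed with the field zeroed)
    ip = ip[:10] + struct.pack("!H", ipv4_checksum(ip)) + ip[12:]
    return eth + ip + payload
```

[As noted in the discussion, running this from the LVS host itself is the tricky part, since the VIP also exists locally on the LVS machine's loopback.]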
[17:10:46] twisted may already have support for that [17:10:48] probably as an optional thing (since pybal could in theory be used with a non-DR service too) [17:10:49] there's "twisted pair" [17:10:59] for low level network stuff [17:11:02] (nice name that) [17:11:21] and we'd want it to work for all our checks, like IdleConnection and HTTP and blah blah [17:13:25] if you want a bit more details those are documented in the various checks, like https://github.com/volans-/raw-socket-checkers/blob/master/doc/check_tcp_raw.md [17:14:04] looks pretty good :) [17:14:39] can be used straightaway with RunCommand of course [17:39:40] bblack: to follow up on your question about git-ssh; I'll add the missing term in labs-in6. Do we need to have both git-ssh.eqiad and git-ssh.codfw in all terms phab-git-ssh; or only git-ssh.eqiad in eqiad and git-ssh.codfw in codfw ? [17:41:17] bblack: note that I'd recommend the 2nd option to keep the border-in rules generic [17:42:10] XioNoX: I think the 2nd option [17:42:42] bblack: actually, my bad, I meant the 1st option to keep them generic [17:44:53] as the border-in firewalls filters are shared between all the cr* [17:47:32] ah [17:47:38] well, that works too :) [17:47:59] (and probably better in the long run in case we end up advertising a range from somewhere unexpected to work around some issue) [17:49:25] indeed [18:08:36] 10Traffic, 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban), 10User-greg: Please create a phame blog for the Traffic team - https://phabricator.wikimedia.org/T180041#3745262 (10greg) 05Open>03Resolved a:03greg Created with some stub content (Title and sub-title). Please make them be... [18:10:12] bblack: you should be all set [18:11:17] XioNoX: thanks! 
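[Going back to the traffic-pool discussion from earlier: the staleness guard volans suggested (skip the auto-repool if the pool-once flag is older than ~1h, so a host returning from a weeks-long hardware repair doesn't repool itself) could look something like this. The path and 1-hour threshold are the ones from the conversation; `should_repool` itself is a hypothetical helper, and as noted it still assumes a sane system clock.]

```python
# Sketch of the pool-once staleness guard discussed above (hypothetical):
# only honor the flag file if it was touched recently, so a host that
# comes back long after the intended quick reboot does not auto-repool.
import os
import time

POOL_ONCE = "/var/lib/traffic-pool/pool-once"
MAX_AGE = 3600  # 1 hour, per the discussion; assumes a sane system clock

def should_repool(path=POOL_ONCE, max_age=MAX_AGE, now=None):
    """True only if the flag file exists and is fresher than max_age."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return False  # no flag file: never auto-repool
    now = time.time() if now is None else now
    return (now - mtime) <= max_age
```

[The ExecStart of the traffic-pool unit would then call this guard (and still remove the flag) instead of repooling unconditionally; the mainboard-replacement scenario shows why the clock assumption deserves its own check.]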
[18:47:11] 10Traffic, 10Operations, 10Patch-For-Review: rack/setup/install lvs400[567].ulsfo.wmnet - https://phabricator.wikimedia.org/T178436#3745560 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['lvs4005.ulsfo.wmnet', 'lvs4006.ulsfo.wmnet', 'lvs4007.uls... [18:53:32] 10Traffic, 10Operations, 10Pybal: Pybal should be able to advertise to multiple routers - https://phabricator.wikimedia.org/T180069#3745584 (10BBlack) [18:58:02] 10Traffic, 10Operations, 10ops-ulsfo: setup/deploy wmf741[56] - https://phabricator.wikimedia.org/T179204#3745604 (10BBlack) a:05BBlack>03RobH @RobH - the hostnames for these should be dns4001 + dns4002. We won't be running ganeti when we initially bring these into service, so should have standard no-vi... [18:59:09] 10Traffic, 10Operations: Investigate Chrony as a replacement for ISC ntpd - https://phabricator.wikimedia.org/T177742#3745606 (10BBlack) [19:01:20] 10Traffic, 10Operations, 10ops-ulsfo: setup/deploy wmf741[56] - https://phabricator.wikimedia.org/T179204#3745610 (10BBlack) @RobH - also, we should go stretch from the get-go on these as well (like bast4) [19:20:28] 10Traffic, 10Operations, 10Patch-For-Review: rack/setup/install lvs400[567].ulsfo.wmnet - https://phabricator.wikimedia.org/T178436#3745639 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs4005.ulsfo.wmnet', 'lvs4006.ulsfo.wmnet', 'lvs4007.ulsfo.wmnet'] ``` and were **ALL** successful. [21:01:36] 10Traffic, 10netops, 10Operations: High amount of unexpected ICMP dest unreachable toward esams cache clusters - https://phabricator.wikimedia.org/T167691#3745895 (10ayounsi) The `Destination unreachable (Host unreachable)` packets are most likely due to firewalls or middle boxes on the client side that have... 
[21:06:40] 10Traffic, 10Operations, 10ops-ulsfo: setup/deploy wmf721[56] - https://phabricator.wikimedia.org/T179204#3745899 (10RobH) [22:11:47] 10Traffic, 10Operations, 10ops-ulsfo: setup/deploy wmf721[56] - https://phabricator.wikimedia.org/T179204#3716775 (10RobH) [22:16:30] 10Traffic, 10Operations, 10ops-ulsfo: setup/deploy bast400[12]/wmf721[56] - https://phabricator.wikimedia.org/T179204#3746075 (10RobH)