[08:09:03] _joe_: morning, this is your friendly reminder to update nginx-full in conf* as requested on Friday ;)
[08:09:24] <_joe_> vgutierrez: give me 10 minutes
[08:09:28] sure :)
[08:13:04] <_joe_> ok, done
[08:13:09] <_joe_> so let's start in codfw
[08:15:19] <_joe_> vgutierrez: do you have a ticket for it?
[08:16:58] one sec
[08:17:15] is a requirement for T164456
[08:17:16] T164456: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456
[08:17:31] <_joe_> ok good
[08:18:26] <_joe_> doing conf2003
[08:18:39] \o/
[08:18:56] <_joe_> not doing it "safe" as I want us to check the failure modes of all our applications
[08:19:38] <_joe_> we're relying more and more on etcd, I want our systems to be loosely coupled as much as possible
[08:19:55] <_joe_> vgutierrez: you will need to check pybal once I'm done
[08:20:04] sure
[08:20:08] <_joe_> doing conf2002, this might page
[08:21:43] <_joe_> uhm we're doing the numactl command on the conf* servers too for nginx, meh
[08:22:26] regarding pybal, ema did an awesome job with etcd reconnections and monitoring on icinga
[08:24:31] <_joe_> yeah the solution to the issue was quite simple
[08:24:42] <_joe_> ok, codfw fully done
[08:24:43] hmmm you're not at conf2001
[08:24:44] <_joe_> now eqiad
[08:24:51] <_joe_> ?
[08:24:55] s/not/now/
[08:24:59] <_joe_> yes
[08:25:08] <_joe_> just finished from a few minutes
[08:25:16] I'm getting some noise from conf2001 on pybal logs
[08:25:50] <_joe_> try restarting pybal
[08:25:55] <_joe_> on one machine
[08:26:02] <_joe_> I'm pretty sure it's a pybal bug
[08:26:50] it's logging a lot of 400 errors
[08:28:14] <_joe_> vgutierrez: it's another pybal bug, I think we fixed the first occurrence of that
[08:28:21] lol
[08:28:40] <_joe_> well I should say it's a twisted bug actually
[08:28:45] <_joe_> as the other one fixed by ema was
[08:28:47] <_joe_> sigh
[08:29:02] <_joe_> btw, cp5006.eqsin.wmnet is unreachable according to pybal
[08:29:12] yep
[08:29:19] hey
[08:29:23] * mark reads pybal/kubernetes ticket
[08:29:27] that would be nice
[08:29:46] i went off early on friday afternoon as I had a headache and got blurry vision (
[08:29:47] :(
[08:35:02] ouch
[08:43:40] <_joe_> vgutierrez: we should go with eqiad, once pybals are ok
[08:43:52] so it looks like we got ourselves more work on pybal's etcd driver
[08:43:58] happy times
[08:44:43] <_joe_> vgutierrez: I think moving to treq or - even better - manage etcd with an external daemon that talks with pybal via IPC would be the way to go
[08:44:55] <_joe_> treq is the "requests for twisted"
[08:45:47] pybal is green again
[08:45:52] <_joe_> I started working on that but then just gave up for lack of time https://gerrit.wikimedia.org/r/#/c/365549/1/pybal/etcd.py
[08:45:57] let's go for eqiad
[08:46:03] <_joe_> ok
[08:46:14] <_joe_> we have an additional twist there
[08:46:19] <_joe_> mediawiki is using etcd
[08:46:24] I like the external (non-twisted) daemon idea
[08:47:04] <_joe_> ema: that would be either confd or (better) a simple wrapper written with python-etcd or the go client
[08:47:42] <_joe_> but this still leaves you with a problem, I don't think it's a magic bullet
[08:48:06] <_joe_> because then you have to make any IPC system sustain the other service restarts
[08:49:01] <_joe_> vgutierrez: starting with conf1003
[08:49:23] ack
[08:50:47] why the extra complexity of an additional daemon instead of just fixing the bug(s)?
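For context on the idea _joe_ floats above (a small non-twisted wrapper that watches etcd and feeds pybal over IPC), here is a rough, hypothetical sketch using plain blocking long-polls against the etcd v2 HTTP API. The host, the pool key and the error case follow the URLs and responses that appear later in this log; the stdout "IPC" and the resync-on-error strategy are assumptions, not a description of any existing daemon.

```python
# Hypothetical sketch of the external (non-twisted) etcd watcher idea:
# long-poll the etcd v2 API and forward changes to pybal over some IPC
# channel (plain stdout here as a stand-in). Not an existing daemon.
import json
import time

import requests

ETCD_BASE = "https://conf1003.eqiad.wmnet:2379"
POOL_KEY = "/v2/keys/conftool/v1/pools/esams/cache_text/nginx/"  # example pool from this log

def watch_forever():
    wait_index = None
    while True:
        params = {"recursive": "true"}
        if wait_index is not None:
            params.update({"wait": "true", "waitIndex": str(wait_index)})
        try:
            r = requests.get(ETCD_BASE + POOL_KEY, params=params, timeout=90)
        except requests.Timeout:
            continue                   # nothing changed within the timeout, retry
        except requests.RequestException:
            time.sleep(1)
            wait_index = None          # connection problem: resync from a fresh snapshot
            continue
        if r.status_code != 200:
            # e.g. HTTP 400 with etcd errorCode 401, "The event in requested
            # index is outdated and cleared": drop the stale waitIndex.
            wait_index = None
            continue
        body = r.json()
        if wait_index is None:
            # initial snapshot: start watching from the current etcd index
            wait_index = int(r.headers.get("X-Etcd-Index", 0)) + 1
        else:
            # watch response: resume right after the event we just received
            wait_index = body["node"]["modifiedIndex"] + 1
        print(json.dumps(body))        # stand-in for real IPC towards pybal

if __name__ == "__main__":
    watch_forever()
```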
[08:51:38] pybal failing on eqiad and esams right now
[08:51:52] <_joe_> vgutierrez: I restarted nginx on conf1003
[08:51:55] ack
[08:51:55] so what is the issue? do we have a ticket?
[08:52:11] I'm going to restart pybal on esams
[08:52:26] <_joe_> mark: because that will allow us to use the bug-free non-twisted libraries to connect to etcd, for example
[08:52:37] <_joe_> vgutierrez: let's wait for the dust to settle?
[08:52:46] "bug-free non-twisted libraries"?
[08:52:51] <_joe_> we will break more things with the rest of the restart
[08:53:41] <_joe_> doing 1002 now
[08:53:51] _joe_: esams pybal in configured to fetch data from conf1003
[08:53:55] and eqiad from conf1001
[08:53:58] <_joe_> ok
[08:54:01] <_joe_> ack
[08:54:03] mark: I think this is yet another previously unknown issue we're dealing with today
[08:54:17] yup.. ema put some efforts on it
[08:54:17] <_joe_> ema: yes, the 400 bad request upon reconnection
[08:54:17] so what is the symptom? nginx restart and pybal doesn't reconnect?
[08:54:31] pybal tries to reconnect
[08:54:38] I was seeing a lot of 400 errors
[08:54:49] so somehow it's sending bad http requests
[08:55:12] apparently that's the case
[08:55:36] <_joe_> check the nginx logs, they might tell you more
[08:55:52] <_joe_> doing 1002 now, it might make etcdmirror page
[08:56:09] <_joe_> but I want to test what happens, again
[08:56:10] yup.. I'll hit the logs and re(open)/update the phab task
[08:56:34] T169765
[08:56:34] T169765: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765
[08:56:37] mark: an external non-twisted daemon would be considerably *less* complex to work with imho, I don't think it would add complexity at all
[08:57:37] especially considering how much time we've spent debugging the twisted etcd driver already (and yet it's unstable to say the least)
[08:58:03] but at least ema's work monitoring pybal<-->etcd connections it's working, so even with pybal log being quiet we know that some stuff is broken :)
[08:58:26] (quiet after throwing a lot of 400 errors on the etcd driver)
[08:59:03] <_joe_> ok, vgutierrez I'm going to restart 1001
[08:59:10] and can't we use any etcd client libs in pybal itself?
[08:59:20] non-twisted
[08:59:22] _joe_: hit it :D
[08:59:33] <_joe_> mark: those are blocking, so you need to do DeferToThread, and that has some issues
[08:59:42] <_joe_> I'll elaborate once I'm done with the operations
[08:59:47] or asyncio soon maybe
[08:59:50] sure
[09:01:22] <_joe_> vgutierrez: uhm 1001 has some issues
[09:01:31] <_joe_> ok it finally restarted nginx
[09:01:43] yup.. I saw that
[09:01:50] pybal log looks way better on eqiad cluster
[09:02:10] we got some connections refused and after that reconnected as expected
[09:02:50] I'm going to reschedule the icinga check on lvs100*
[09:02:54] <_joe_> vgutierrez: I'm done
[09:02:58] but it looks good
[09:03:08] some how the slow restart played good for pybal
[09:03:59] <_joe_> looks like it was the case :D
[09:04:07] yup
[09:04:23] <_joe_> ok, taking a brief break now as I've got to start the rolling restart of memcacheds afterwards
[09:04:48] hmmm or not
[09:04:55] CRITICAL: 32 connections established with conf1001.eqiad.wmnet:2379 (min=40)
[09:05:59] use httpry to see what it's doing
[09:07:34] ah https
[09:07:51] damn security guys!
:P
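The CRITICAL above comes from the pybal<-->etcd connection monitoring ema added. The actual Icinga check is not shown in this log; a minimal sketch of the idea (count ESTABLISHED connections towards the configured etcd endpoint and alert below a threshold) might look like the following. Host, port and threshold are taken from the alert text; everything else is an assumption.

```python
#!/usr/bin/env python
# Hypothetical sketch of an Icinga-style check: count ESTABLISHED TCP
# connections towards an etcd endpoint and go CRITICAL below a minimum.
# (On Linux, seeing other processes' sockets via psutil may require root.)
import socket
import sys

import psutil

ETCD_HOST = "conf1001.eqiad.wmnet"   # from the alert above
ETCD_PORT = 2379
MIN_CONNS = 40

def count_established(host, port):
    addrs = {ai[4][0] for ai in socket.getaddrinfo(host, port)}
    return sum(
        1
        for c in psutil.net_connections(kind="tcp")
        if c.status == psutil.CONN_ESTABLISHED
        and c.raddr
        and c.raddr.port == port
        and c.raddr.ip in addrs
    )

if __name__ == "__main__":
    n = count_established(ETCD_HOST, ETCD_PORT)
    if n < MIN_CONNS:
        print("CRITICAL: %d connections established with %s:%d (min=%d)"
              % (n, ETCD_HOST, ETCD_PORT, MIN_CONNS))
        sys.exit(2)
    print("OK: %d connections established with %s:%d" % (n, ETCD_HOST, ETCD_PORT))
    sys.exit(0)
```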
[09:09:19] so on eqiad only lvs1006 was partially affected
[09:11:38] Traffic, Operations: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456#3234359 (Vgutierrez) Current picture: ```vgutierrez@neodymium:~$ sudo cumin 'R:class = "tlsproxy::instance"' 'apt-cache policy nginx-full|egrep "Installed:|Candidate:"' 366 hosts will be targeted: conf[2001-2003].codf...
[09:12:47] <_joe_> well if the request makes it to the backend, you can watch port 2378 with httpry
[09:13:53] vgutierrez: pro-tip C:tlsproxy::instance == R:class = "tlsproxy::instance" ;)
[09:14:08] OMG :)
[09:14:39] volans: I've copy&pasted bblack and we know that bblack is always right.. :*
[09:15:00] rotfl!
[09:15:29] I didn't say is wrong, just there is a shorter version ;)
[09:15:36] hmmm so pybal etcd throws a several 400 errors till it throws an "Unhandled error"
[09:15:39] Apr 23 08:25:36 lvs2004 pybal[10076]: Unhandled Error
[09:15:43] and then it goes silent
[09:15:47] till you restart it
[09:18:08] 10.20.0.11 - - [23/Apr/2018:08:53:15 +0000] "GET /v2/keys/conftool/v1/pools/esams/cache_text/nginx/?waitIndex=161172&recursive=true&wait=true HTTP/1.0" 400 163 "-" "PyBal/1.6"
[09:18:12] 10.20.0.11 - - [23/Apr/2018:08:53:20 +0000] "GET /v2/keys/conftool/v1/pools/esams/cache_text/varnish-fe/?recursive=true HTTP/1.0" 200 1481 "-" "PyBal/1.6"
[09:18:21] 400 VS 200 request on conf1003
[09:19:14] 10.20.0.11 - - [23/Apr/2018:08:53:20 +0000] "GET /v2/keys/conftool/v1/pools/esams/cache_text/nginx/?recursive=true HTTP/1.0" 200 1436 "-" "PyBal/1.6"
[09:19:34] that one even better... querying the same pool that on the 400 sample
[09:20:48] it seems that pybal's etcd reimplements half an HTTP client
[09:21:26] is the etcd protocol so unusual that that's really necessary?
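mark's question is about pybal's hand-rolled HTTP handling. For comparison, here is a rough sketch (assumed, not pybal code) of what the watch request from the access-log lines above could look like with treq, the "requests for twisted" client _joe_ mentioned earlier; URL, key and waitIndex are taken from those log lines.

```python
# Hypothetical sketch (not pybal code): the same etcd v2 watch request the
# access log shows above, issued with treq, Twisted's higher-level HTTP client.
import treq
from twisted.internet import task
from twisted.internet.defer import inlineCallbacks

URL = ("https://conf1003.eqiad.wmnet:2379"
       "/v2/keys/conftool/v1/pools/esams/cache_text/nginx/")

@inlineCallbacks
def watch_once(reactor, wait_index):
    params = {"recursive": "true", "wait": "true", "waitIndex": str(wait_index)}
    response = yield treq.get(URL, params=params)
    body = yield treq.json_content(response)
    if response.code == 400:
        # etcd answers 400 when the requested waitIndex has already been
        # compacted away; the caller should resync instead of retrying as-is.
        print("watch failed:", body)
    else:
        print("event:", body)

if __name__ == "__main__":
    task.react(watch_once, [161172])   # the stale index from the log lines above
```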
[09:23:18] so... fetching and outdated index results in an HTTP 400 response
[09:23:33] curl -0 -v "https://conf1003.eqiad.wmnet:2379/v2/keys/conftool/v1/pools/esams/cache_text/nginx/?waitIndex=161172&recursive=true&wait=true"
[09:23:40] results in
[09:23:41] HTTP/1.1 400 Bad Request
[09:23:48] and the following body
[09:23:50] {"errorCode":401,"message":"The event in requested index is outdated and cleared","cause":"the requested history has been cleared [171395/161172]","index":172394}
[09:28:07] maybe we should rewrite that code to use more of twisted's higher level http client classes
[09:28:15] (or treq)
[09:28:48] maybe, but that's nothing to do with the current issue
[09:29:09] on the HTTP side of things everything looks ok
[09:29:12] <_joe_> vgutierrez: that means for some reasons it's requesting an index that's too old, I think I fixed that bug
[09:29:33] <_joe_> vgutierrez: the problem is we should reset the waitIndex when losing connection
[09:29:43] right
[09:29:53] <_joe_> I thought I did that in the past
[09:30:00] but now ema added reconnect code
[09:30:08] so not necessarily all state is reinitialized
[09:30:46] here's the reconnect patch for reference https://github.com/wikimedia/PyBal/commit/c917f59f4c37c7b89f15d0fd8b375de028d4a90b
[09:30:47] the problem is what we consider a clean close
[09:30:51] Apr 23 08:53:14 lvs3001 pybal[23676]: [config-etcd] ERROR: failed: [Failure instance: Traceback (failure with no frames): : 400 Bad Request
[09:30:55] Apr 23 08:53:14 lvs3001 pybal[23676]: ]
[09:30:57] Apr 23 08:53:14 lvs3001 pybal[23676]: [config-etcd] INFO: client connection closed cleanly
[09:31:19] and check
[09:31:21] https://github.com/wikimedia/PyBal/blob/master/pybal/etcd.py#L152-L155
[09:31:46] mainly, line 155
[09:32:06] 4 spaces less and it's fixed O:)
[09:32:20] <_joe_> vgutierrez: right, we should always reset the wait index
[09:32:27] <_joe_> because it works like this:
[09:32:39] <_joe_> pybal gets an update at waitIndex N
[09:32:49] <_joe_> it starts watching from waitIndex N+1
[09:33:17] <_joe_> when it loses connection (after a long time watching) for any reason, the event at waitIndex N+1 can be gone from the current log
[09:33:36] <_joe_> so etcd is not able to rely all the events since then and throws a 400
[09:34:03] could also reset it on 4xx?
[09:34:18] <_joe_> just on 400, yes it shold
[09:34:22] <_joe_> *should
[09:34:42] 401 according to vgutierrez above
[09:34:58] <_joe_> 401 is the error code
[09:34:58] note that 401 is not the http status code, rather the internal etcd error
[09:35:02] <_joe_> not the http status code
[09:35:04] ok
[09:35:08] that isn't confusing at all
[09:35:14] not at all :)
[09:35:16] <_joe_> no :P
[09:35:20] * vgutierrez confused
[09:35:33] so as mark suggests
[09:35:35] https://github.com/wikimedia/PyBal/blob/master/pybal/etcd.py#L188-L189
[09:35:37] EcodeEventIndexCleared = 401
[09:35:57] checking the reason is a 400 Bad Request and clearing waitIndex there should do the trick
[09:36:04] * vgutierrez submitting patch
[09:36:11] <_joe_> vgutierrez: or looking at the content of the error response
[09:36:20] <_joe_> and looking for error code 401
[09:36:24] <_joe_> as python-etcd does
[09:36:29] ack
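For illustration, a rough sketch of the handling just agreed on (reset the waitIndex only when the 400 carries etcd error code 401, as python-etcd does). This is not the actual pybal patch vgutierrez is submitting, just the logic as discussed above; the `state` object is a hypothetical stand-in for the driver's state.

```python
# Hypothetical sketch of the error handling discussed above, not the actual
# pybal patch: only a 400 whose body carries etcd errorCode 401
# (EcodeEventIndexCleared) should reset the watcher's waitIndex.
import json

ECODE_EVENT_INDEX_CLEARED = 401   # etcd internal error code, not an HTTP status

def handle_watch_failure(http_status, body_bytes, state):
    """Decide how to recover from a failed etcd watch request.

    `state` is assumed to be an object with a `waitIndex` attribute,
    standing in for pybal's etcd driver state.
    """
    if http_status != 400:
        return                      # unrelated failure, reconnect as usual
    try:
        error = json.loads(body_bytes)
    except ValueError:
        return
    if error.get("errorCode") == ECODE_EVENT_INDEX_CLEARED:
        # "The event in requested index is outdated and cleared":
        # drop the stale index so the next request re-reads the key
        # instead of watching from a point etcd no longer remembers.
        state.waitIndex = None
```

Fed the JSON body from the curl output above, this would clear the stale index 161172 and let the next read start from a fresh snapshot.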
[09:37:51] <_joe_> mark: circling back to "why not use python-etcd in pybal" - I tried to do that with etcdmirror and it has some serious downsides. You have to use deferred.deferToThread if you don't want to block, but twisted then cannot in any way kill that thread if it needs to stop, so when you want to restart pybal you're stuck waiting for that thread to terminate
[09:38:16] ok
[09:38:29] <_joe_> that's ok with etcdmirror, but would be a disaster for pybal
[09:38:57] <_joe_> let me rephrase: I couldn't find a way to make twisted kill that thread
[09:39:01] <_joe_> maybe there is one
[09:39:02] well I think we can make this code reliable anyway, it doesn't look that complicated
[09:39:22] why are you not using one of the higher level http client classes?
[09:39:54] <_joe_> mark: no specific reason I know of
[09:40:04] <_joe_> that's just how or.i implemented that
[09:40:14] oh thought you wrote it, sorry
[09:40:19] <_joe_> nope
[09:40:49] <_joe_> I've just been one of the people who fixed bugs from time to time :)
[09:40:54] hehe
[09:41:14] pybal unit test coverage just went full-green
[09:41:19] on coveralls
[09:41:27] and i have more stuff to merge
[09:41:31] (but no time today ;)
[09:42:08] approaching 100% coverage will help with python3 conversion
[09:42:23] mark: cool :D
[09:52:41] <_joe_> btw, we have to do an upgrade to etcd3 soon(TM)
[10:23:24] vgutierrez: https://gerrit.wikimedia.org/r/#/c/424007/ could use your review at some point
[10:25:36] yup
[10:25:45] give me one sec, i'll submit the etcd patch and I'll review that
[10:25:56] sure, no rush
[10:26:19] as _joe_ suggested I mimicked python-etcd behavior
[10:26:41] so waitIndex gets reset if we get a HTTP 400 with an 401 etcd error code
[10:26:41] that seems wise
[10:40:07] vgutierrez: you might want to take a look at other possible etcd errors
[10:40:53] https://github.com/coreos/etcd/blob/master/etcdserver/v2error/error.go#L59-L61 (400: EcodeWatcherCleared, maybe?)
[10:49:06] 401 behavior on the client side is documented here: https://coreos.com/etcd/docs/latest/v2/api.html
[10:49:18] but I cannot find anything regarding 400
[10:49:24] but I think you're right ema
[10:49:41] let's see if _joe_ can give us some input
[10:50:00] EcodeWatcherCleared | 400 | "watcher is cleared due to etcd recovery"
[10:50:53] ^ found on https://github.com/coreos/etcd/blob/master/Documentation/v2/errorcode.md
[10:51:01] I meant I'm not able to find anything on how the client should react to a EcodeWatcherCleared error
[11:20:29] bblack: minor vmod-netmapper spring cleaning: https://gerrit.wikimedia.org/r/#/c/428301/. While releasing the new version we should also s/unstable/jessie-wikimedia/ to make reprepro happy
[12:13:01] heh I forgot to bump the version number of course!
[12:27:02] sometimes that helps :)
[12:28:09] ema: what's the intent with the patch w/ 400 ?
[12:28:27] it seems like it will break a bunch of undesirable-but-working cases...
[12:28:47] (and wikipedia.com isn't special, it's just the most-popular of a bunch of undesirable cases)
[12:28:51] mmh
[12:29:43] so the context is the RSA header discussion we had here on Friday
[12:30:22] well, the right answer to the RSA thing is "wrap all of the X-CP related output code in an: if (X-Connection-Properties)" so that they're not set for HTTP reqs
[12:30:49] (we only allow X-C-P to be set by nginx, not outsiders (even internal outsiders can't))
[12:31:42] if we want to make varnishlog parsing easier, we could also set some explicit header that indicates no-HTTPS, since it's hard to check for negatives
[12:32:17] "X-CP-Proto: http|https", which is the only X-CP- that's set with X-C-P is missing?
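Returning to _joe_'s point at 09:37 about why python-etcd can't simply be dropped into pybal: a minimal illustration (assumed, not taken from pybal or etcdmirror) of the deferToThread pattern and of why a blocked worker thread holds up shutdown, since the reactor joins its thread pool before exiting.

```python
# Minimal illustration (not pybal/etcdmirror code) of the deferToThread
# problem described above: the blocking call runs in the reactor's thread
# pool, Twisted has no way to kill it, so shutdown waits for it to return.
import time

from twisted.internet import reactor, threads

def blocking_etcd_watch():
    # Stand-in for a blocking python-etcd read(..., wait=True) call that
    # only returns when etcd sends an event (or the request times out).
    time.sleep(300)
    return "event"

def main():
    d = threads.deferToThread(blocking_etcd_watch)
    d.addCallback(lambda event: print("got", event))
    # Ask the reactor to stop after 1 second: the process will still not
    # exit until the worker thread above returns, which is exactly the
    # "stuck waiting on restart" behaviour described in the discussion.
    reactor.callLater(1, reactor.stop)
    reactor.run()

if __name__ == "__main__":
    main()
```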
[13:29:31] netops, Operations: Update BGP_sanitize_in filter - https://phabricator.wikimedia.org/T190317#4149725 (faidon) I took a careful look at this -- it looks pretty good, but I'd suggest rolling it out slowly in phases just to be on the safe side. That could be separate phases for either the three different t...
[13:30:52] netops, Operations: Implement BGP graceful shutdown - https://phabricator.wikimedia.org/T190323#4149726 (faidon) Easy enough, +1 :) Maybe Add a /* comment */ linking to the NLNOG filter guide?
[14:24:07] so, we're due to order some new cache hardware for eqiad
[14:24:30] there's an easy and known path, where we keep ordering what we ordered last for eqsin+ulsfo, just with a slightly-larger machine count
[14:24:57] (16 total hosts for 8-per-cluster for text/upload, instead of 2x6=12, because it's a 4-row core DC)
[14:25:44] it's not unreasonable to just go down that path for now, and make time to rethink things later when we have time. either way we'll have N years of mixing this type of hw config with whatever comes next.
[14:27:17] but I've already started having thoughts on "whatever comes next". Chief among them is the obvious: the more RAM we can get the better (we've moved up to 384G so far, even more would be even better), and of course faster disks are better too (e.g. move from nice SSDs to even-nicer NVMEs or Optane or whatever makes sense to get large volumes of faster-than-plain-SSD storage for the post-chashing
[14:27:23] "disk" storage)
[14:27:41] but another key consideration: it would be really nice to eliminate NUMA from the equasion.
[14:28:21] (it makes all kinds of goofy hard-to-predict issues driven by NUMA go away at the kernel+hw+perf level)
[14:29:23] there's a kind of conflict between that and the above: no NUMA means a server with just 1x CPU sockets (instead of 2x like we have today). And most standard servers from vendors don't offer a ton of DIMM slots on a single socket, so moving towards single-socket tends to also limit max possible RAM.
[14:31:33] for Xeon-based machines, Dell has this problem (and I'd really like to stick with Dell at least for now if at all possible). Our closest thing would be to get the 2U servers we have now and yank out the second CPU and limit them to 256GB, which doesn't sound like a great solution.
[14:31:46] s/2U/1U, 2-socket/
[14:32:41] thanks for working on that
[14:33:04] what Dell does have that's tempting to experiment on, is some AMD options
[14:33:39] using the new Zen/Epyc SP3 CPUs, and they have some single-socket servers with those that can go up to 512GB of RDIMM or 1TB of LRDIMM on memory.
[14:33:54] they also have a ton of NVMe slots on those machines, you can put like 8+ NVMe drives in them
[14:34:16] shiny
[14:34:45] as I continue thinking out-loud though, I don't think we have time in this quarterly cycle to test such a machine and then decide on our order. and jumping to that without doing a test run of the hardware seems risky.
[14:35:30] maybe we look at going forward with our current known-decent config for eqiad refresh
[14:36:00] but maybe we should try to get a test box in soon as well (and then also look at how the ATS stuff is going on that test box too, since for a long-range hw changeup like this, that's what will eventually be running on it)
[14:36:10] especially not since we need to get this ordered in the next few weeks to hit our deadline
[15:55:17] <_joe_> I just noticed PyBal has no logo
[15:55:28] <_joe_> so I can't add a shiny logo to our presentation at kubecon
[15:56:10] maybe something with a python with multiple heads? or multiple pythons? that somehow looks loadbalancy :)
[15:56:38] gently reminder: I'll reimage lvs5001 && lvs5002 to stretch after the weekly meeting, coinciding with the less traffic hours in eqsin
[16:00:40] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4150330 (BBlack) Are we good for handoff to #Traffic for OS-level install/config now on lvs1016?
[16:38:10] hmmm I didn't do anything with lvs1016 cause I thought it still was on cmjohnson1's side
[16:38:17] let me know if I misunderstood that :)
[16:45:01] not sure yet, we'll find out
[16:47:32] ack
[16:47:46] let's give some love to eqsin loadbalancers on the meanwhile
[16:48:02] after that I'll stall the issue till atop thingie is cleared out btw
[16:52:48] ok
[16:53:17] fwiw, I haven't seen evidence that -R is affecting them (which makes sense, they don't run tons of heavy productions processes with big memory usage)
[16:53:29] it might be different in stretch cp-server case, which we haven't tried yet.
[16:54:18] yep.. I've been monitoring that and I didn't see affectation.. but better safe than sorry regarding lvs instances :)
[17:22:34] Traffic, Operations, Pybal, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4150753 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs5002.eqsin.wmnet ``` The log can be found in `/var/lo...
[18:21:52] Traffic, Operations, Pybal, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4150935 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs5002.eqsin.wmnet'] ``` and were **ALL** successful.
[18:35:46] Traffic, Operations, Pybal, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4150992 (Vgutierrez)
[18:50:51] Traffic, Operations, Pybal, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4151032 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs5001.eqsin.wmnet ``` The log can be found in `/var/lo...
[18:57:08] Traffic, Operations, TemplateStyles, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4151056 (LGoto)
[19:43:20] Traffic, Operations, Pybal, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4151265 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs5001.eqsin.wmnet'] ``` and were **ALL** successful.
[19:51:18] Traffic, Operations, Pybal, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4151310 (Vgutierrez)
[22:17:48] netops, Operations: Enabling graceful-switchover causes core dumps on cr1-codfw - https://phabricator.wikimedia.org/T191371#4151763 (ayounsi) Relevant KB entry: https://kb.juniper.net/InfoCenter/index?page=content&id=KB26616 JTAC's opinion on why it's working on some routers is that we're being "lucky"....
[22:59:42] netops, Operations: asw1-eqsin vcp port flapping - https://phabricator.wikimedia.org/T192125#4151846 (ayounsi) Disabled with: `ayounsi@asw1-eqsin# run request virtual-chassis vc-port set interface member 1 vcp-255/0/25 disable` Should be re-enabled with: `# run request virtual-chassis vc-port set interf...