[11:40:04] hashar: the Jenkins regression I mentioned recently (which breaks Jenkins if running on a Java with a four-digit version scheme): https://issues.jenkins.io/browse/JENKINS-64212 is said to be fixed in the latest 2.263.1 LTS;
[11:40:36] from https://issues.jenkins.io/browse/JENKINS-64212: Mark Waite added a comment - 3 days ago: "Included in weekly release Jenkins 2.268. Will be included in LTS release Jenkins 2.263.1."
[12:56:51] moritzm: great, I guess we can catch up with the Jenkins LTS version now :]
[12:57:04] 2.263.1 got released today, I was waiting for that LTS to get out to upgrade ours
[12:57:18] cause last time we ended up upgrading to some weekly release
[12:57:22] shall I port 2.263.1 to apt.wikimedia.org?
[12:57:30] shall I import 2.263.1 to apt.wikimedia.org?
[12:58:33] moritzm: yes please, and there are the magic reprepro commands at https://wikitech.wikimedia.org/wiki/Jenkins#Get_the_package
[12:58:56] using the magic -C thirdparty/ci --restrict=jenkins options
[13:00:02] k, doing that in a bit
[13:01:23] the alternative would be to use a thirdparty/jenkins component, but I can't remember why we are not doing that ;]
[13:33:24] hello people, I am working with John (DCOps) to move servers between racks to free 10G spots, and the next in line would be conf1005 and conf1006
[13:33:37] (I already pinged Alex)
[13:33:57] the zookeeper part is easy, but I have some doubts about the etcd part
[13:34:09] more specifically
[13:34:10] profile::etcd::replication::dst_url: https://conf1005.eqiad.wmnet:2379
[13:34:19] profile::pybal::config_host: conf1006.eqiad.wmnet
[13:35:08] elukey: just a different rack or also a different row (is any IP changing?)
[13:37:21] nono same rack
[13:37:24] err row
[13:38:17] see also https://wikitech.wikimedia.org/wiki/Service_restarts#etcd
[13:38:20] so profile::etcd::replication::active: true is set only for conf2002, so I guess conf1005 should be fine
[13:38:34] thanks :)
[13:38:40] ahahhah I wrote it
[13:38:45] wrotfl
[13:38:55] LMWTFY
[13:39:03] I always say that myself from the past was wiser
[13:40:24] o/
[13:40:28] o/
[13:40:47] ok, so pybal will continue working if the node it is contacting is down
[13:40:58] but there is a catch
[13:41:08] if it is restarted we would be in for a surprise
[13:41:28] do you mean while conf1006 is down?
[13:41:34] yes
[13:42:07] so, depending on the amount of time the node is going to be down for, we either a) just do it, or b) switch pybal to say conf1004
[13:42:42] the nodes are identical in nature, it should be just fine (FLW)
[13:43:16] it should be around 10 mins probably
[13:43:31] maybe we can switch pybal to 1004 just to be safe
[13:43:49] you know what, let's just switch to 1004
[13:43:51] for conf1005 we should be ok, it is not replicating (2002 is from 1004)
[13:43:55] yeah
[13:44:00] I anyway need to bundle in an LVS removal
[13:44:09] so now I got two reasons to restart pybal for
[13:44:16] when is this due?
[13:44:24] due for*
[13:44:36] elukey: is the bit about the mirroring still valid, in the sense that 1005 is special?
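(For reference, the "magic reprepro commands" mentioned at 12:58 amount to an update run restricted to the jenkins source package in the thirdparty/ci component. A minimal sketch, assuming it is run as root on the apt.wikimedia.org host and that the target distribution is buster-wikimedia — the component and the -C/--restrict options are quoted from the conversation, the rest is assumption; the authoritative steps live on the linked wikitech page.)

    # dry run: show what would be pulled in from the upstream Jenkins repo
    reprepro -C thirdparty/ci --restrict=jenkins checkupdate buster-wikimedia
    # actually import the new jenkins package into the repo
    reprepro -C thirdparty/ci --restrict=jenkins update buster-wikimedia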
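(The verification done in the following lines boils down to a couple of shell checks. A minimal sketch, assuming shell access to the conf hosts; the etcdmirror unit name, the netstat invocation and port 4001 come from the conversation itself, while the systemctl call is just an illustrative way to inspect the same unit.)

    # on any etcd member: confirm the cluster is healthy before taking a node down
    etcdctl cluster-health
    # on conf2002: check the mirror unit and which eqiad conf host it is attached to
    systemctl status etcdmirror-conftool-eqiad-wmnet.service
    sudo netstat -tuap | grep conf1004    # expect an ESTABLISHED connection to conf1004:4001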
[13:44:41] John it is in the DC now, we can skip entirely or say only do conf1005
[13:45:11] volans: so in theory profile::etcd::replication::active: false prevents the etcd mirror from running (codfw -> eqiad)
[13:45:47] ah no wait okok I got it
[13:45:52] it could be a target for 2002
[13:46:10] etcdmirror-conftool-eqiad-wmnet.service on 2002 points to conf1004
[13:48:32] ah akosiaris conf1004 is also profile::pybal::config_host: conf1004.eqiad.wmnet, no problem I think but just mentioning
[13:49:04] volans: so 1005 doesn't run a mirror and it is not a target for one afaics
[13:49:13] great, thanks for checking
[13:49:34] elukey@conf2002:~$ sudo netstat -tuap | grep conf1004
[13:49:35] tcp6 0 0 conf2002.codfw.wm:52728 conf1004.eqiad.wmn:4001 ESTABLISHED 7141/python
[13:49:40] just to double check
[13:50:07] (and 7141 is the etcd mirror)
[13:50:24] so to summarize:
[13:50:30] - conf1005 seems doable anytime
[13:50:48] - conf1006 needs some work beforehand (to move pybal to another target)
[13:51:05] akosiaris: --^ is it a fair statement?
[13:51:40] ah pybal in esams is conf1006 and pybal in eqiad is conf1004
[13:51:49] yes
[13:52:10] the SRV records don't need any change right?
[13:52:14] yeah conf1005 seems doable anyway
[13:52:17] no they don't
[13:57:04] ok so I can ping John and do it later on (conf1005)
[14:16:14] as FYI in ~15/20 mins I'll start the procedure to shut down conf1005
[14:17:12] (from b4 to b3)
[14:18:29] 1005 is the zookeeper leader but it will not be an issue
[14:18:50] and etcdctl cluster-health looks good
[14:33:43] akosiaris: starting to shut down conf1005 as FYI
[14:34:36] zookeeper down (new leader elected), etcd down
[14:35:41] and etcd cluster health sees 1004 and 1006 as up
[14:36:26] for the moment all good
[14:37:30] ok
[15:06:14] sigh too many channels
[15:06:28] elukey: I got some confctl failures
[15:06:35] given the other too are happy, is that normal ?
[15:06:38] two*
[15:09:16] effie: we moved conf1005 to another rack, it should be ok now (all good for both etcd and zookeeper)
[15:10:14] elukey: I got the errors 10' ago and the cluster looked healthy 40' ago
[15:10:24] that's why I am wondering
[15:10:28] ok I am running again
[15:10:34] effie: can you give me more details ? :)
[15:10:45] no
[15:10:46] :p
[15:11:00] so I run cumin on the parsoid servers in eqiad
[15:11:04] ahhahaha okok
[15:11:08] depool ; something ; pool
[15:11:36] so the first 2 got stuck on pool
[15:11:54] I got
[15:11:55] 502 Bad Gateway
[15:12:16] thinks seem to work now, you think it is worth digging a bit more?
[15:12:25] I can have a look at the cumin logs
[15:12:30] things*
[15:13:23] nah not worth it
[15:17:27] all right lemme know if anything is weird :)
[15:17:38] thank you
[16:16:53] just in case puppet wasn't ...fun... enough i just found that someone created a haskell implementation http://lpuppet.banquise.net/blog/2012/07/04/the-puppet-resources-application/
[16:18:51] *phew*, makes you appreciate Ruby even :-)
[16:19:21] :) yep
[16:25:17] next step: port the puppet syntax to yaml
[16:47:46] jbond42: oh I remember looking at that back in the day
[16:49:05] paravoid: no envy here ;)
[20:30:58] added new docs for "how to get a simple LAMP stack on a cloud VPS" (not related to the prod appservers or beta; for volunteers and others who used the old "simplelamp" class to just get a quick LAMP setup, but that got deprecated): https://wikitech.wikimedia.org/wiki/Help:Cloud_VPS_project#How_to_automatically_setup_a_simple_LAMP_server_on_a_cloud_VPS_instance
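(For anyone who just wants the equivalent of the old "simplelamp" setup by hand: a minimal sketch for a Debian-based Cloud VPS instance using the stock Debian packages — an assumption-laden shortcut, not the puppet-based procedure the wikitech page above describes.)

    sudo apt-get update
    sudo apt-get install -y apache2 mariadb-server php libapache2-mod-php php-mysql
    sudo systemctl enable --now apache2 mariadb
    # quick sanity check: should render the PHP info page at http://<instance>/info.php
    echo '<?php phpinfo();' | sudo tee /var/www/html/info.php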